[jira] [Commented] (HDFS-11741) Long running balancer may fail due to expired DataEncryptionKey

Wei-Chiu Chuang (JIRA) Wed, 03 May 2017 11:00:31 -0700

    [ https://issues.apache.org/jira/browse/HDFS-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995341#comment-15995341 ]


Wei-Chiu Chuang commented on HDFS-11741:
----------------------------------------

Hi Andrew, thanks for the review!

I just realized that a client-side BlockTokenSecretManager computes the 
DataEncryptionKey expiration time as now + token lifetime. I am not sure 
that's intended, as I would have assumed the key expiration time equals the 
current BlockKey expiration time (which is determined by the NameNode).

So it is entirely possible that the balancer holds an unexpired 
DataEncryptionKey that corresponds to an expired BlockKey. When it talks to 
the other side, the connection would fail because of the expired BlockKey. 
Therefore my rev 01 patch would fix all the problems because of this mismatch.
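
To make the mismatch concrete, here is a minimal sketch of the arithmetic. 
It is not the actual BlockTokenSecretManager code: the class name, method 
names, and the timestamps are hypothetical and chosen only for illustration.

```java
// Hypothetical illustration only; not taken from the HDFS source.
public final class KeyExpiryMismatch {
    private KeyExpiryMismatch() {}

    /** Client-side rule described above: DEK expiry = now + token lifetime. */
    public static long clientSideDekExpiry(long nowMs, long tokenLifetimeMs) {
        return nowMs + tokenLifetimeMs;
    }

    /**
     * True when the DEK still looks valid to the client although the
     * BlockKey it was derived from has already expired.
     */
    public static boolean isUnexpiredDekOnExpiredBlockKey(
            long nowMs, long dekExpiryMs, long blockKeyExpiryMs) {
        return nowMs < dekExpiryMs && nowMs >= blockKeyExpiryMs;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long tokenLifetime = 10 * hour;   // lifetime from the issue report
        long blockKeyExpiry = 12 * hour;  // assumed value, set by the NameNode
        long dekCreatedAt = 8 * hour;     // assumed time the client built the DEK

        // 8h + 10h = 18h, which is later than the BlockKey expiry at 12h.
        long dekExpiry = clientSideDekExpiry(dekCreatedAt, tokenLifetime);

        // At 13h the DEK is "valid" locally but its BlockKey is gone.
        long now = 13 * hour;
        System.out.println(
            isUnexpiredDekOnExpiredBlockKey(now, dekExpiry, blockKeyExpiry));
        // prints "true"
    }
}
```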

There are two potential fixes:
* Change BlockTokenSecretManager so that DEK expiration is based on current 
BlockKey expiration.
* Change Balancer to catch InvalidEncryptionKeyException, generate a new DEK 
and repeat the connection.
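
The two options above could look roughly like the following sketch. All 
types here (StaleKeyException, KeySource, Transfer) are hypothetical 
stand-ins so the example is self-contained; they are not the real HDFS 
classes, and the retry limit is an assumption, not something the patch 
specifies.

```java
import java.io.IOException;

// Hypothetical sketch of both options; not the actual HDFS implementation.
public final class DekFixSketch {
    private DekFixSketch() {}

    /** Stand-in for InvalidEncryptionKeyException. */
    public static class StaleKeyException extends IOException {
        public StaleKeyException(String msg) { super(msg); }
    }

    /** Stand-in for whatever hands out a fresh DEK (KeyManager in the balancer). */
    public interface KeySource {
        long newKey() throws IOException;
    }

    /** Stand-in for one connection attempt that may reject a stale key. */
    public interface Transfer {
        void run(long keyId) throws IOException;
    }

    /** Option 1: tie DEK expiry to the current BlockKey instead of now + lifetime. */
    public static long dekExpiry(long currentBlockKeyExpiryMs) {
        return currentBlockKeyExpiryMs;
    }

    /**
     * Option 2: on a stale-key failure, generate a new key and repeat the
     * connection. Returns the number of attempts that were needed.
     */
    public static int transferWithRetry(KeySource keys, Transfer transfer,
                                        int maxAttempts) throws IOException {
        StaleKeyException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long key = keys.newKey();      // fresh key for every attempt
            try {
                transfer.run(key);
                return attempt;
            } catch (StaleKeyException e) {
                last = e;                  // remember the failure and retry
            }
        }
        if (last != null) {
            throw last;                    // every attempt hit a stale key
        }
        throw new IOException("maxAttempts must be positive");
    }
}
```

Option 1 removes the mismatch at the source, while option 2 only recovers 
from it after a failed handshake, which is why it touches the balancer alone.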

I feel the first fix is the right one. But it changes every participant in 
HDFS, so I want to double-check here.

> Long running balancer may fail due to expired DataEncryptionKey
> ---------------------------------------------------------------
>
>                 Key: HDFS-11741
>                 URL: https://issues.apache.org/jira/browse/HDFS-11741
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>         Environment: CDH5.8.2, Kerberos, Data transfer encryption enabled. 
> Balancer login using keytab
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-11741.001.patch
>
>
> We found that a long-running balancer may fail despite using a keytab, 
> because KeyManager returns an expired DataEncryptionKey, and it throws the 
> following exception:
> {noformat}
> 2017-04-30 05:03:58,661 WARN  [pool-1464-thread-10] balancer.Dispatcher (Dispatcher.java:dispatch(325)) - Failed to move blk_1067352712_3913241 with size=546650 from 10.0.0.134:50010:DISK to 10.0.0.98:50010:DISK through 10.0.0.134:50010
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=1005215027) doesn't exist. Current key: 1005215030
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:311)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2300(Dispatcher.java:182)
>         at org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:899)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This bug is similar in nature to HDFS-10609. While the balancer's 
> KeyManager actively synchronizes its block keys with the NameNode, it does 
> not update the DataEncryptionKey accordingly.
> In one specific cluster, with a Kerberos ticket lifetime of 10 hours and the 
> default block token expiration/lifetime of 10 hours, a long-running balancer 
> failed after 20 to 30 hours.



