[
https://issues.apache.org/jira/browse/HDFS-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025607#comment-16025607
]
Wei-Chiu Chuang edited comment on HDFS-11741 at 5/26/17 12:34 AM:
------------------------------------------------------------------
Attached rev 005 patch.
I removed the change in Dispatcher, because it is hard to unit test and it is
unclear whether the fix would actually work. If this case does happen (it should
only happen with extreme time drift), let's grab the stack trace and logs and file
a new jira to fix it.
As for the proposal to change BlockKeyUpdater or add a DEKUpdater, I don't feel
it is necessary to update the DEK aggressively. As noted in my last comment, after
this patch, the only way the DEK can expire after the associated block key expires
is if the balancer node has a time drift greater than one key update interval (10 hours).
But if you think it's still necessary, I suggest we change
{code}
encryptionKey.expiryDate < timer.now()
{code}
to
{code}
encryptionKey.expiryDate - 3*4*keyUpdateInterval < timer.now()
{code}
This is easier than introducing a new class.
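For illustration only, here is a minimal, self-contained sketch of what a margin-based refresh along those lines could look like. The class, field, and method names below are simplified placeholders rather than the actual KeyManager/DataEncryptionKey code, and the margin size is just an example:
{code}
// Hypothetical sketch of a margin-based DEK refresh. Names are placeholders and
// do not mirror the real balancer KeyManager; the margin size is an assumption.
class DekCache {
  private final long refreshMarginMillis;  // e.g. some fraction of the key update interval
  private long cachedExpiryDate;           // expiry timestamp of the cached DEK
  private Object cachedDek;                // stand-in for DataEncryptionKey

  DekCache(long refreshMarginMillis) {
    this.refreshMarginMillis = refreshMarginMillis;
  }

  synchronized Object getDataEncryptionKey(long nowMillis) {
    // Same shape as the proposed check: treat the cached DEK as expired a little
    // early, so a fresh one is fetched before the DataNodes roll their block keys.
    if (cachedDek == null || cachedExpiryDate - refreshMarginMillis < nowMillis) {
      cachedDek = fetchNewDekFromNameNode();
      cachedExpiryDate = nowMillis + 10L * 60 * 60 * 1000;  // placeholder lifetime (10 hours)
    }
    return cachedDek;
  }

  private Object fetchNewDekFromNameNode() {
    return new Object();  // placeholder for the real NameNode RPC
  }
}
{code}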
> Long running balancer may fail due to expired DataEncryptionKey
> ---------------------------------------------------------------
>
> Key: HDFS-11741
> URL: https://issues.apache.org/jira/browse/HDFS-11741
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Environment: CDH5.8.2, Kerberos, Data transfer encryption enabled.
> Balancer login using keytab
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Attachments: block keys.png, HDFS-11741.001.patch,
> HDFS-11741.002.patch, HDFS-11741.003.patch, HDFS-11741.004.patch,
> HDFS-11741.005.patch
>
>
> We found that a long-running balancer may fail despite using a keytab, because
> KeyManager returns an expired DataEncryptionKey, and it throws the following
> exception:
> {noformat}
> 2017-04-30 05:03:58,661 WARN [pool-1464-thread-10] balancer.Dispatcher (Dispatcher.java:dispatch(325)) - Failed to move blk_1067352712_3913241 with size=546650 from 10.0.0.134:50010:DISK to 10.0.0.98:50010:DISK through 10.0.0.134:50010
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=1005215027) doesn't exist. Current key: 1005215030
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
> 	at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:311)
> 	at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2300(Dispatcher.java:182)
> 	at org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:899)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This bug is similar in nature to HDFS-10609. While the balancer's KeyManager
> actively synchronizes itself with the NameNode w.r.t. block keys, it does not
> update the DataEncryptionKey accordingly.
> In a specific cluster, with a Kerberos ticket lifetime of 10 hours and the
> default block token expiration/lifetime of 10 hours, a long-running balancer
> failed after 20~30 hours.
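As a rough illustration of the failure mode described above, here is a purely hypothetical toy model; the number of old block keys the DataNodes retain is an assumption for the example, not the real BlockTokenSecretManager behavior:
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: DataNodes keep only the last few block keys, so a DEK cached by
// the balancer at startup eventually references a keyID the DataNodes dropped.
public class StaleDekDemo {
  public static void main(String[] args) {
    Deque<Integer> retainedKeyIds = new ArrayDeque<>();
    int currentKeyId = 1005215027;
    retainedKeyIds.add(currentKeyId);

    int cachedDekKeyId = currentKeyId;   // balancer caches a DEK tied to this key

    // Each tick is one key update interval (10 hours in the reported cluster).
    for (int hour = 10; hour <= 30; hour += 10) {
      currentKeyId++;                    // NameNode rolls a new block key
      retainedKeyIds.add(currentKeyId);
      if (retainedKeyIds.size() > 3) {   // assume only a few old keys are retained
        retainedKeyIds.removeFirst();
      }
      boolean dekStillUsable = retainedKeyIds.contains(cachedDekKeyId);
      System.out.printf("hour=%d currentKey=%d cachedDekKey=%d usable=%b%n",
          hour, currentKeyId, cachedDekKeyId, dekStillUsable);
    }
    // After ~20-30 hours the cached DEK's keyID is gone and the SASL handshake
    // fails with InvalidEncryptionKeyException, matching the stack trace above.
  }
}
{code}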