[ https://issues.apache.org/jira/browse/HDFS-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-10609:
-----------------------------------
    Attachment: HDFS-10609.001.patch

Patch v01. A simple fix and a test case.

The test case creates a scenario where a file is present on the cluster, and 
the cluster's BlockTokenSecretManager is manipulated so that the block's 
token expires after a short duration. After sleeping for 15 seconds, the test 
shuts down a datanode in the write pipeline, and the client writes to the 
file again to induce pipeline recovery.
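
For reference, a minimal sketch of the reproduction against a MiniDFSCluster. 
The exact helper used to expire the encryption key and the choice of which 
datanode to stop are illustrative; the attached patch manipulates the NN's 
BlockTokenSecretManager directly:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class TestPipelineRecoveryWithExpiredKey {
  public void testPipelineRecoveryAfterKeyExpiry() throws Exception {
    Configuration conf = new HdfsConfiguration();
    conf.setBoolean(DFSConfigKeys.DFS_ENCRYPT_DATA_TRANSFER_KEY, true);
    conf.setBoolean(DFSConfigKeys.DFS_BLOCK_ACCESS_TOKEN_ENABLE_KEY, true);
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(4).build();
    try {
      cluster.waitActive();
      FSDataOutputStream out =
          cluster.getFileSystem().create(new Path("/test"));
      out.write("hello".getBytes());
      out.hflush();               // block and write pipeline now exist

      // The patch expires the data encryption key at this point by
      // manipulating the NN's BlockTokenSecretManager (not shown here;
      // see the attached patch), then waits for the expiry to take effect.
      Thread.sleep(15 * 1000);

      cluster.stopDataNode(0);    // a DN in the pipeline; index illustrative
      out.write("world".getBytes());
      out.close();                // induces pipeline recovery; without the
                                  // fix, InvalidEncryptionKeyException escapes
    } finally {
      cluster.shutdown();
    }
  }
}
{code}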

Note that because the client retries the block transfer three times, two 
exceptions are possible without the fix (a sketch of the fix follows the 
list):
* InvalidEncryptionKeyException, because the token has expired
* IOException "Failed to replace a bad datanode on the existing pipeline due 
to no more good datanodes being available to try", because the cluster has 
only 4 datanodes: after the first attempt fails with 
InvalidEncryptionKeyException and excludes one datanode, the subsequent 
attempt sees this exception.
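
The fix mirrors the existing handling in {{createBlockOutputStream()}}: catch 
the exception during pipeline recovery, clear the cached encryption key so 
the client refetches it from the NN, and retry. Roughly (a sketch of the 
approach, not the exact patch hunk; variable names are illustrative):
{code:java}
// Inside DataStreamer#transfer(): retry once with a fresh key instead of
// letting InvalidEncryptionKeyException escape to the application.
int refetchEncryptionKey = 1;
while (true) {
  try {
    // ... existing SASL handshake + block transfer to the new datanode ...
    break;
  } catch (InvalidEncryptionKeyException e) {
    if (refetchEncryptionKey > 0) {
      DFSClient.LOG.info("Will fetch a new encryption key and retry, "
          + "encryption key was invalid when connecting to " + src + " : " + e);
      // Makes the client fetch a current key from the NN on the next attempt.
      dfsClient.clearDataEncryptionKey();
      refetchEncryptionKey--;
    } else {
      throw e;
    }
  }
}
{code}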

> Uncaught InvalidEncryptionKeyException during pipeline recovery may abort 
> downstream applications
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10609
>                 URL: https://issues.apache.org/jira/browse/HDFS-10609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: encryption
>    Affects Versions: 2.6.0
>         Environment: CDH5.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10609.001.patch
>
>
> In normal operations, if SASL negotiation fails due to 
> {{InvalidEncryptionKeyException}}, it is typically a benign exception, which 
> is caught and retried:
> {code:title=SaslDataTransferServer#doSaslHandshake}
>   if (ioe instanceof SaslException &&
>       ioe.getCause() != null &&
>       ioe.getCause() instanceof InvalidEncryptionKeyException) {
>     // This could just be because the client is long-lived and hasn't gotten
>     // a new encryption key from the NN in a while. Upon receiving this
>     // error, the client will get a new encryption key from the NN and retry
>     // connecting to this DN.
>     sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
>   } 
> {code}
> {code:title=DFSOutputStream.DataStreamer#createBlockOutputStream}
>   if (ie instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
>     DFSClient.LOG.info("Will fetch a new encryption key and retry, "
>         + "encryption key was invalid when connecting to "
>         + nodes[0] + " : " + ie);
> {code}
> However, if the exception is thrown during pipeline recovery, the 
> corresponding code does not handle it properly, and the exception spills 
> over to downstream applications such as Solr, aborting their operations:
> {quote}
> 2016-07-06 12:12:51,992 ERROR org.apache.solr.update.HdfsTransactionLog: Exception closing tlog.
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=557709482) doesn't exist. Current key: 1350592619
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1308)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1272)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
> 2016-07-06 12:12:51,997 ERROR org.apache.solr.update.CommitTracker: auto commit error...:org.apache.solr.common.SolrException: org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=557709482) doesn't exist. Current key: 1350592619
>         at org.apache.solr.update.HdfsTransactionLog.close(HdfsTransactionLog.java:316)
>         at org.apache.solr.update.TransactionLog.decref(TransactionLog.java:505)
>         at org.apache.solr.update.UpdateLog.addOldLog(UpdateLog.java:380)
>         at org.apache.solr.update.UpdateLog.postCommit(UpdateLog.java:676)
>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:623)
>         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException: Can't re-compute encryption key for nonce, since the required block key (keyID=557709482) doesn't exist. Current key: 1350592619
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1308)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1272)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
> {quote}
> This exception should be contained within HDFS, and caught and retried just 
> like in {{createBlockOutputStream()}}.


