[
https://issues.apache.org/jira/browse/HDFS-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452812#comment-15452812
]
Wei-Chiu Chuang edited comment on HDFS-10609 at 8/31/16 5:25 PM:
-----------------------------------------------------------------
Thanks [~xiaochen] and [~andrew.wang] for reviewing the patch! Took me a while
to refactor the code.
bq. Searching for InvalidEncryptionKeyException, I see a number of places (e.g.
DFSInputStream and DFSStripedInputStream), are those validated for similar
fixes?
Given that the exception is thrown when a SASL client connects
{{SaslDataTransferClient#socketSend}}, there are 7 usage of this method that
could potentially throw the exception.
{{StripedBlockWriter#init}}, {{Dispatcher.PendingMove#dispatch}},
{{DataXceiver#replaceBlock}}, {{DataXceiver#writeBlock}},
{{DataNode.DataTransfer#run}}
In fact most of these methods share the same boilerplate code and should
probably be extracted into a common method. I have not seen the same exception
happening in these places though. These are encryption keys between DataNodes.
I am not sure if a new encryption key should be generated in these scenario.
bq. Is it possible to break the try/catch out into a wrapper function to make
the logic more clear?
Good idea. I refactored most of {{DataStreamer#transfer}} into a new class
{{StreamerStreams}}. I hope this is clearer. Since
{{DataStreamer#createBlockOutputStream}} shares most of the boilerplate code, I
can also refactor that method in a follow up jira.
bq. Any way to avoid the sleep in the test?
I've been trying to do this in a less invasive approach, but I ended up using
{{GenericTestUtils#waitFor}} to detect invalid encryption key. It still uses
sleep underneath, but this is still better than an arbitrary sleep period.
bq. Does the test verify that a new key was properly fetched? It's better to
assert the behavior you're looking for.
Added an assert to ensure the cached encryption key is cleared. The existing
logic fetches a new encryption key when there's none available.
bq. Good opportunity to refactor the test class to use @Before and @After to do
init and teardown, and avoid the try/catch.
Done.
was (Author: jojochuang):
Thanks [~xiaochen] and [~andrew.wang] for reviewing the patch! Took me a while
to refactor the code.
bq. Searching for InvalidEncryptionKeyException, I see a number of places (e.g.
DFSInputStream and DFSStripedInputStream), are those validated for similar
fixes?
Given that the exception is thrown when a SASL client connects
{{SaslDataTransferClient#socketSend}}, there are 7 usage of this method that
could potentially throw the exception.
{{StripedBlockWriter#init}}, {{Dispatcher.PendingMove#dispatch}},
{{DataXceiver#replaceBlock}}, {{DataXceiver#writeBlock}},
{{DataNode.DataTransfer#run}}
In fact more of these methods share the same boilerplate code and should
probably be extracted into a common method. I have not seen the same exception
happening in these places though. These are encryption keys between DataNodes.
I am not sure if a new encryption key should be generated in these scenario.
bq. Is it possible to break the try/catch out into a wrapper function to make
the logic more clear?
Good idea. I refactored most of {{DataStreamer#transfer}} into a new class
{{StreamerStreams}}. I hope this is clearer. Since
{{DataStreamer#createBlockOutputStream}} shares most of the boilerplate code, I
can also refactor that method in a follow up jira.
bq. Any way to avoid the sleep in the test?
I've been trying to do this in a less invasive approach, but I ended up using
{{GenericTestUtils#waitFor}} to detect invalid encryption key. It still uses
sleep underneath, but this is still better than an arbitrary sleep period.
bq. Does the test verify that a new key was properly fetched? It's better to
assert the behavior you're looking for.
Added an assert to ensure the cached encryption key is cleared. The existing
logic fetches a new encryption key when there's none available.
bq. Good opportunity to refactor the test class to use @Before and @After to do
init and teardown, and avoid the try/catch.
Done.
> Uncaught InvalidEncryptionKeyException during pipeline recovery may abort
> downstream applications
> -------------------------------------------------------------------------------------------------
>
> Key: HDFS-10609
> URL: https://issues.apache.org/jira/browse/HDFS-10609
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: encryption
> Affects Versions: 2.6.0
> Environment: CDH5.8.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10609.001.patch, HDFS-10609.002.patch,
> HDFS-10609.003.patch
>
>
> In normal operations, if SASL negotiation fails due to
> {{InvalidEncryptionKeyException}}, it is typically a benign exception, which
> is caught and retried :
> {code:title=SaslDataTransferServer#doSaslHandshake}
> if (ioe instanceof SaslException &&
> ioe.getCause() != null &&
> ioe.getCause() instanceof InvalidEncryptionKeyException) {
> // This could just be because the client is long-lived and hasn't gotten
> // a new encryption key from the NN in a while. Upon receiving this
> // error, the client will get a new encryption key from the NN and retry
> // connecting to this DN.
> sendInvalidKeySaslErrorMessage(out, ioe.getCause().getMessage());
> }
> {code}
> {code:title=DFSOutputStream.DataStreamer#createBlockOutputStream}
> if (ie instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
> DFSClient.LOG.info("Will fetch a new encryption key and retry, "
> + "encryption key was invalid when connecting to "
> + nodes[0] + " : " + ie);
> {code}
> However, if the exception is thrown during pipeline recovery, the
> corresponding code does not handle it properly, and the exception is spilled
> out to downstream applications, such as SOLR, aborting its operation:
> {quote}
> 2016-07-06 12:12:51,992 ERROR org.apache.solr.update.HdfsTransactionLog:
> Exception closing tlog.
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException:
> Can't re-compute encryption key for nonce, since the required block key
> (keyID=557709482) doesn't exist. Current key: 1350592619
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1308)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1272)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
> 2016-07-06 12:12:51,997 ERROR org.apache.solr.update.CommitTracker: auto
> commit error...:org.apache.solr.common.SolrException:
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException:
> Can't re-compute encryption key for nonce, since the required block key
> (keyID=557709482) doesn't exist. Current key: 1350592619
> at
> org.apache.solr.update.HdfsTransactionLog.close(HdfsTransactionLog.java:316)
> at
> org.apache.solr.update.TransactionLog.decref(TransactionLog.java:505)
> at org.apache.solr.update.UpdateLog.addOldLog(UpdateLog.java:380)
> at org.apache.solr.update.UpdateLog.postCommit(UpdateLog.java:676)
> at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:623)
> at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.hdfs.protocol.datatransfer.InvalidEncryptionKeyException:
> Can't re-compute encryption key for nonce, since the required block key
> (keyID=557709482) doesn't exist. Current key: 1350592619
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:417)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:474)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:299)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:242)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.socketSend(SaslDataTransferClient.java:183)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1308)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1272)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
> {quote}
> This exception should be contained within HDFS, caught and retried just like
> in {{createBlockOutputStream()}}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]