[
https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827821#comment-13827821
]
Chris Nauroth commented on HDFS-5014:
-------------------------------------
+1 for the latest from me too. [~vinayrpet], thanks so much for providing a
patch for this tricky issue and responding to all of the code review feedback.
> BPOfferService#processCommandFromActor() synchronization on namenode RPC call
> delays IBR to Active NN, if Standby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-5014
> URL: https://issues.apache.org/jira/browse/HDFS-5014
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, ha
> Affects Versions: 3.0.0, 2.0.4-alpha
> Reporter: Vinay
> Assignee: Vinay
> Attachments: HDFS-5014-v2.patch, HDFS-5014-v2.patch,
> HDFS-5014-v2.patch, HDFS-5014-v2.patch, HDFS-5014-v2.patch,
> HDFS-5014-v2.patch, HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch,
> HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch
>
>
> In one of our clusters, the following sequence occurred and caused HDFS writes to fail.
> 1. The Standby NN (SNN) was unstable and continuously restarting due to some errors, but
> the Active NN was stable.
> 2. An MR job was writing files.
> 3. At some point the SNN went down again while the datanode was processing the REGISTER
> command for the SNN.
> 4. Datanodes started retrying to connect to the SNN to register, at the following
> code in BPServiceActor#retrieveNamespaceInfo(), which is called under
> synchronization.
> {code}
> try {
>   nsInfo = bpNamenode.versionRequest();
>   LOG.debug(this + " received versionRequest response: " + nsInfo);
>   break;
> {code}
> Unfortunately, all datanodes hit this at the same point.
> 5. For the next 7-8 minutes the standby was down; during that time no blocks were
> reported to the active NN, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is completely
> synchronized, which is not required.
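
The locking problem described above can be illustrated with a small, self-contained sketch. This is not the HDFS-5014 patch and uses no Hadoop classes: NarrowLockSketch, slowVersionRequest, registerHoldingLock, registerNarrowLock and reportToActive are hypothetical names invented for illustration. It assumes a single shared lock guards both the standby registration path and the active-NN reporting path, and contrasts holding that lock across a blocking RPC with taking it only to update shared state.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a block-pool service; not a Hadoop class. */
class NarrowLockSketch {
    private final Object lock = new Object();
    private String nsInfo; // shared state guarded by lock

    /** Stand-in for a namenode RPC that blocks while the standby is down. */
    private static String slowVersionRequest(CountDownLatch rpcStarted, CountDownLatch standbyUp) {
        rpcStarted.countDown();                        // signal: the RPC is in flight
        awaitQuietly(standbyUp, 5, TimeUnit.SECONDS);  // block until the "standby" answers
        return "nsInfo";
    }

    /** Problematic shape: the lock is held for the whole blocking RPC. */
    void registerHoldingLock(CountDownLatch rpcStarted, CountDownLatch standbyUp) {
        synchronized (lock) {
            nsInfo = slowVersionRequest(rpcStarted, standbyUp);
        }
    }

    /** Narrowed shape: the RPC runs unlocked; only the state update is locked. */
    void registerNarrowLock(CountDownLatch rpcStarted, CountDownLatch standbyUp) {
        String info = slowVersionRequest(rpcStarted, standbyUp);
        synchronized (lock) {
            nsInfo = info;
        }
    }

    /** Stand-in for sending a block report to the active NN; needs the same lock. */
    void reportToActive(CountDownLatch reported) {
        synchronized (lock) {
            reported.countDown();
        }
    }

    static boolean awaitQuietly(CountDownLatch latch, long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the report completes while the standby RPC is still blocked. */
    static boolean activeReportCompletes(boolean narrowLock) {
        NarrowLockSketch service = new NarrowLockSketch();
        CountDownLatch rpcStarted = new CountDownLatch(1);
        CountDownLatch standbyUp = new CountDownLatch(1);
        CountDownLatch reported = new CountDownLatch(1);

        Thread register = new Thread(() -> {
            if (narrowLock) {
                service.registerNarrowLock(rpcStarted, standbyUp);
            } else {
                service.registerHoldingLock(rpcStarted, standbyUp);
            }
        });
        Thread report = new Thread(() -> service.reportToActive(reported));

        register.start();
        awaitQuietly(rpcStarted, 5, TimeUnit.SECONDS); // wait until the standby RPC is in flight
        report.start();
        boolean done = awaitQuietly(reported, 500, TimeUnit.MILLISECONDS);

        standbyUp.countDown(); // the "standby" comes back; both threads can finish
        joinQuietly(register);
        joinQuietly(report);
        return done;
    }

    public static void main(String[] args) {
        System.out.println("narrow lock, report completes: " + activeReportCompletes(true));
        System.out.println("full lock, report completes:   " + activeReportCompletes(false));
    }
}
```

In this sketch, the report to the active side stalls only when the registration path holds the lock across the blocking call, which mirrors the symptom in the description. How the real code should divide its critical sections is a matter for the attached patches.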