[
https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749423#comment-13749423
]
Vinay commented on HDFS-5014:
-----------------------------
{quote}However, I think updateActorStatesFromHeartbeat needs to hold the write
lock for the entire method.
{quote}
Actually, if we hold writeLock() throughout the method, that is effectively
the same as adding method-level synchronization, and we would get the same
issue that this jira describes.
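To make the lock-scope point concrete, here is a minimal Java sketch. It is not the HDFS code: the class LockScopeSketch, its methods, and slowRpc() (a stand-in for a namenode RPC that blocks while the standby is down) are all hypothetical names chosen for illustration. It compares holding the write lock across the blocking call with releasing it before the call, and checks whether a second thread can take the read lock in the meantime.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not the BPOfferService implementation.
class LockScopeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final CountDownLatch inRpc = new CountDownLatch(1);
    private long lastActiveClaimTxId; // illustrative shared state

    // Stand-in for a blocking namenode RPC; signals that it has been entered.
    private void slowRpc() throws InterruptedException {
        inRpc.countDown();
        Thread.sleep(5_000);
    }

    // Write lock held for the whole method, including the blocking call.
    void wholeMethodLocked(long txId) throws InterruptedException {
        lock.writeLock().lock();
        try {
            lastActiveClaimTxId = txId;
            slowRpc();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Write lock held only while shared state is updated.
    void narrowLock(long txId) throws InterruptedException {
        lock.writeLock().lock();
        try {
            lastActiveClaimTxId = txId;
        } finally {
            lock.writeLock().unlock();
        }
        slowRpc();
    }

    // Returns true if another thread can take the read lock within 200 ms
    // while the actor thread is inside slowRpc().
    static boolean readerGetsThrough(boolean lockWholeMethod) {
        LockScopeSketch s = new LockScopeSketch();
        Thread actor = new Thread(() -> {
            try {
                if (lockWholeMethod) s.wholeMethodLocked(1L); else s.narrowLock(1L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        actor.start();
        try {
            s.inRpc.await(); // wait until the actor is in the blocking call
            boolean acquired = s.lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) s.lock.readLock().unlock();
            actor.interrupt(); // cut the simulated RPC short
            actor.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("whole-method lock, reader got lock: " + readerGetsThrough(true));
        System.out.println("narrow lock, reader got lock: " + readerGetsThrough(false));
    }
}
```

In this sketch the reader times out when the lock spans the blocking call and succeeds when the lock is released first, which is the behaviour the paragraph above is concerned about.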
{quote}we could get some unfortunate interleavings of state in
bpServiceToActive, bposThinksActive, lastActiveClaimTxId, and
isMoreRecentClaim.{quote}
I didn't understand how we could get 'some unfortunate interleavings'. Can you
please explain? I thought that since we are reading under the lock, we would
always get the correct state.
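For reference, one general way an interleaving can arise even when every read is done under a lock is when related fields are read in separate lock acquisitions. The Java sketch below is an assumption-laden illustration, not the patch: InterleavingSketch, readActive(), readTxId(), failover() and readSnapshot() are hypothetical, and the field names are borrowed from the quote only as labels. Whether the attached patch has this shape is not established here. The schedule is replayed by hand on a single thread so the outcome is deterministic.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a check-then-act gap, not the BPOfferService code.
class InterleavingSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String bpServiceToActive = "nn1"; // illustrative state
    private long lastActiveClaimTxId = 100;   // illustrative state

    String readActive() {
        lock.readLock().lock();
        try { return bpServiceToActive; } finally { lock.readLock().unlock(); }
    }

    long readTxId() {
        lock.readLock().lock();
        try { return lastActiveClaimTxId; } finally { lock.readLock().unlock(); }
    }

    // Both fields change together under the write lock.
    void failover(String newActive, long txId) {
        lock.writeLock().lock();
        try {
            bpServiceToActive = newActive;
            lastActiveClaimTxId = txId;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Both fields read in one lock acquisition: always a consistent pair.
    String readSnapshot() {
        lock.readLock().lock();
        try {
            return bpServiceToActive + "@" + lastActiveClaimTxId;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Replays: reader reads active, writer fails over, reader reads txid.
    static String replayInterleaving() {
        InterleavingSketch s = new InterleavingSketch();
        String active = s.readActive(); // sees the old active
        s.failover("nn2", 200);         // another thread's update lands here
        long txId = s.readTxId();       // sees the new txid
        return active + "@" + txId;     // a pair that never existed together
    }

    static String replayAtomic() {
        InterleavingSketch s = new InterleavingSketch();
        s.failover("nn2", 200);
        return s.readSnapshot();
    }

    public static void main(String[] args) {
        System.out.println(replayInterleaving());
        System.out.println(replayAtomic());
    }
}
```

Each individual read here is correct at the moment it is taken; the mixed pair only appears because the lock is released between the two reads.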
Regarding findbugs, I will try to fix it.
> BPOfferService#processCommandFromActor() synchronization on namenode RPC call
> delays IBR to Active NN, if Standby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-5014
> URL: https://issues.apache.org/jira/browse/HDFS-5014
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, ha
> Affects Versions: 3.0.0, 2.0.4-alpha
> Reporter: Vinay
> Assignee: Vinay
> Attachments: HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch
>
>
> In one of our clusters, the following happened, which caused HDFS writes to fail.
> 1. The Standby NN was unstable and continuously restarting due to some errors,
> but the Active NN was stable.
> 2. An MR job was writing files.
> 3. At some point the SNN went down again while the datanode was processing the
> REGISTER command for the SNN.
> 4. Datanodes started retrying to connect to the SNN to register, at the following
> code in BPServiceActor#retrieveNamespaceInfo(), which is called under
> synchronization.
> {code}
> try {
>   nsInfo = bpNamenode.versionRequest();
>   LOG.debug(this + " received versionRequest response: " + nsInfo);
>   break;
> {code}
> Unfortunately, this happened on all datanodes at the same point.
> 5. For the next 7-8 min the standby was down, no blocks were reported to the
> active NN during this time, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is completely
> synchronized, which is not required.