[
https://issues.apache.org/jira/browse/HDFS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749447#comment-13749447
]
Chris Nauroth commented on HDFS-5014:
-------------------------------------
I'm going to be mostly offline until 9/2. I'm not +1 for the current patch,
but I would be +1 if it were changed to hold the write lock for all of
{{updateActorStatesFromHeartbeat}}. I'd like to seek out a second opinion
though, because concurrency is hard. :-) [~tlipcon], maybe you could take a
look since you originally coded HDFS-2627?
Vinay, thank you for reporting the bug and working through the feedback on the
patch.
bq. I didn't understand how we could get 'some unfortunate interleavings'. Can
you please explain.. I thought since we are reading under lock, we would get
correct state always.
The challenge is that we are updating multiple pieces of data
({{bpServiceToActive}}, {{bposThinksActive}}, {{lastActiveClaimTxId}}, and
{{isMoreRecentClaim}}), and the failover handling logic isn't correct unless
those pieces of data are consistent with one another. Without mutual
exclusion, we can end up with {{bposThinksActive}} true even though
{{bpServiceToActive}} has changed, or with {{isMoreRecentClaim}} true even
though a more recent transaction ID has already been seen.
In the following examples, assume T1 and T2 are separate threads for separate
{{BPServiceActor}} instances/separate name nodes.
Example 1: race on {{bposThinksActive}}
# T1 acquires read lock.
# T1 calculates {{bposThinksActive}} true.
# T1 releases read lock.
# The OS suspends T1 here. (Note that T1 is holding neither read lock nor
write lock at this point.)
# T2 wakes up, processes a heartbeat, and executes all of
{{updateActorStatesFromHeartbeat}}. It determines that its namenode is the new
active, so it updates {{bpServiceToActive}}.
# The OS suspends T2 here.
# T1 resumes. It still thinks {{bposThinksActive}} is true, even though the
other thread has changed {{bpServiceToActive}}.
# T1 can now fall into the {{else if (!nnClaimsActive && bposThinksActive)}}
block and set {{bpServiceToActive}} to null.
# The OS suspends T1 and resumes T2.
# T2 now erroneously skips commands from the active NN, because
{{bpServiceToActive}} is null.
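At its core this is the classic check-then-act race: the flag is computed under the
read lock but acted on after the lock has been dropped. A minimal sketch of that
pattern, using hypothetical simplified names rather than the actual
{{BPOfferService}} code:
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, stripped-down model of the interleaving in Example 1;
// names mirror the real fields, but this is not the actual BPOfferService code.
class HeartbeatRaceSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object bpServiceToActive;   // which actor we currently believe is active

  void updateActorStatesFromHeartbeat(Object actor, boolean nnClaimsActive) {
    boolean bposThinksActive;
    lock.readLock().lock();
    try {
      bposThinksActive = (bpServiceToActive == actor);  // step 2: computed under read lock
    } finally {
      lock.readLock().unlock();                         // step 3: read lock released
    }
    // Window (steps 4-6): the other actor's thread can run this whole method
    // here and repoint bpServiceToActive, leaving bposThinksActive stale.
    lock.writeLock().lock();
    try {
      if (!nnClaimsActive && bposThinksActive) {        // step 8: acts on the stale flag
        bpServiceToActive = null;                       // clobbers the other actor's claim
      }
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}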
Example 2: race on {{isMoreRecentClaim}}
# T1 acquires read lock.
# T1 calculates {{isMoreRecentClaim}} as true, based on {{lastActiveClaimTxId}} = 1
and {{txid}} = 2 from the heartbeat.
# T1 releases read lock.
# The OS suspends T1 here. (Note that T1 is holding neither read lock nor
write lock at this point.)
# T2 wakes up, processes a heartbeat, and executes all of
{{updateActorStatesFromHeartbeat}}. It updates {{lastActiveClaimTxId}} to 3.
# The OS suspends T2 here.
# T1 resumes. It still thinks {{isMoreRecentClaim}} is true, even though the
other thread has changed {{lastActiveClaimTxId}}.
# T1 can now set {{bpServiceToActive}} to its own actor instead of going into
the split-brain warning path.
# The OS suspends T1 and resumes T2.
# T2 now erroneously skips commands from the active NN, because
{{bpServiceToActive}} was changed.
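Both examples have the same shape, and holding the write lock for the entire method
closes both windows: the comparison against {{lastActiveClaimTxId}} and the updates
it guards become one atomic step. Roughly, as a hedged sketch with illustrative
names rather than the actual patch:
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the "write lock around the whole method" shape I'd be +1 on;
// field and parameter names are illustrative, not the real patch.
class HeartbeatUpdateSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object bpServiceToActive;
  private long lastActiveClaimTxId;

  void updateActorStatesFromHeartbeat(Object actor, boolean nnClaimsActive, long claimTxId) {
    lock.writeLock().lock();   // one lock span covers the compare *and* the updates
    try {
      boolean bposThinksActive = (bpServiceToActive == actor);
      boolean isMoreRecentClaim = (claimTxId > lastActiveClaimTxId);

      if (nnClaimsActive && !bposThinksActive) {
        if (!isMoreRecentClaim) {
          return;   // split-brain warning path: a newer claim was already recorded
        }
        bpServiceToActive = actor;
      } else if (!nnClaimsActive && bposThinksActive) {
        bpServiceToActive = null;
      }
      if (bpServiceToActive == actor) {
        lastActiveClaimTxId = claimTxId;   // recorded under the same lock
      }
      // No RPC or polling loop runs in here, so nothing can block while the
      // write lock is held.
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}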
bq. Actually if we try holding writeLock() throughout the method, then that will
be the same as adding method-level synchronization, where we would get the same
issue as this jira.
My understanding is that you had an actor get stuck in the re-registration loop
when it got a {{DNA_REGISTER}} request from a flapping namenode. Since
{{processCommandFromActor}} was synchronized, this also blocked the actor
thread for the healthy namenode from making progress. Therefore, we can solve
the problem by allowing concurrent executions of {{processCommandFromActor}},
so that the healthy actor can make progress even if the unhealthy actor gets
stuck. However, we don't need to allow concurrent execution of the
heartbeat/HA failover processing (and I think it would be incorrect to do so
based on the examples above).
Unlike {{processCommandFromActor}}, the {{updateActorStatesFromHeartbeat}}
method only changes local in-memory state. There is no polling loop or RPC
involved, so there is no risk that one actor gets stuck inside this method and
blocks the other. I expect minimal lock contention on the write lock.
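To make the split concrete, here is roughly the locking shape I have in mind. It is
only a hypothetical sketch (placeholder bodies and a made-up
{{reRegisterIfRequested}} helper), not the actual patch:
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the locking split under discussion; not the real patch.
class LockingSplitSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void processCommandFromActor(Object cmd, Object actor) {
    // The potentially long-running part (the re-registration loop that RPCs a
    // flapping NN) runs without holding the shared lock, so one stuck actor can
    // no longer block the other actor's command processing.
    reRegisterIfRequested(cmd, actor);   // hypothetical helper, no lock held

    lock.readLock().lock();              // short, in-memory checks only
    try {
      // decide whether the command came from the actor we consider active
    } finally {
      lock.readLock().unlock();
    }
  }

  void updateActorStatesFromHeartbeat(Object actor, Object nnHaState) {
    // Purely in-memory and quick, so the write lock is held for the whole
    // method, keeping bpServiceToActive, lastActiveClaimTxId, and the flags
    // derived from them mutually consistent.
    lock.writeLock().lock();
    try {
      // compare claim txids and update bpServiceToActive atomically
    } finally {
      lock.writeLock().unlock();
    }
  }

  private void reRegisterIfRequested(Object cmd, Object actor) {
    // placeholder for the re-registration path
  }
}
{code}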
> BPOfferService#processCommandFromActor() synchronization on namenode RPC call
> delays IBR to Active NN, if Standby NN is unstable
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-5014
> URL: https://issues.apache.org/jira/browse/HDFS-5014
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, ha
> Affects Versions: 3.0.0, 2.0.4-alpha
> Reporter: Vinay
> Assignee: Vinay
> Attachments: HDFS-5014.patch, HDFS-5014.patch, HDFS-5014.patch,
> HDFS-5014.patch
>
>
> In one of our clusters, the following happened and caused HDFS writes to fail.
> 1. The Standby NN was unstable and continuously restarting due to some errors, but
> the Active NN was stable.
> 2. An MR job was writing files.
> 3. At some point the SNN went down again while the datanode was processing the
> REGISTER command for the SNN.
> 4. Datanodes started retrying to connect to the SNN to register, at the following
> code in BPServiceActor#retrieveNamespaceInfo(), which is called under
> synchronization.
> {code}
> try {
>   nsInfo = bpNamenode.versionRequest();
>   LOG.debug(this + " received versionRequest response: " + nsInfo);
>   break;
> {code}
> Unfortunately, this happened in all datanodes at the same point.
> 5. For the next 7-8 minutes the standby was down, no blocks were reported to the
> active NN, and writes failed.
> So the culprit is that {{BPOfferService#processCommandFromActor()}} is completely
> synchronized, which is not required.