[
https://issues.apache.org/jira/browse/HDFS-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900143#comment-15900143
]
Chen Liang edited comment on HDFS-11507 at 3/7/17 9:01 PM:
-----------------------------------------------------------
I found that the global lock is taken *before entering* the {{chooseRandom}}
call and released *after* it returns, so {{chooseRandom}} is already
synchronized with node add/remove. I was confused by the lock being acquired
again in {{countNumOfAvailableNodes}} and thought that was the first time the
lock is acquired. So this race condition does not exist. Closing this JIRA as
not a problem.
was (Author: vagarychen):
I found that there is global locking *before and after* entering chooseRandom
call. In which case chooseRandom is already synchronized with node add/remove.
I was confused by acquiring the lock again in {{countNumOfAvailableNodes}} and
thought this is the first time the lock is acquired. Now this race condition
does not exist. Close this JIRA as not a problem.
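To illustrate the point of the comment, here is a minimal sketch of nested lock acquisition. It is not the actual Hadoop code: the class {{TopologyLockSketch}}, the field {{lock}}, the helper {{chooseRandomInner}} and the counter {{maxReadHolds}} are hypothetical, and the sketch assumes the global lock is a reentrant read/write lock. It only shows that a lock taken again inside {{countNumOfAvailableNodes}} is a second hold on a lock the caller already owns, so node add/remove cannot interleave with the selection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the topology; not the real NetworkTopology class.
class TopologyLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> nodes = new ArrayList<>();
    int maxReadHolds = 0; // instrumentation for this sketch only

    // Node add/remove take the write lock, so they wait for all readers.
    void add(String node) {
        lock.writeLock().lock();
        try { nodes.add(node); } finally { lock.writeLock().unlock(); }
    }

    void remove(String node) {
        lock.writeLock().lock();
        try { nodes.remove(node); } finally { lock.writeLock().unlock(); }
    }

    // The lock is taken before entering the inner call and released after it.
    String chooseRandom() {
        lock.readLock().lock();
        try {
            return chooseRandomInner();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Runs entirely under the caller's read lock.
    private String chooseRandomInner() {
        int availableNodes = countNumOfAvailableNodes();
        return availableNodes > 0 ? nodes.get(0) : null; // selection simplified
    }

    // Acquires the read lock again; for this thread it is a reentrant hold.
    private int countNumOfAvailableNodes() {
        lock.readLock().lock();
        try {
            maxReadHolds = Math.max(maxReadHolds, lock.getReadHoldCount());
            return nodes.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this sketch the count and the selection happen under one continuous read hold, which is why a concurrent {{remove}} cannot invalidate {{availableNodes}} between the two.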
> NetworkTopology#chooseRandom may run into a dead loop due to race condition
> ---------------------------------------------------------------------------
>
> Key: HDFS-11507
> URL: https://issues.apache.org/jira/browse/HDFS-11507
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: Chen Liang
> Assignee: Chen Liang
>
> {{NetworkTopology#chooseRandom()}} works as follows:
> 1. It counts the number of available nodes as {{availableNodes}}.
> 2. It checks how many nodes are excluded and deducts them from
> {{availableNodes}}.
> 3. If {{availableNodes}} is still > 0, then there are nodes available.
> 4. It keeps looping until it finds such a node.
> Now imagine that, in the meantime, the nodes that were actually available
> get removed during step 3 or step 4, so that all remaining nodes are excluded
> nodes. Then, although no node is actually available any more, the code still
> runs as if {{availableNodes}} > 0. It keeps picking excluded nodes and loops
> forever, because
> {{if (excludedNodes == null || !excludedNodes.contains(ret))}}
> will always be false.
> We may fix this by expanding the while loop to also include the
> {{availableNodes}} calculation, so that {{availableNodes}} is re-calculated
> every time the loop fails to find an available node.
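The fix proposed in the description can be sketched as follows. This is a simplified, single-threaded illustration, not the Hadoop implementation: the class {{ChooseRandomSketch}}, the helper {{countAvailable}} and the use of plain strings as nodes are assumptions made for the example. It shows the loop shape in which {{availableNodes}} is re-counted on every iteration, so the loop terminates once only excluded nodes remain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical stand-in for the selection logic; not the real NetworkTopology.
class ChooseRandomSketch {
    private static final Random RANDOM = new Random();

    // Steps 1 and 2: count the nodes that are not excluded.
    static int countAvailable(List<String> nodes, Set<String> excludedNodes) {
        int availableNodes = 0;
        for (String node : nodes) {
            if (excludedNodes == null || !excludedNodes.contains(node)) {
                availableNodes++;
            }
        }
        return availableNodes;
    }

    // Steps 3 and 4, with the count moved inside the loop condition.
    static String chooseRandom(List<String> nodes, Set<String> excludedNodes) {
        while (countAvailable(nodes, excludedNodes) > 0) {
            String ret = nodes.get(RANDOM.nextInt(nodes.size()));
            if (excludedNodes == null || !excludedNodes.contains(ret)) {
                return ret;
            }
            // An excluded node was picked; re-count before trying again.
        }
        return null; // no available node remains
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> excluded = new HashSet<>(Arrays.asList("dn1", "dn2"));
        System.out.println(chooseRandom(nodes, excluded)); // only dn3 is eligible
        excluded.add("dn3");
        System.out.println(chooseRandom(nodes, excluded)); // nothing is eligible
    }
}
```

Re-counting on each pass costs a full scan per iteration. As the later comment on this issue concludes, it is unnecessary when the caller already holds the global lock for the whole call.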