[ 
https://issues.apache.org/jira/browse/HDFS-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900119#comment-15900119
 ] 

Chen Liang edited comment on HDFS-11507 at 3/7/17 8:53 PM:
-----------------------------------------------------------

I noticed the comments below in the NetworkTopology#chooseRandom method.
{code}
// We've counted numOfAvailableNodes inside the lock, so there must be
// at least 1 satisfying node. Keep trying until we found it.
{code}
But I don't think the locking in 
{{countNumOfAvailableNodes(scope, excludedNodes);}} resolves the issue in this 
JIRA. That lock only guarantees that no node is added or removed while the 
count is being taken. Since {{chooseRandom}} itself holds no lock across the 
whole call, nodes can still be added or removed during {{chooseRandom}}, after 
{{countNumOfAvailableNodes}} has returned.
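
A minimal sketch of this concern, assuming a simplified stand-in for the topology. The class {{StaleCountSketch}} and its methods are hypothetical and are not the Hadoop {{NetworkTopology}} API. The point it illustrates is that the lock covers only the count, so the returned value can be out of date by the time the caller uses it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a topology; NOT the Hadoop NetworkTopology class.
class StaleCountSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> nodes = new ArrayList<>();

  void add(String node) {
    lock.writeLock().lock();
    try {
      nodes.add(node);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void remove(String node) {
    lock.writeLock().lock();
    try {
      nodes.remove(node);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // The read lock protects only the counting itself. It is released before
  // the method returns, so the caller holds a value that may already be stale.
  int countAvailable(Set<String> excluded) {
    lock.readLock().lock();
    try {
      int available = 0;
      for (String node : nodes) {
        if (!excluded.contains(node)) {
          available++;
        }
      }
      return available;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

If a caller counts one available node, and that node is then removed before the caller starts picking, the caller still believes one node is available even though a second count would return zero.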

Since these comments were added in HDFS-10320, [~mingma] do you have comments?


> NetworkTopology#chooseRandom may run into a dead loop due to race condition
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-11507
>                 URL: https://issues.apache.org/jira/browse/HDFS-11507
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>
> {{NetworkTopology#chooseRandom()}} works as follows:
> 1. count the number of available nodes as {{availableNodes}},
> 2. check how many nodes are excluded and deduct them from {{availableNodes}},
> 3. if {{availableNodes}} is still > 0, then there are nodes available,
> 4. keep looping until such a node is found.
> But now imagine that, in the meantime, the nodes that were actually available 
> get removed during step 3 or step 4, and all remaining nodes are excluded 
> nodes. Then, although no node is actually available any more, the code still 
> proceeds as if {{availableNodes}} > 0. It keeps picking excluded nodes and 
> loops forever, because 
> {{if (excludedNodes == null || !excludedNodes.contains(ret))}} 
> will always be false.
> We may fix this by expanding the while loop to also include the 
> {{availableNodes}} calculation, so that {{availableNodes}} is re-calculated 
> every time the loop fails to find an available node.
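
A rough sketch of the fix proposed in the description above, under simplifying assumptions. {{RecountingChooser}} is a hypothetical class, not the actual Hadoop code or patch. Nodes are plain strings, and each loop iteration recounts against a fresh snapshot of the node list rather than relying on a count taken once before the loop.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Random;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simplified model; NOT the Hadoop NetworkTopology class.
class RecountingChooser {
  private final List<String> nodes = new CopyOnWriteArrayList<>();
  private final Random random = new Random();

  void add(String node) {
    nodes.add(node);
  }

  void remove(String node) {
    nodes.remove(node);
  }

  private static boolean isAvailable(String node, Collection<String> excluded) {
    return excluded == null || !excluded.contains(node);
  }

  // Returns a random non-excluded node, or null when none is available.
  String chooseRandom(Collection<String> excluded) {
    while (true) {
      // Take a fresh snapshot on every attempt so that concurrent removals
      // are observed instead of trusting a count computed before the loop.
      List<String> snapshot = new ArrayList<>(nodes);
      int availableNodes = 0;
      for (String node : snapshot) {
        if (isAvailable(node, excluded)) {
          availableNodes++;
        }
      }
      if (availableNodes <= 0) {
        return null; // nothing left to choose, so give up instead of spinning
      }
      String ret = snapshot.get(random.nextInt(snapshot.size()));
      if (isAvailable(ret, excluded)) {
        return ret;
      }
      // Picked an excluded node: loop again, which recounts first.
    }
  }
}
```

With this shape, if every non-excluded node disappears while the loop is running, the next iteration sees a count of zero and returns null rather than looping forever. Recounting on each attempt costs an extra pass over the nodes, which may or may not be acceptable for the real topology.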



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
