[ 
https://issues.apache.org/jira/browse/HDFS-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375264#comment-15375264
 ] 

Eric Badger commented on HDFS-10618:
------------------------------------

Inside the Replication Monitor, 
BlockManager.computeReconstructionWorkForBlocks() removes blocks from 
neededReconstruction, computes target locations for those blocks, and then 
places them into pendingReconstruction. However, the write lock is released 
before the locations are computed and reacquired only to add the blocks to 
pendingReconstruction, so for a window of time the blocks are in neither 
structure. testPendingAndInvalidate can expose this race condition because it 
also indirectly calls BlockManager.computeReconstructionWorkForBlocks(). The 
following scenario outlines how this test can fail:

1. ReplicationMonitor calls computeReconstructionWorkForBlocks, removes blocks 
from neededReconstruction, releases the write lock, and takes time computing 
the locations for replication.
2. testPendingAndInvalidate calls computeReconstructionWorkForBlocks, sees 
nothing in neededReconstruction, spends 0 time computing locations, adds 
nothing to pendingReconstruction, and returns. 
3. testPendingAndInvalidate calls updateState(), which indirectly sets 
pendingReconstructionBlocksCount to the current size of pendingReconstruction 
(which is 0, since the Replication Monitor is still computing the block 
locations and hasn't yet added the blocks to pendingReconstruction).
4. testPendingAndInvalidate checks the value of 
pendingReconstructionBlocksCount via getPendingReconstructionBlocksCount() and 
sees that it is 0, causing the associated assert to fail.
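
The interleaving above can be reproduced deterministically in a small 
standalone model. This is only a sketch, not Hadoop code: the class 
ReconstructionRaceSketch, its methods, and the block names are hypothetical 
stand-ins for the BlockManager bookkeeping, and the latches exist solely to 
force the ordering described in the steps (the monitor thread is paused while 
it is "computing locations" with the lock released).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the needed/pending bookkeeping; not the real BlockManager.
class ReconstructionRaceSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Queue<String> neededReconstruction = new ArrayDeque<>();
    private final List<String> pendingReconstruction = new ArrayList<>();

    ReconstructionRaceSketch(List<String> blocks) {
        neededReconstruction.addAll(blocks);
    }

    // Three phases: drain "needed" under the write lock, do the slow work
    // with the lock released, then reacquire the lock to fill "pending".
    void computeReconstructionWork(Runnable whileUnlocked) {
        List<String> work = new ArrayList<>();
        lock.writeLock().lock();
        try {
            String block;
            while ((block = neededReconstruction.poll()) != null) {
                work.add(block);
            }
        } finally {
            lock.writeLock().unlock();
        }

        whileUnlocked.run(); // stands in for choosing target locations

        lock.writeLock().lock();
        try {
            pendingReconstruction.addAll(work);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int pendingCount() {
        lock.readLock().lock();
        try {
            return pendingReconstruction.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Plays the role of the test thread and returns the pending count it
    // observes while the monitor thread is still between phases.
    static int observedPendingCount() {
        ReconstructionRaceSketch bm =
            new ReconstructionRaceSketch(Arrays.asList("blk_1", "blk_2"));
        CountDownLatch removed = new CountDownLatch(1);
        CountDownLatch observed = new CountDownLatch(1);

        // Step 1: the monitor drains "needed", releases the lock, and stalls.
        Thread monitor = new Thread(() -> bm.computeReconstructionWork(() -> {
            removed.countDown();
            try {
                observed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        monitor.start();

        try {
            removed.await();
            bm.computeReconstructionWork(() -> { }); // step 2: finds nothing
            int seen = bm.pendingCount();            // steps 3 and 4
            observed.countDown();                    // let the monitor finish
            monitor.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pending count seen by test: " + observedPendingCount());
    }
}
```

In this model the test thread always observes a pending count of 0, even 
though two blocks are in flight, because both blocks are held only in the 
monitor thread's local list while the lock is released.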

It is unclear to me whether this failure can happen outside of this test, 
since the test explicitly calls computeReconstructionWorkForBlocks, which is 
normally only called by the Replication Monitor. 

> TestPendingReconstruction#testPendingAndInvalidate is flaky due to race 
> condition
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-10618
>                 URL: https://issues.apache.org/jira/browse/HDFS-10618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.3-alpha
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> TestPendingReconstruction#testPendingAndInvalidate fails intermittently. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
