[
https://issues.apache.org/jira/browse/HDFS-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375264#comment-15375264
]
Eric Badger commented on HDFS-10618:
------------------------------------
Inside the Replication Monitor,
BlockManager.computeReconstructionWorkForBlocks() removes blocks from
neededReconstruction, computes target locations for those blocks, and then
places them into pendingReconstruction. However, the write lock is released
before the locations are computed and is reacquired only to add the blocks to
pendingReconstruction, so for a short window the blocks are in neither
structure. testPendingAndInvalidate can expose this race condition because it
also indirectly calls BlockManager.computeReconstructionWorkForBlocks. The
following scenario outlines how this test can fail:
1. ReplicationMonitor calls computeReconstructionWorkForBlocks, removes blocks
from neededReconstruction, releases the write lock, and takes time computing
the locations for replication.
2. testPendingAndInvalidate calls computeReconstructionWorkForBlocks, sees
nothing in neededReconstruction, spends 0 time computing locations, adds
nothing to pendingReconstruction, and returns.
3. testPendingAndInvalidate calls updateState() and indirectly sets
pendingReconstructionBlocksCount to the current value of pendingReconstruction
(which is 0, since the Replication Monitor is still computing the block
locations and hasn't yet added the blocks to pendingReconstruction).
4. testPendingAndInvalidate checks the value of
pendingReconstructionBlocksCount via getPendingReconstructionBlocksCount() and
sees that it is 0, causing the associated assert to fail.
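The interleaving above can be reproduced deterministically in a small
standalone sketch. This is not the BlockManager code: the class, fields, and
method names below (ReconstructionRaceSketch, needed, pending, computeWork,
runScenario) are hypothetical stand-ins, and two latches are assumed in place
of the slow location computation so that the window between releasing and
reacquiring the write lock is forced open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the remove / unlock / compute / relock / add pattern.
class ReconstructionRaceSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Queue<String> needed = new ArrayDeque<>();  // stands in for neededReconstruction
    private final List<String> pending = new ArrayList<>();   // stands in for pendingReconstruction

    // Latches used only to force the interleaving described in the steps above.
    private final CountDownLatch drained = new CountDownLatch(1);
    private final CountDownLatch resume = new CountDownLatch(1);

    void addNeeded(String block) {
        lock.writeLock().lock();
        try { needed.add(block); } finally { lock.writeLock().unlock(); }
    }

    // Returns how many blocks this call moved from needed to pending.
    int computeWork(boolean pauseOutsideLock) {
        List<String> chosen = new ArrayList<>();
        lock.writeLock().lock();
        try {
            String b;
            while ((b = needed.poll()) != null) { chosen.add(b); }
        } finally { lock.writeLock().unlock(); }

        // Lock is NOT held here: the blocks are in neither collection.
        if (pauseOutsideLock) {
            drained.countDown();
            awaitQuietly(resume);  // stand-in for slow location computation
        }

        lock.writeLock().lock();
        try { pending.addAll(chosen); } finally { lock.writeLock().unlock(); }
        return chosen.size();
    }

    int pendingCount() {
        lock.readLock().lock();
        try { return pending.size(); } finally { lock.readLock().unlock(); }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Returns {scheduled by the "test" call, count seen in the window, count seen afterwards}.
    static int[] runScenario() {
        ReconstructionRaceSketch sketch = new ReconstructionRaceSketch();
        sketch.addNeeded("blk_1");

        Thread monitor = new Thread(() -> sketch.computeWork(true)); // step 1
        monitor.start();
        awaitQuietly(sketch.drained);           // monitor has drained needed and dropped the lock

        int scheduledByTest = sketch.computeWork(false); // step 2: finds nothing to do
        int seenInWindow = sketch.pendingCount();         // steps 3 and 4: observes 0

        sketch.resume.countDown();              // let the monitor finish
        try { monitor.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { scheduledByTest, seenInWindow, sketch.pendingCount() };
    }

    public static void main(String[] args) {
        int[] r = runScenario();
        // prints: scheduledByTest=0 seenInWindow=0 seenAfter=1
        System.out.println("scheduledByTest=" + r[0] + " seenInWindow=" + r[1] + " seenAfter=" + r[2]);
    }
}
```

In this sketch the second caller sees a pending count of 0 even though one
block is in flight, which mirrors the failing assert. The count only becomes 1
after the first caller reacquires the lock.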
It is unclear to me whether this failure can happen outside of this test,
since the test explicitly calls computeReconstructionWorkForBlocks, which is
normally only called by the Replication Monitor.
> TestPendingReconstruction#testPendingAndInvalidate is flaky due to race
> condition
> ---------------------------------------------------------------------------------
>
> Key: HDFS-10618
> URL: https://issues.apache.org/jira/browse/HDFS-10618
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.3-alpha
> Reporter: Eric Badger
> Assignee: Eric Badger
>
> TestPendingReconstruction#testPendingAndInvalidate fails intermittently.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)