[
https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547214#comment-15547214
]
Manoj Govindassamy commented on HDFS-10819:
-------------------------------------------
[~andrew.wang],
{quote} Also curious, would invalidation eventually fix this case, or is it
truly stuck? {quote}
* I find it totally stuck in this test case, since we have only 3 DataNodes and
the expected replication factor is also 3.
* Block invalidation was not going through, so the replica count never caught up.
The invalidation at the DataNode didn't go through because the volume that held
the block was already closed when it was removed (see the log below and the
sketch that follows it).
{noformat}
730 2016-10-04 15:52:30,709 WARN impl.FsDatasetImpl
(FsDatasetImpl.java:invalidate(1990)) - Volume
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
is closed, ignore the deletion task for block ReplicaBeingWritten,
blk_1073741825_1001, RBW
731 getNumBytes() = 512
732 getBytesOnDisk() = 512
733 getVisibleLength()= 512
734 getVolume() =
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
735 getBlockFile() =
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-473099417-172.16.3.66-1475621545787/current/rbw/blk_1073741825
736 bytesAcked=512
737 bytesOnDisk=512
{noformat}
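For context on why the deletion task is skipped: the DataNode can no longer
obtain a reference to the removed volume. Below is a minimal sketch of that
guard; it is not the literal {{FsDatasetImpl#invalidate}} code and the class and
method names are made up for illustration, but {{FsVolumeSpi#obtainReference()}}
and {{ClosedChannelException}} are the real types involved.
{noformat}
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeReference;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustration only: how a deletion task gets dropped for a closed volume. */
class ClosedVolumeSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ClosedVolumeSketch.class);

  /** Returns false (and warns) if the volume backing the replica is closed. */
  static boolean canDelete(FsVolumeSpi volume, ExtendedBlock block) {
    try (FsVolumeReference ref = volume.obtainReference()) {
      // Volume is still open; the real code holds the reference while the
      // async deletion task removes the block and meta files.
      return true;
    } catch (ClosedChannelException e) {
      // This is the WARN seen in the log above: the volume was hot-swapped
      // out, so the deletion task for the replica is dropped.
      LOG.warn("Volume " + volume + " is closed, ignore the deletion task for"
          + " block " + block);
      return false;
    } catch (IOException e) {
      return false;
    }
  }
}
{noformat}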
The core fix here is to let {{BlockManager#addStoredBlockUnderConstruction}}
invoke {{addStoredBlock}} for all FINALIZED blocks, and let {{addStoredBlock}}
decide on the follow-up actions it already performs today, such as invalidation
and removal of corrupt replicas, roughly as sketched below.
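A simplified sketch of {{BlockManager#addStoredBlockUnderConstruction}} after
such a change (illustration only, not the literal patch; the exact signatures
and surrounding logic in {{BlockManager}} may differ):
{noformat}
// Sketch only -- simplified from BlockManager; not the literal patch.
void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
    DatanodeStorageInfo storageInfo) throws IOException {
  BlockInfo block = ucBlock.storedBlock;
  block.getUnderConstructionFeature().addReplicaIfNotPresent(
      storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);

  // Proposed: call addStoredBlock() for every FINALIZED replica report,
  // rather than only when the reporting storage is not yet recorded for
  // the block, so that addStoredBlock() can perform its usual follow-ups
  // (e.g. clearing the replica from the corrupt/invalidation bookkeeping
  // and updating the live replica count).
  if (ucBlock.reportedState == ReplicaState.FINALIZED) {
    addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
  }
}
{noformat}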
[~andrew.wang], [~eddyxu], I would like to hear your further thoughts on this.
> BlockManager fails to store a good block for a datanode storage after it
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10819
> URL: https://issues.apache.org/jira/browse/HDFS-10819
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
> Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test
> testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can run into
> issues such as timeouts or an unreachable datanode; in this test case the
> failure is deliberately induced by removing one of a datanode's volumes while
> a block write is in progress. Digging further into the logs, when the problem
> occurs in the write pipeline, error recovery does not happen as expected, and
> block replication never catches up.
> Though this problem has the same signature as HDFS-10780, the logs show that
> the code paths taken are totally different, so the root cause could be
> different as well.