[
https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459764#comment-15459764
]
Andrew Wang commented on HDFS-10819:
------------------------------------
Thanks for working on this, Manoj. Great investigation here.
IIUC this is going to be a problem mostly for small clusters, right? We need to
have a collision between two genstamps of the same block.
Would this also be addressed by having the NN first invalidate the corrupt
replica before replicating the correct one? I'm wondering if the safer fix is
to wait for this invalidation by excluding nodes with corrupt replicas when
doing block placement.
Also curious, would invalidation eventually fix this case, or is it truly
stuck? That seems like another bug we should address.
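
The placement idea above (skip any node that still holds a corrupt replica until the invalidation has gone through) can be sketched as a toy model. This is not Hadoop code: `Replica`, `Node`, `choose_targets`, and `invalidate` are hypothetical names, the genstamp values are made up, and the model assumes a three-node cluster with replication factor 3, where the only node missing a good replica is the one holding the corrupt copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    block_id: int
    genstamp: int          # generation stamp of this copy (illustrative values)
    corrupt: bool = False


@dataclass
class Node:
    name: str
    # block_id -> replica stored on this node (hypothetical bookkeeping)
    replicas: Dict[int, Replica] = field(default_factory=dict)


def choose_targets(nodes: List[Node], block_id: int, wanted: int) -> List[Node]:
    """Return up to `wanted` nodes that hold no replica of `block_id`.

    A node with a corrupt replica of the block is excluded just like a node
    with a good one, so re-replication waits until that copy is invalidated.
    """
    eligible = [n for n in nodes if block_id not in n.replicas]
    return eligible[:wanted]


def invalidate(node: Node, block_id: int) -> None:
    """Drop the replica of `block_id` from `node` if it is marked corrupt."""
    replica = node.replicas.get(block_id)
    if replica is not None and replica.corrupt:
        del node.replicas[block_id]


# Small cluster: two good replicas, one corrupt replica with an older genstamp.
dn1 = Node("dn1", {1: Replica(1, genstamp=1002)})
dn2 = Node("dn2", {1: Replica(1, genstamp=1001, corrupt=True)})
dn3 = Node("dn3", {1: Replica(1, genstamp=1002)})
cluster = [dn1, dn2, dn3]

# Before invalidation no node is eligible, so nothing is scheduled.
print([n.name for n in choose_targets(cluster, 1, 1)])  # prints []

# Once the corrupt copy is gone, dn2 becomes a valid target again.
invalidate(dn2, 1)
print([n.name for n in choose_targets(cluster, 1, 1)])  # prints ['dn2']
```

In this model the block stays under-replicated only until the invalidation completes, which is why the question of whether invalidation eventually happens matters for this approach.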
> BlockManager fails to store a good block for a datanode storage after it
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10819
> URL: https://issues.apache.org/jira/browse/HDFS-10819
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
> Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test
> testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can hit
> problems such as timeouts or unreachable datanodes; in this test case the
> failure is induced, because one of the volumes in a datanode is removed
> while a block write is in progress. The logs show that when the problem
> occurs in the write pipeline, error recovery does not happen as expected,
> so block replication never catches up.
> Although this problem has the same signature as HDFS-10780, the logs
> suggest that the code paths taken are entirely different, so the root cause
> could be different as well.