[
https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547214#comment-15547214
]
Manoj Govindassamy commented on HDFS-10819:
-------------------------------------------
[~andrew.wang],
{quote} Also curious, would invalidation eventually fix this case, or is it
truly stuck? {quote}
* I find it totally stuck in this test case, since we have only 3 DataNodes and
the expected replication factor is also 3.
* Block invalidation was not going through, so the replica count never caught up.
The invalidation at the DataNode didn't go through because the volume that held
the block was already closed when it was removed (see the log below and the
sketch that follows it).
{noformat}
730 2016-10-04 15:52:30,709 WARN impl.FsDatasetImpl
(FsDatasetImpl.java:invalidate(1990)) - Volume
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
is closed, ignore the deletion task for block ReplicaBeingWritten,
blk_1073741825_1001, RBW
731 getNumBytes() = 512
732 getBytesOnDisk() = 512
733 getVisibleLength()= 512
734 getVolume() =
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
735 getBlockFile() =
/Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-473099417-172.16.3.66-1475621545787/current/rbw/blk_1073741825
736 bytesAcked=512
737 bytesOnDisk=512
{noformat}
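For context on why the deletion task is skipped: the DataNode can no longer
obtain a reference to the removed volume. Below is a minimal sketch of that
guard; it is not the literal {{FsDatasetImpl#invalidate}} code and the class and
method names are made up for illustration, but {{FsVolumeSpi#obtainReference()}}
and {{ClosedChannelException}} are the real types involved.
{noformat}
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeReference;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustration only: how a deletion task gets dropped for a closed volume. */
class ClosedVolumeSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ClosedVolumeSketch.class);

  /** Returns false (and warns) if the volume backing the replica is closed. */
  static boolean canDelete(FsVolumeSpi volume, ExtendedBlock block) {
    try (FsVolumeReference ref = volume.obtainReference()) {
      // Volume is still open; the real code holds the reference while the
      // async deletion task removes the block and meta files.
      return true;
    } catch (ClosedChannelException e) {
      // This is the WARN seen in the log above: the volume was hot-swapped
      // out, so the deletion task for the replica is dropped.
      LOG.warn("Volume " + volume + " is closed, ignore the deletion task for"
          + " block " + block);
      return false;
    } catch (IOException e) {
      return false;
    }
  }
}
{noformat}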
The core fix here is to let {{BlockManager#addStoredBlockUnderConstruction}}
invoke {{addStoredBlock}} for all FINALIZED blocks, and let {{addStoredBlock}}
decide on the follow-up actions it already performs today, such as invalidation
and removal of corrupt replicas, roughly as sketched below.
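A simplified sketch of {{BlockManager#addStoredBlockUnderConstruction}} after
such a change (illustration only, not the literal patch; the exact signatures
and surrounding logic in {{BlockManager}} may differ):
{noformat}
// Sketch only -- simplified from BlockManager; not the literal patch.
void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
    DatanodeStorageInfo storageInfo) throws IOException {
  BlockInfo block = ucBlock.storedBlock;
  block.getUnderConstructionFeature().addReplicaIfNotPresent(
      storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);

  // Proposed: call addStoredBlock() for every FINALIZED replica report,
  // rather than only when the reporting storage is not yet recorded for
  // the block, so that addStoredBlock() can perform its usual follow-ups
  // (e.g. clearing the replica from the corrupt/invalidation bookkeeping
  // and updating the live replica count).
  if (ucBlock.reportedState == ReplicaState.FINALIZED) {
    addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
  }
}
{noformat}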
[~andrew.wang], [~eddyxu], I would like to hear your further thoughts on this.
> BlockManager fails to store a good block for a datanode storage after it
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10819
> URL: https://issues.apache.org/jira/browse/HDFS-10819
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
> Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test
> testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can run into
> issues such as timeouts or an unreachable datanode; in this test case the
> failure is deliberately induced by removing one of a datanode's volumes while
> a block write is in progress. Digging further into the logs, when the problem
> occurs in the write pipeline, error recovery does not happen as expected, and
> block replication never catches up.
> Though this problem has the same signature as HDFS-10780, the logs show that
> the code paths taken are totally different, so the root cause could be
> different as well.