[ 
https://issues.apache.org/jira/browse/HDFS-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167200#comment-17167200
 ] 

Stephen O'Donnell commented on HDFS-15495:
------------------------------------------

There are a few things to think about here.

For non-EC blocks:

 * If a block is missing entirely, it will not block decommission, as it will 
not be on the DN in question and hence will not be checked.
 * If a block is already under-replicated, decommission should proceed OK, 
provided the block can be made perfectly replicated.
 * Decommission will block if there are not enough nodes in the cluster to make 
the blocks perfectly replicated - e.g. decommissioning 1 node from a 3-node 
cluster.
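The rules above can be sketched as a couple of predicates. This is an illustrative sketch only - the method and class names are hypothetical, not the actual DatanodeAdminManager API:

```java
/** Illustrative sketch of the non-EC decommission rules described above.
 *  Names are hypothetical; this is not the real HDFS code. */
public class DecommissionCheck {

    /** A block still blocks decommission while live replicas (excluding
     *  the decommissioning node) are below the expected replication. */
    static boolean blocksDecommission(int liveReplicas, int expectedReplicas) {
        return liveReplicas < expectedReplicas;
    }

    /** Decommission can only ever complete if enough other live nodes
     *  exist to host all expected replicas. */
    static boolean canEverComplete(int otherLiveNodes, int expectedReplicas) {
        return otherLiveNodes >= expectedReplicas;
    }

    public static void main(String[] args) {
        // Under-replicated block: 2 live of 3 expected -> still blocks.
        System.out.println(blocksDecommission(2, 3)); // true
        // 3-node cluster, decommission 1 node: only 2 targets for 3 replicas,
        // so decommission hangs.
        System.out.println(canEverComplete(2, 3));    // false
    }
}
```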

For EC, a missing block is more complicated. Consider a 6-3 EC file.

 * If 1 to 3 blocks are already lost, the file is still readable. If you 
decommission a host holding one of the remaining blocks, I think it will first 
reconstruct the missing 1 to 3 blocks, and then schedule a simple copy of the 
decommissioning node's block.
 * If more than 3 blocks are lost, the first step can never complete, so it 
will never get to the second step and will likely hang (I have not tested this 
myself yet).
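The recoverability threshold comes straight from the RS(6,3) parameters: any 6 of the 9 internal blocks suffice to rebuild the rest. A small sketch (not HDFS source) of that arithmetic:

```java
/** Illustrative sketch of RS-6-3 recoverability; not the real HDFS code. */
public class EcRecoverability {
    static final int DATA_UNITS = 6;   // RS-6-3: 6 data units
    static final int PARITY_UNITS = 3; // plus 3 parity = 9 internal blocks

    /** Reconstruction needs at least DATA_UNITS live internal blocks. */
    static boolean reconstructible(int liveInternalBlocks) {
        return liveInternalBlocks >= DATA_UNITS;
    }

    public static void main(String[] args) {
        // 3 of 9 lost: still readable and reconstructible.
        System.out.println(reconstructible(6)); // true
        // 5 of 9 lost (numLive=4, as in the quoted log): unrecoverable,
        // and decommission of a node holding a surviving block hangs.
        System.out.println(reconstructible(4)); // false
    }
}
```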

Looking at the code, I think the NN does not check whether there are sufficient 
EC block sources before it schedules the reconstruction work on a DN - it is 
left to the DN to figure that part out and fail the task.

It looks like we might need to do something a bit smarter in ErasureCodingWork 
to allow the block being decommissioned to be copied to a new DN even if EC 
reconstruction cannot happen. Something would also need to change in the 
decommission logic to notice that the file is corrupt, handle the local block 
anyway, and not wait for the file to become healthy.
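One possible shape for that fallback, sketched as a planning decision. This is purely hypothetical - the enum and method names are illustrative and do not reflect the real ErasureCodingWork internals:

```java
/** Hypothetical sketch of the suggested fallback; names are illustrative,
 *  not the actual ErasureCodingWork API. */
public class EcDecomPlanner {

    enum Action { RECONSTRUCT, SIMPLE_COPY }

    /** If enough live sources exist, reconstruct the block group first.
     *  Otherwise, fall back to a simple copy of the decommissioning node's
     *  internal block, so an unrecoverable group does not block decommission. */
    static Action plan(int liveSources, int dataUnits) {
        return liveSources >= dataUnits ? Action.RECONSTRUCT : Action.SIMPLE_COPY;
    }

    public static void main(String[] args) {
        // RS-6-3 with 6 live sources: normal reconstruction path.
        System.out.println(plan(6, 6)); // RECONSTRUCT
        // Only 4 live sources: reconstruction is impossible, copy the
        // local block elsewhere so decommission can still finish.
        System.out.println(plan(4, 6)); // SIMPLE_COPY
    }
}
```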

You could argue that decommission should not care about the health of the EC 
file at all - it should just ensure that any blocks on the decommissioning 
hosts are copied elsewhere before decommission can complete.

> Decommissioning a DataNode with corrupted EC files should not be blocked 
> indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15495
>                 URL: https://issues.apache.org/jira/browse/HDFS-15495
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: block placement, ec
>    Affects Versions: 3.0.0
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>
> Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: 
> HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, 
> HDFS-15186).
> When there's an EC file marked as corrupted on NN, if the admin tries to 
> decommission a DataNode having one of the remaining blocks of the corrupted 
> EC file, *the decom will never finish* unless the file is recovered by 
> putting the missing blocks back in:
> {code:title=The endless DatanodeAdminManager check loop, every 30s}
> 2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 
> 0 blocks so far this tick
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: 
> Processing Decommission In Progress node 127.0.1.7:5007
> 2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372036854775728_1013 numExpected=9, numLive=4
> 2020-07-23 16:36:12,806 INFO BlockStateChange: Block: 
> blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 127.0.1.12:5012 127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 
> , Current Datanode: 127.0.1.7:5007, Is current datanode decommissioning: 
> true, Is current datanode entering maintenance: false
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 
> 127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to 
> finish Decommission In Progress.
> 2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 
> blocks and 1 nodes this tick
> {code}
> "Corrupted" file here meaning the EC file doesn't have enough EC blocks in 
> the block group to be reconstructed. e.g. for {{RS-6-3-1024k}}, when there 
> are less than 6 blocks for an EC file, the file can no longer be retrieved 
> correctly.
> Will check on trunk as well soon.


