[ 
https://issues.apache.org/jira/browse/HDFS-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDFS-15495:
------------------------------
    Description: 
Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: 
HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, 
HDFS-15186).

When there's an EC file marked as corrupted on NN, if the admin tries to 
decommission a DataNode having one of the remaining blocks of the corrupted EC 
file, *the decom will never finish* unless the file is recovered by putting the 
missing blocks back in:

{code:title=The endless DatanodeAdminManager check loop, every 30s}
2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 0 
blocks so far this tick
2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Processing 
Decommission In Progress node 127.0.1.7:5007
2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block 
blk_-9223372036854775728_1013 numExpected=9, numLive=4
2020-07-23 16:36:12,806 INFO BlockStateChange: Block: 
blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, corrupt 
replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
maintenance replicas: 0, live entering maintenance replicas: 0, excess 
replicas: 0, Is Open File: false, Datanodes having this block: 127.0.1.12:5012 
127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 , Current 
Datanode: 127.0.1.7:5007, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 
127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to 
finish Decommission In Progress.
2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 
blocks and 1 nodes this tick
{code}

A "corrupted" file here means an EC file that doesn't have enough EC blocks 
left in its block group to be reconstructed. E.g. for {{RS-6-3-1024k}}, when 
fewer than 6 blocks remain for an EC file, the file can no longer be retrieved 
correctly.

  was:
Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: 
HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, 
HDFS-15186).

When there's an EC file marked as corrupted on NN, if the admin tries to 
decommission a DataNode having one of the remaining blocks of the corrupted EC 
file, *the decom will never finish* unless the file is recovered by putting the 
missing blocks back in:

{code:title=The endless DatanodeAdminManager check loop, every 30s}
2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 0 
blocks so far this tick
2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Processing 
Decommission In Progress node 127.0.1.7:5007
2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block 
blk_-9223372036854775728_1013 numExpected=9, numLive=4
2020-07-23 16:36:12,806 INFO BlockStateChange: Block: 
blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, corrupt 
replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
maintenance replicas: 0, live entering maintenance replicas: 0, excess 
replicas: 0, Is Open File: false, Datanodes having this block: 127.0.1.12:5012 
127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 , Current 
Datanode: 127.0.1.7:5007, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 
127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to 
finish Decommission In Progress.
2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 
blocks and 1 nodes this tick
{code}

A "corrupted" file here means an EC file that doesn't have enough EC blocks 
left in its block group to be reconstructed. E.g. for {{RS-6-3-1024k}}, when 
fewer than 6 blocks remain for an EC file, the file can no longer be retrieved 
correctly.

Will check on trunk as well soon.


> Decommissioning a DataNode with corrupted EC files should not be blocked 
> indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15495
>                 URL: https://issues.apache.org/jira/browse/HDFS-15495
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: block placement, ec
>    Affects Versions: 3.0.0
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>
> Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: 
> HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, 
> HDFS-15186).
> When there's an EC file marked as corrupted on NN, if the admin tries to 
> decommission a DataNode having one of the remaining blocks of the corrupted 
> EC file, *the decom will never finish* unless the file is recovered by 
> putting the missing blocks back in:
> {code:title=The endless DatanodeAdminManager check loop, every 30s}
> 2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 
> 0 blocks so far this tick
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: 
> Processing Decommission In Progress node 127.0.1.7:5007
> 2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372036854775728_1013 numExpected=9, numLive=4
> 2020-07-23 16:36:12,806 INFO BlockStateChange: Block: 
> blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 127.0.1.12:5012 127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 
> , Current Datanode: 127.0.1.7:5007, Is current datanode decommissioning: 
> true, Is current datanode entering maintenance: false
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 
> 127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to 
> finish Decommission In Progress.
> 2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 
> blocks and 1 nodes this tick
> {code}
> A "corrupted" file here means an EC file that doesn't have enough EC blocks 
> left in its block group to be reconstructed. E.g. for {{RS-6-3-1024k}}, when 
> fewer than 6 blocks remain for an EC file, the file can no longer be 
> retrieved correctly.
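
To make the arithmetic behind the log above concrete, here is a minimal Java sketch of why the check loop can never terminate. It is only an illustration under stated assumptions and is not the actual DatanodeAdminManager code: the class {{EcDecomSketch}} and its methods are hypothetical, the "sufficient" test is simplified to numLive >= numExpected, and it assumes one internal block is rebuilt per tick whenever reconstruction is possible at all. With RS-6-3, numLive=4 and 1 decommissioning replica, only 5 blocks are readable, which is below the 6 needed, so numLive can never grow.

```java
/** Hypothetical sketch; this is NOT Hadoop's DatanodeAdminManager logic. */
final class EcDecomSketch {

    /** An RS(d, p) block group can only be rebuilt if at least d internal blocks are readable. */
    static boolean canReconstruct(int dataUnits, int readableBlocks) {
        return readableBlocks >= dataUnits;
    }

    /** Simplifying assumption: the node may finish once live blocks reach the expected count. */
    static boolean isSufficient(int numExpected, int numLive) {
        return numLive >= numExpected;
    }

    /**
     * Simulates the periodic check. Returns the tick at which the block becomes
     * sufficient, or -1 if that does not happen within maxTicks.
     * Assumption: one missing internal block is rebuilt per tick when possible.
     */
    static int ticksUntilDone(int dataUnits, int parityUnits,
                              int numLive, int numDecommissioning, int maxTicks) {
        int numExpected = dataUnits + parityUnits;
        int live = numLive;
        for (int tick = 0; tick < maxTicks; tick++) {
            if (isSufficient(numExpected, live)) {
                return tick;
            }
            // Blocks on the decommissioning node are still readable sources.
            if (canReconstruct(dataUnits, live + numDecommissioning)) {
                live++;
            }
            // Otherwise nothing changes, and the next tick sees the same state.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Values from the log: RS-6-3, numLive=4, 1 decommissioning replica.
        System.out.println(ticksUntilDone(6, 3, 4, 1, 100)); // -1: stuck
        // A recoverable block group for comparison: 7 live + 1 decommissioning.
        System.out.println(ticksUntilDone(6, 3, 7, 1, 100)); // 2
    }
}
```

In this model the only ways out of the loop are restoring the missing blocks, as noted in the description, or changing the check so that an unrecoverable block group no longer blocks the node.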


