[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhao Yi Ming updated HDFS-14699:
--------------------------------
    Attachment: HDFS-14699.00.patch
                image-2019-08-20-19-58-35-123.png
                image-2019-08-20-19-58-51-872.png
        Labels: patch  (was: )
        Status: Patch Available  (was: In Progress)

The root cause is that when node.getNumberOfBlocksToBeReplicated() >= 
replicationStreamsHardLimit (default 4, configurable via the property 
dfs.namenode.replication.max-streams-hard-limit), the loop continues without 
updating numReplicas. numReplicas is later used in the scheduleReconstruction 
method to judge whether the block has enough replicas, and once it believes the 
EC block has enough replicas it removes the block from neededReconstruction, so 
the block never gets a chance to be reconstructed.
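
For illustration, below is a minimal, self-contained sketch of the loop shape 
described above (all names such as Storage, chooseSources and HARD_LIMIT are 
made up for this example; it is not the actual BlockManager code). The point is 
that a storage skipped by the continue contributes nothing to the bookkeeping, 
so the caller's "enough replicas" judgement is made on incomplete information:

{code:java}
import java.util.BitSet;
import java.util.List;

/** Hypothetical, simplified model of the source-selection loop described above. */
public class BusySkipSketch {

  /** A reported internal block replica: DN name, block index, current replication load. */
  record Storage(String datanode, int blockIndex, int blocksToBeReplicated) {}

  // Default of dfs.namenode.replication.max-streams-hard-limit.
  static final int HARD_LIMIT = 4;

  /** Collects the live internal-block indices, skipping busy DataNodes. */
  static BitSet chooseSources(List<Storage> storages) {
    BitSet liveIndices = new BitSet();
    for (Storage s : storages) {
      if (s.blocksToBeReplicated() >= HARD_LIMIT) {
        // Skipped with continue: this storage contributes nothing to the
        // bookkeeping, which is the behaviour described above.
        continue;
      }
      liveIndices.set(s.blockIndex());
    }
    return liveIndices;
  }

  public static void main(String[] args) {
    // Index 5 only exists on a DN that is already at the hard limit.
    List<Storage> storages = List.of(
        new Storage("dn1", 0, 0),
        new Storage("dn2", 5, HARD_LIMIT),   // busy, will be skipped
        new Storage("dn3", 7, 1));
    System.out.println("live indices seen by the caller: " + chooseSources(storages));
    // Prints {0, 7}: index 5 is invisible to the caller even though it exists.
  }
}
{code}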

I also found that the HDFS redundancy monitor does check and handle excess 
blocks, but it depends on numReplicas to mark an internal block as EXCESS. As 
the following snapshot shows, when the duplicate internal block is NOT marked 
as EXCESS, the monitor cannot handle it.

Snapshot of the problem: with duplicate internal blocks and one missing 
internal block, reconstruction is never triggered.

!image-2019-08-20-19-58-51-872.png!
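
To show why the duplicate matters for the "enough replicas" judgement, here is 
another small, purely illustrative example (not HDFS code): with a 10+2 policy, 
11 distinct indices plus 1 duplicate give a raw replica count of 12, so a 
count-based check believes the group is complete, while a check over distinct 
block indices correctly reports one internal block as missing.

{code:java}
import java.util.BitSet;
import java.util.List;

/** Illustrative only: count-based vs index-based redundancy check for a striped group. */
public class DupReplicaSketch {

  static final int DATA_BLOCKS = 10;
  static final int PARITY_BLOCKS = 2;
  static final int TOTAL_INDICES = DATA_BLOCKS + PARITY_BLOCKS;  // 12 for a 10-2 policy

  public static void main(String[] args) {
    // Reported internal-block indices from the scenario in the snapshot:
    // indices 0..10 are live, index 11 is missing, and index 3 is reported
    // twice because the recommissioned DN still holds its old copy.
    List<Integer> reportedIndices =
        List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3);

    int rawCount = reportedIndices.size();                // 12, inflated by the duplicate
    BitSet distinct = new BitSet(TOTAL_INDICES);
    reportedIndices.forEach(distinct::set);               // 11 distinct indices

    System.out.println("raw replica count         = " + rawCount);
    System.out.println("distinct internal indices = " + distinct.cardinality());
    System.out.println("count-based check says enough? " + (rawCount >= TOTAL_INDICES));
    System.out.println("index-based check says enough? "
        + (distinct.cardinality() >= TOTAL_INDICES));
  }
}
{code}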

> Erasure Coding: Cannot trigger reconstruction when there are duplicate 
> internal blocks and one internal block is missing
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.1.1
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>         Attachments: HDFS-14699.00.patch, image-2019-08-20-19-58-35-123.png, 
> image-2019-08-20-19-58-51-872.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the 
> same scenario as you described. Could we ask when and in which version the fix 
> will be merged? Thanks! Our testing steps follow; hope they are helpful. (The 
> following DNs hold the internal blocks used in the test.)
>  # We customized a new 10-2-1024k policy and used it on a path; now we have 
> 12 internal blocks (12 live blocks).
>  # Decommission one DN; after the decommission completes, we have 13 internal 
> blocks (12 live blocks and 1 decommissioned block).
>  # Then shut down one DN that does not hold the same block index as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block).
>  # After waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block).
>  # The EC then never reconstructs the missing block.
> We think this is a critical issue for using the EC function in a production 
> environment. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
