[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

Zhao Yi Ming (Jira) Sun, 08 Sep 2019 22:38:40 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925373#comment-16925373
 ]


Zhao Yi Ming commented on HDFS-14699:
-------------------------------------

[~ayushtkn] [~surendrasingh] Thanks for your review!

 

[~surendrasingh] 

1. I changed the issue title as your suggestion. Thanks!

2. The reason is the EC reconstruction work will not be controlled by the 
replicationStreamsHardLimit configuration, if move 
liveBlockIndices.add(blockIndex) before following block. If the DN is as more 
EC reconstruction works source node, all the reconstruction work need to read 
the data from it, the DN resource usage will be high, current fix make the DN 
under replicationStreamsHardLimit control, if the DN reach 
replicationStreamsHardLimit threshold, the DN will NOT be added in the source 
node list.

 

> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>         Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We are tried the EC function on 80 node cluster with hadoop 3.1.1, we hit the 
> same scenario as you said https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps, hope it can helpful.(following DNs have the 
> testing internal blocks)
>  # we customized a new 10-2-1024k policy and use it on a path, now we have 12 
> internal block(12 live block)
>  # decommission one DN, after the decommission complete. now we have 13 
> internal block(12 live block and 1 decommission block)
>  # then shutdown one DN which did not have the same block id as 1 
> decommission block, now we have 12 internal block(11 live block and 1 
> decommission block)
>  # after wait for about 600s (before the heart beat come) commission the 
> decommissioned DN again, now we have 12 internal block(11 live block and 1 
> duplicate block)
>  # Then the EC is not reconstruct the missed block
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

Reply via email to