[ https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920846#comment-16920846 ]

Zhao Yi Ming edited comment on HDFS-14699 at 9/2/19 1:15 PM:
-------------------------------------------------------------

[~ayushtkn] Thanks for your review! Regarding the UT part: I added the new 
test case testChooseSrcDatanodesWithDupEC to cover my fix. If you do not apply 
the patch, the new test case does not exist, so the UT passes.

Good point about the block index; I agree we do NOT need to recalculate it, 
and I will try to fix that in the next patch. However, we can NOT move 
liveBlockIndices.add(blockIndex) before the following block, because the EC 
reconstruction work would then no longer be limited by the 
replicationStreamsHardLimit configuration, and that would lead to high 
resource usage (CPU and memory) on the DNs.

```
if (node.getNumberOfBlocksToBeReplicated() >= replicationStreamsHardLimit) {
  continue;
}
```
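
To make the ordering concrete, here is a minimal, self-contained Java sketch. 
It is NOT the real BlockManager#chooseSourceDatanodes code; the Node type, the 
block indices, and the hard-limit value below are simplified stand-ins for 
illustration. It only shows that keeping liveBlockIndices.add(blockIndex) 
after the replicationStreamsHardLimit check means a DN that is already at the 
hard limit is skipped before it can be recorded as a reconstruction source:

```
import java.util.ArrayList;
import java.util.List;

public class ChooseSrcOrderingSketch {

  // Simplified stand-in for a DatanodeDescriptor.
  static class Node {
    final String name;
    final int blocksToBeReplicated; // pending reconstruction work on this DN
    Node(String name, int blocksToBeReplicated) {
      this.name = name;
      this.blocksToBeReplicated = blocksToBeReplicated;
    }
    int getNumberOfBlocksToBeReplicated() {
      return blocksToBeReplicated;
    }
  }

  public static void main(String[] args) {
    int replicationStreamsHardLimit = 4;        // assumed value for the sketch
    List<Node> storages = List.of(new Node("dn1", 2), new Node("dn2", 6));
    int[] blockIndexOf = {0, 1};                // hypothetical indices on dn1, dn2

    List<Integer> liveBlockIndices = new ArrayList<>();
    List<Node> srcNodes = new ArrayList<>();

    for (int i = 0; i < storages.size(); i++) {
      Node node = storages.get(i);
      int blockIndex = blockIndexOf[i];

      // Keeping this check BEFORE liveBlockIndices.add(blockIndex) is the
      // point above: a DN over the hard limit is skipped and never gets
      // selected for more reconstruction work.
      if (node.getNumberOfBlocksToBeReplicated() >= replicationStreamsHardLimit) {
        continue;
      }

      liveBlockIndices.add(blockIndex);
      srcNodes.add(node);
    }

    System.out.println("liveBlockIndices = " + liveBlockIndices); // [0]
    System.out.println("srcNodes chosen  = " + srcNodes.size());  // 1
  }
}
```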


> Erasure Coding: Can NOT trigger the reconstruction when have the dup internal 
> blocks and missing one internal block
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>         Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> image-2019-08-20-19-58-51-872.png, image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs 
> hold the internal blocks under test.)
>  # We customized a new 10-2-1024k policy and used it on a path; now we have 
> 12 internal blocks (12 live blocks).
>  # We decommissioned one DN. After the decommission completed, we have 13 
> internal blocks (12 live blocks and 1 decommissioned block).
>  # We then shut down one DN that did not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block).
>  # After waiting about 600s (before the heartbeat came), we recommissioned 
> the decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block).
>  # EC then does not reconstruct the missing block.
> We think this is a critical issue for using the EC function in a production 
> environment. Could you help? Thanks a lot!


