[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927177#comment-16927177
 ] 

HuangTao edited comment on HDFS-14699 at 9/11/19 2:08 AM:
----------------------------------------------------------

{quote}3. then shut down one DN that did not hold the same block id as the decommissioned block; now we have 12 internal blocks (11 live blocks and 1 decommissioned block)
4. after waiting about 600s (before the heartbeat comes), recommission the decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 duplicate block)
{quote}

{code:java}
// 
src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:2314
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager#checkReplicaOnStorage
{code}
The method referenced above is what populates numReplicas.
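To make that counting concrete, here is a minimal, self-contained sketch (simplified stand-in types and flags, not the real BlockManager/NumberReplicas code, which also handles states like EXCESS, MAINTENANCE and stale storages): each replica found on a storage is classified into exactly one state and that state's counter is incremented.

{code:java}
import java.util.EnumMap;
import java.util.Map;

public class ReplicaCountingSketch {
  enum ReplicaState { LIVE, DECOMMISSIONING, DECOMMISSIONED, CORRUPT, REDUNDANT }

  /** Simplified stand-in for NumberReplicas: one counter per state. */
  static class ReplicaCounters {
    private final Map<ReplicaState, Integer> counts = new EnumMap<>(ReplicaState.class);
    void add(ReplicaState s, int n)      { counts.merge(s, n, Integer::sum); }
    void subtract(ReplicaState s, int n) { counts.merge(s, -n, Integer::sum); }
    int get(ReplicaState s)              { return counts.getOrDefault(s, 0); }
  }

  /** Hypothetical per-storage flags; the real code reads them off the DN descriptor. */
  record Storage(boolean corrupt, boolean decommissioned, boolean decommissioning) {}

  /** Classify one replica and bump the matching counter (simplified). */
  static ReplicaState checkReplicaOnStorage(ReplicaCounters counters, Storage s) {
    final ReplicaState state;
    if (s.corrupt())              state = ReplicaState.CORRUPT;
    else if (s.decommissioned())  state = ReplicaState.DECOMMISSIONED;
    else if (s.decommissioning()) state = ReplicaState.DECOMMISSIONING;
    else                          state = ReplicaState.LIVE;
    counters.add(state, 1);  // every replica bumps exactly one state counter first
    return state;
  }

  public static void main(String[] args) {
    ReplicaCounters c = new ReplicaCounters();
    checkReplicaOnStorage(c, new Storage(false, false, false)); // a live replica
    checkReplicaOnStorage(c, new Storage(false, true, false));  // the decommissioned one
    System.out.println("live=" + c.get(ReplicaState.LIVE)
        + " decommissioned=" + c.get(ReplicaState.DECOMMISSIONED));
  }
}
{code}
In the striped-block counting path, right after that per-storage classification, the dedup check quoted next runs: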

{code:java}
          if (!bitSet.get(blockIndex)) {
            bitSet.set(blockIndex);
          } else if (state == StoredReplicaState.LIVE) {
            numReplicas.subtract(StoredReplicaState.LIVE, 1);
            numReplicas.add(StoredReplicaState.REDUNDANT, 1);
          }
{code}
I think this block exists to correct numReplicas when some nodes are being decommissioned; it has nothing to do with the over-hard-limit case.

I think we should reconstruct the "1 decommission block" without using the over-hard-limit srcNode as a source, so I still have doubts about this fix.


> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>         Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps, hope they are helpful. (The following DNs hold 
> the internal blocks under test.)
>  # we created a custom 10-2-1024k policy and applied it to a path; now we have 
> 12 internal blocks (12 live blocks)
>  # decommission one DN; after the decommission completes, we have 13 internal 
> blocks (12 live blocks and 1 decommissioned block)
>  # then shut down one DN that did not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block)
>  # after waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block)
>  # then EC does not reconstruct the missing block
> We think this is a critical issue for using the EC function in a production 
> environment. Could you help? Thanks a lot!



