[ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920827#comment-16920827
 ] 

Ayush Saxena commented on HDFS-14699:
-------------------------------------

[~zhaoyim] I am talking about the code. Let me try to be clearer.
* The unit test you wrote is meant to check the scenario you are reporting 
and fixing. Ideally that UT should fail without your fix and pass once the 
fix is applied. However, the UT you wrote still passes even if I remove your 
fix, which means it doesn't verify the scenario. If you remove your fix, keep 
just the UT, and run it, it passes, whereas without your fix it should 
ideally fail.
* The if block I am talking about is:

{code:java}
      if(isStriped || srcNodes.isEmpty()) {
        srcNodes.add(node);
        if (isStriped) {
          byte blockIndex = ((BlockInfoStriped) block).
              getStorageBlockIndex(storage);
          liveBlockIndices.add(blockIndex);
          if (!bitSet.get(blockIndex)) {
            bitSet.set(blockIndex);
          } else if (state == StoredReplicaState.LIVE) {
            numReplicas.subtract(StoredReplicaState.LIVE, 1);
            numReplicas.add(StoredReplicaState.REDUNDANT, 1);
          }
        }
        continue;
      }
{code}

You pulled up only part of it, leaving behind 
{{liveBlockIndices.add(blockIndex);}}, for which we then have to recalculate 
the block index. Can we not pull up the whole if block, including that line, 
above:

{code:java}
      if (node.getNumberOfBlocksToBeReplicated() >=
          replicationStreamsHardLimit) {
        continue;
      }
{code}

Or have you left it below for some specific reason? If not, we can move the 
whole block above.
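
To illustrate, here is a rough sketch of the ordering I mean, assuming the rest 
of the loop in {{BlockManager#chooseSourceDatanodes}} stays as it is (this is 
just a sketch of the suggestion, not the actual patch):

{code:java}
      // Sketch of the suggestion only: the whole striped/first-source block,
      // including liveBlockIndices.add(blockIndex), is placed before the
      // hard-limit check, so blockIndex is computed only once.
      if (isStriped || srcNodes.isEmpty()) {
        srcNodes.add(node);
        if (isStriped) {
          byte blockIndex = ((BlockInfoStriped) block).
              getStorageBlockIndex(storage);
          liveBlockIndices.add(blockIndex);
          if (!bitSet.get(blockIndex)) {
            bitSet.set(blockIndex);
          } else if (state == StoredReplicaState.LIVE) {
            // duplicate index: count it as REDUNDANT instead of LIVE
            numReplicas.subtract(StoredReplicaState.LIVE, 1);
            numReplicas.add(StoredReplicaState.REDUNDANT, 1);
          }
        }
        continue;
      }

      // hard-limit check now comes after the block above
      if (node.getNumberOfBlocksToBeReplicated() >=
          replicationStreamsHardLimit) {
        continue;
      }
{code}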


> Erasure Coding: Can NOT trigger the reconstruction when have the dup internal 
> blocks and missing one internal block
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>         Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> image-2019-08-20-19-58-51-872.png, image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs 
> hold the internal blocks under test.)
>  # We customized a new 10-2-1024k policy and used it on a path, so we have 12 
> internal blocks (12 live blocks).
>  # Decommission one DN; after the decommission completes we have 13 internal 
> blocks (12 live blocks and 1 decommissioned block).
>  # Then shut down one DN which does not hold the same block id as the 
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioned block).
>  # After waiting about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block).
>  # Then EC does not reconstruct the missing block.
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!


