[ 
https://issues.apache.org/jira/browse/HDDS-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853882#comment-17853882
 ] 

GuoHao commented on HDDS-10985:
-------------------------------

Currently, if reconstruction of a single block in a container fails, 
reconstruction of the entire container fails. In the EC write path, the client 
decides whether a stripe write has succeeded. A block write can fail because 
writes to several replica indexes failed. Until OM triggers the open key 
cleanup, such a block is left orphaned on the datanodes. If it is present on 
fewer replica indexes than the minimum needed to reconstruct the stripe, EC 
reconstruction of the container is affected.
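
To make the "minimum number" condition concrete, here is a minimal sketch of 
the recoverability rule for a Reed-Solomon policy such as rs-10-3: a stripe 
can be rebuilt from any 10 of its 13 replica indexes. The class and method 
names are hypothetical and are not part of Ozone. The sketch also assumes a 
full stripe, ignoring small blocks for which some data indexes legitimately 
hold no data.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not Ozone code. */
final class EcStripeCheck {
  private EcStripeCheck() { }

  /**
   * Returns true if a block is present on enough distinct replica indexes
   * to be reconstructed. With Reed-Solomon(dataUnits, parityUnits), any
   * dataUnits of the dataUnits + parityUnits indexes are sufficient.
   */
  static boolean isRecoverable(int dataUnits, int parityUnits,
      Set<Integer> indexesHoldingBlock) {
    int total = dataUnits + parityUnits;
    // Count only indexes that are valid for this policy (1..total).
    long valid = indexesHoldingBlock.stream()
        .filter(i -> i >= 1 && i <= total)
        .count();
    return valid >= dataUnits;
  }

  public static void main(String[] args) {
    // Same shape as the log in this issue: index 11 missing, 12 sources left.
    Set<Integer> sources =
        new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13));
    System.out.println(isRecoverable(10, 3, sources)); // true

    // A partially written block that reached only 9 replica indexes.
    Set<Integer> partial =
        new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
    System.out.println(isRecoverable(10, 3, partial)); // false
  }
}
```

A block in the second state cannot be rebuilt, and as described above, one 
such block currently fails reconstruction for the whole container.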

The same situation may also arise during block deletion, when the replica 
indexes have progressed to different points.
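
For reference, the check that fails in the quoted stack trace compares the 
sizes of two chunk lists. The sketch below is a simplified stand-in for that 
Preconditions.checkArgument call, not the actual Ozone source. The class and 
method names are hypothetical, and the parameter names are taken from the 
issue title. Replicas that are at different points for the same block are one 
way the two lists might end up with different lengths.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the size check seen in the stack trace. */
final class ChunkCountCheck {
  private ChunkCountCheck() { }

  /** Throws IllegalArgumentException if the two chunk lists differ in size. */
  static void checkSameSize(List<?> currentChunks,
      List<?> checksumBlockDataChunks) {
    if (currentChunks.size() != checksumBlockDataChunks.size()) {
      throw new IllegalArgumentException(String.format(
          "The chunk list has %d entries, but the checksum chunks has %d "
              + "entries. They should be equal in size.",
          currentChunks.size(), checksumBlockDataChunks.size()));
    }
  }

  public static void main(String[] args) {
    // Placeholder chunk lists: 9 entries versus 10, as in the reported error.
    List<String> currentChunks = Collections.nCopies(9, "chunk");
    List<String> checksumChunks = Collections.nCopies(10, "chunk");
    try {
      checkSameSize(currentChunks, checksumChunks);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```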

cc [~sammichen] [~adoroszlai] [~sodonnell] [~siddhant] 

> EC Reconstruction failed because the size of currentChunks was not equal to 
> checksumBlockDataChunks
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-10985
>                 URL: https://issues.apache.org/jira/browse/HDDS-10985
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: EC
>            Reporter: LiMinyu
>            Priority: Critical
>
> EC reconstruction failed with the exception 
> *java.lang.IllegalArgumentException: The chunk list has 9 entries, but the 
> checksum chunks has 10 entries. They should be equal in size*. The DN hit 
> this problem while reconstructing EC data, and I found that it can occur 
> whether the missing block is a data block or a parity block.
> *EC Policy:* rs-10-3-2048k
> *DN.log:* 
> {code:java}
> 2024-06-06 18:20:17,837 [ContainerReplicationThread-12] WARN 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
>  FAILED reconstructECContainersCommand: containerID=876481, 
> replication=rs-10-3-2048k, missingIndexes=[11], 
> sources={1=5919f690-3871-45d2-b414-004292b3e2d3(10.175.134.153/10.175.134.153), 
> 2=718b671b-66ae-46eb-96fb-71411da7849d(10.175.134.172/10.175.134.172), 
> 3=e0ce60b3-75d5-4d00-bcb9-7781ef61e827(10.175.134.135/10.175.134.135), 
> 4=e9871cb6-44b0-4f39-ac8d-b04122dbd439(10.175.134.201/10.175.134.201), 
> 5=b9319384-2f73-4610-9e03-c6b67bbfab0b(10.175.134.217/10.175.134.217), 
> 6=9a0f6ff9-0772-4a1d-828e-96d3be50778c(10.175.134.199/10.175.134.199), 
> 7=8c0800ad-0026-4fdd-bd6e-6d866e166e49(10.175.137.25/10.175.137.25), 
> 8=24628bc9-5d7b-4310-a21f-9a35e2634fb4(10.175.134.200/10.175.134.200), 
> 9=c23a4a3c-183a-4baf-ada4-e30800faa907(10.175.134.219/10.175.134.219), 
> 10=c02658fa-898a-4406-a778-87653c2723c2(10.175.137.27/10.175.137.27), 
> 12=2a598049-6f33-4f18-a32a-f9d1f2ad399d(10.175.137.43/10.175.137.43), 
> 13=70cfa62e-5a7c-489e-bdf3-5527f9bb1679(10.175.134.203/10.175.134.203)}, 
> targets={11=099a12a7-e276-4ce0-bb3d-d915879ba4d9(10.175.138.92/10.175.138.92)}
>  after 316099 ms
> java.lang.IllegalArgumentException: The chunk list has 9 entries, but the 
> checksum chunks has 10 entries. They should be equal in size.
>         at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
>         at 
> org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:144)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:340)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:180)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
>         at 
> org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750) {code}
>  



