[
https://issues.apache.org/jira/browse/HDDS-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854640#comment-17854640
]
Shilun Fan commented on HDDS-10985:
-----------------------------------
[~Nicholas Niu] Thank you very much for your response! I believe the root
cause of this issue is that data for different EC stripes is transmitted in
parallel. We should ensure that EC stripes are sent sequentially, while the
chunks within a single stripe are still sent to the different DNs in parallel.
Therefore, I am considering setting the size of the RPC client's writeExecutor
to 1.
Rpc-Client
{code:java}
this.writeExecutor = MemoizedSupplier.valueOf(() -> createThreadPoolExecutor(
    WRITE_POOL_MIN_SIZE, Integer.MAX_VALUE, "client-write-TID-%d"));

private KeyOutputStream.Builder createKeyOutputStream(
    OpenKeySession openKey) {
  KeyOutputStream.Builder builder;
  ReplicationConfig replicationConfig =
      openKey.getKeyInfo().getReplicationConfig();
  StreamBufferArgs streamBufferArgs =
      StreamBufferArgs.getDefaultStreamBufferArgs(
          replicationConfig, clientConfig);
  if (replicationConfig.getReplicationType() ==
      HddsProtos.ReplicationType.EC) {
    builder = new ECKeyOutputStream.Builder()
        .setReplicationConfig((ECReplicationConfig) replicationConfig)
        .setByteBufferPool(byteBufferPool)
        .setS3CredentialsProvider(getS3CredentialsProvider());
  } else {
    builder = new KeyOutputStream.Builder()
        .setReplicationConfig(replicationConfig);
  }
  return builder.setHandler(openKey)
      .setXceiverClientManager(xceiverClientManager)
      .setOmClient(ozoneManagerClient)
      .enableUnsafeByteBufferConversion(unsafeByteBufferConversion)
      .setConfig(clientConfig)
      .setAtomicKeyCreation(isS3GRequest.get())
      .setClientMetrics(clientMetrics)
      .setExecutorServiceSupplier(writeExecutor)
      .setStreamBufferArgs(streamBufferArgs);
}
{code}
However, I am not sure whether this is a reasonable solution or whether it
might introduce other issues.
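To illustrate the intended ordering, here is a minimal, self-contained sketch (not the actual Ozone client code; the class and method names are hypothetical): a single-thread executor guarantees that stripe tasks complete strictly in submission order, while each stripe task still fans its chunk writes out in parallel on a separate pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SequentialStripeDemo {

  // Returns the order in which stripes finished writing.
  public static List<Integer> writeStripes(int stripeCount, int chunksPerStripe)
      throws InterruptedException {
    // Pool size 1: stripes are processed one at a time, in submission order.
    ExecutorService stripeExecutor = Executors.newSingleThreadExecutor();
    // Separate pool: chunks within a stripe still go out in parallel.
    ExecutorService chunkExecutor = Executors.newFixedThreadPool(chunksPerStripe);
    List<Integer> completionOrder = new ArrayList<>();

    for (int s = 0; s < stripeCount; s++) {
      final int stripe = s;
      stripeExecutor.submit(() -> {
        // Fan out the chunk writes of this stripe (stand-in for sending
        // each chunk to a different DN).
        CompletableFuture<?>[] chunkWrites = new CompletableFuture[chunksPerStripe];
        for (int c = 0; c < chunksPerStripe; c++) {
          chunkWrites[c] = CompletableFuture.runAsync(() -> { }, chunkExecutor);
        }
        // The stripe task only completes after all of its chunks are written,
        // so the next stripe cannot start before this one is done.
        CompletableFuture.allOf(chunkWrites).join();
        synchronized (completionOrder) {
          completionOrder.add(stripe);
        }
      });
    }

    stripeExecutor.shutdown();
    stripeExecutor.awaitTermination(10, TimeUnit.SECONDS);
    chunkExecutor.shutdown();
    return completionOrder;
  }

  public static void main(String[] args) throws InterruptedException {
    // Stripes finish strictly in submission order: [0, 1, 2, 3, 4]
    System.out.println(writeStripes(5, 3));
  }
}
```

This is only a sketch of the ordering argument; whether capping the real writeExecutor at one thread has acceptable throughput for non-EC writes is exactly the open question above.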
> EC Reconstruction failed because the size of currentChunks was not equal to
> checksumBlockDataChunks
> ---------------------------------------------------------------------------------------------------
>
> Key: HDDS-10985
> URL: https://issues.apache.org/jira/browse/HDDS-10985
> Project: Apache Ozone
> Issue Type: Bug
> Components: EC
> Reporter: LiMinyu
> Priority: Critical
>
> EC reconstruction failed with a *java.lang.IllegalArgumentException: The chunk
> list has 9 entries, but the checksum chunks has 10 entries. They should be
> equal in size* exception. The DN hit this problem while reconstructing the EC
> data, and I found that it can occur regardless of whether a data block or a
> parity block is missing.
> *EC Policy:* rs-10-3-2048k
> *DN.log:*
> {code:java}
> 2024-06-06 18:20:17,837 [ContainerReplicationThread-12] WARN
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
> FAILED reconstructECContainersCommand: containerID=876481,
> replication=rs-10-3-2048k, missingIndexes=[11], sources={1=5919f690
> -3871-45d2-b414-004292b3e2d3(10.175.134.153/10.175.134.153),
> 2=718b671b-66ae-46eb-96fb-71411da7849d(10.175.134.172/10.175.134.172),
> 3=e0ce60b3-75d5-4d00-bcb9-7781ef61e827(10.175.134.135/10.175.134.135),
> 4=e9871cb6-44b0-4f39-ac8d-b04122dbd439(10.175.134.201/10.175.134.201),
> 5=b9319384-2f73-4610-9e03-c6b67bbfab0b(10.175.134.217/10.175.134.217),
> 6=9a0f6ff9-0772-4a1d-828e-96d3be50778c(10.175.134.199/10.175.134.199),
> 7=8c0800ad-0026-4fdd-bd6e-6d866e166e49(10.175.137.25/10.175.137.25),
> 8=24628bc9-5d7b-4310-a21f-9a35e2634fb4(10.175.134.200/10.175.134.200),
> 9=c23a4a3c-183a-4baf-ada4-e30800faa907(10.175.134.219/10.175.134.219),
> 10=c02658fa-898a-4406-a778-87653c2723c2(10.175.137.27/10.175.137.27),
> 12=2a598049-6f33-4f18-a32a-f9d1f2ad399d(10.175.137.43/10.175.137.43),
> 13=70cfa62e-5a7c-489e-bdf3-5527f9bb1679(10.175.134.203/10.175.134.203)},
> targets={11=099a12a7-e276-4ce0-bb3d-d915879ba4d9(10.175.138.92/10.175.138.92)}
> after 316099 ms
> java.lang.IllegalArgumentException: The chunk list has 9 entries, but the
> checksum chunks has 10 entries. They should be equal in size.
> at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
> at
> org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:144)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:340)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:180)
> at
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
> at
> org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750) {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)