[ 
https://issues.apache.org/jira/browse/HDDS-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855003#comment-17855003
 ] 

Stephen O'Donnell commented on HDDS-10985:
------------------------------------------

[~slfan1989]

{quote}
1. ECKeyOutputStream#handleWrite uses a BlockingQueue: each stripe is written 
to ecStripeQueue, and a separate thread pool takes stripes from the queue and 
sends them, chunk by chunk, to the DNs in the pipeline.

2. DN data writing uses an asynchronous method 
(BlockOutputStream#writeChunkToContainer), which can improve efficiency.
{quote}
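As a rough illustration of the asynchronous write pattern described in (2), the sketch below submits several chunk writes concurrently and then waits for the whole batch before the block would be committed. All class and method names here are hypothetical stand-ins for illustration, not Ozone's actual API; the real call is BlockOutputStream#writeChunkToContainer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncChunkWriteSketch {

  // Hypothetical stand-in for an async DN chunk write: submit one chunk
  // and return a future, without blocking the caller.
  static CompletableFuture<Integer> writeChunkAsync(ExecutorService pool, int chunkId) {
    return CompletableFuture.supplyAsync(() -> chunkId, pool);
  }

  // Submit n chunk writes concurrently, then wait for the whole batch
  // before the block would be committed. Returns the number completed.
  static int writeAllChunks(int n) {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
      List<CompletableFuture<Integer>> pending = new ArrayList<>();
      for (int chunk = 0; chunk < n; chunk++) {
        pending.add(writeChunkAsync(pool, chunk));
      }
      // Block only once, for the whole stripe's worth of chunk writes.
      CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
      return pending.size();
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("completed=" + writeAllChunks(5));
  }
}
```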

For (1) - I believe there is a single thread that takes stripes from the 
ecStripeQueue. It should not take the next stripe until the previous one is 
committed and all the async calls to the DN have completed.

The code to start the queue thread is in ECKeyOutputStream:

{code}
    this.flushFuture = builder.getExecutorServiceSupplier().get().submit(() -> {
      s3CredentialsProvider.set(s3Auth);
      return flushStripeFromQueue();
    });
{code}

flushStripeFromQueue does not exit unless there is a problem or the stream is 
closed:

{code}
  private boolean flushStripeFromQueue() throws IOException {
    try {
      ECChunkBuffers stripe = ecStripeQueue.take();
      while (!closing && !(stripe instanceof EOFDummyStripe)) {
        if (stripe instanceof CheckpointDummyStripe) {
          flushCheckpoint.set(((CheckpointDummyStripe) stripe).version);
        } else {
          flushStripeToDatanodes(stripe);
          stripe.release();
        }
        stripe = ecStripeQueue.take();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while polling stripe from queue", e);
    }
    return true;
  }
{code}

From this, I am not sure whether there is a way for the putBlock calls to 
arrive at a DN out of order.
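The ordering argument above can be reduced to a small sketch (all names here are hypothetical; in Ozone the flush is flushStripeToDatanodes and the sentinel is EOFDummyStripe): because a single consumer thread takes a stripe, finishes flushing it, and only then takes the next, stripes reach the DNs strictly in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StripeOrderSketch {

  static final int EOF = -1; // sentinel, playing the role of EOFDummyStripe

  // Single consumer: take a stripe, flush it fully, only then take the next.
  // Since the previous flush must finish before the next take(), the per-stripe
  // DN calls cannot interleave across stripes.
  static List<Integer> flushInOrder(int nStripes) throws InterruptedException {
    BlockingQueue<Integer> stripeQueue = new ArrayBlockingQueue<>(8);
    List<Integer> flushed = new ArrayList<>();

    Thread flusher = new Thread(() -> {
      try {
        int stripe = stripeQueue.take();
        while (stripe != EOF) {
          // Stand-in for flushStripeToDatanodes: returns only after every
          // async DN call for this stripe has completed.
          flushed.add(stripe);
          stripe = stripeQueue.take();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    flusher.start();

    for (int s = 0; s < nStripes; s++) {
      stripeQueue.put(s);
    }
    stripeQueue.put(EOF);
    flusher.join();
    return flushed;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(flushInOrder(4));
  }
}
```

If this model matches the real code, out-of-order putBlock arrival would need some other path, e.g. retries or a second writer, rather than the queue itself.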

> EC Reconstruction failed because the size of currentChunks was not equal to 
> checksumBlockDataChunks
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-10985
>                 URL: https://issues.apache.org/jira/browse/HDDS-10985
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: EC
>            Reporter: LiMinyu
>            Priority: Critical
>
> EC reconstruction failed with a *java.lang.IllegalArgumentException: The 
> chunk list has 9 entries, but the checksum chunks has 10 entries. They should 
> be equal in size* exception. The DN hit this problem while reconstructing the 
> EC data, and I found that it can occur whether a data block or a parity block 
> is missing.
> *EC Policy:* rs-10-3-2048k
> *DN.log:* 
> {code:java}
> 2024-06-06 18:20:17,837 [ContainerReplicationThread-12] WARN 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
>  FAILED reconstructECContainersCommand: containerID=876481, 
> replication=rs-10-3-2048k, missingIndexes=[11], sources={1=5919f690
> -3871-45d2-b414-004292b3e2d3(10.175.134.153/10.175.134.153), 
> 2=718b671b-66ae-46eb-96fb-71411da7849d(10.175.134.172/10.175.134.172), 
> 3=e0ce60b3-75d5-4d00-bcb9-7781ef61e827(10.175.134.135/10.175.134.135), 
> 4=e9871cb6-44b0-4f39-ac8d-b04122dbd439(10.175.134.201/10.175.134.201), 
> 5=b9319384-2f73-4610-9e03-c6b67bbfab0b(10.175.134.217/10.175.134.217), 
> 6=9a0f6ff9-0772-4a1d-828e-96d3be50778c(10.175.134.199/10.175.134.199), 
> 7=8c0800ad-0026-4fdd-bd6e-6d866e166e49(10.175.137.25/10.175.137.25), 
> 8=24628bc9-5d7b-4310-a21f-9a35e2634fb4(10.175.134.200/10.175.134.200), 
> 9=c23a4a3c-183a-4baf-ada4-e30800faa907(10.175.134.219/10.175.134.219), 
> 10=c02658fa-898a-4406-a778-87653c2723c2(10.175.137.27/10.175.137.27), 
> 12=2a598049-6f33-4f18-a32a-f9d1f2ad399d(10.175.137.43/10.175.137.43), 
> 13=70cfa62e-5a7c-489e-bdf3-5527f9bb1679(10.175.134.203/10.175.134.203)}, 
> targets={11=099a12a7-e276-4ce0-bb3d-d915879ba4d9(10.175.138.92/10.175.138.92)}
>  after 316099 ms
> java.lang.IllegalArgumentException: The chunk list has 9 entries, but the 
> checksum chunks has 10 entries. They should be equal in size.
>         at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
>         at 
> org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:144)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:340)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:180)
>         at 
> org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
>         at 
> org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750) {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
