[
https://issues.apache.org/jira/browse/HDDS-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-6342:
---------------------------------
Labels: pull-request-available (was: )
> EC: Fix large write with multiple stripes upon stripe failure.
> --------------------------------------------------------------
>
> Key: HDDS-6342
> URL: https://issues.apache.org/jira/browse/HDDS-6342
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Mark Gui
> Assignee: Mark Gui
> Priority: Major
> Labels: pull-request-available
>
> Test with ockg
> ./bin/ozone freon ockg -p test -n 50 -t 8 -s $((500*1024*1024)) --type=EC --replication=rs-10-4-1024k
> {code:java}
> 2022-02-15 12:43:11,295 [pool-2-thread-7] ERROR freon.BaseFreonGenerator: Error on executing task 46
> java.lang.IllegalArgumentException
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:130)
>     at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:327)
>     at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:536)
>     at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:61)
>     at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:150)
>     at com.codahale.metrics.Timer.time(Timer.java:101)
>     at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:142)
>     at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
>     at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
>     at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> This happens only when a write fails during the parity write and more than
> one stripe has already been written to the current block group.
> In that case a new block group is picked for retrying the current stripe write,
> and the current block group should roll back its current position. The bug
> lies in the calculation of the acked length of the block group.
> Code references:
> {code:java}
> if (handleParityWrites(ecChunkSize, allocateBlockIfFull,
>     shouldClose) == StripeWriteStatus.FAILED) {
>   handleStripeFailure(numDataBlks * ecChunkSize, allocateBlockIfFull,
>       shouldClose);
> } else {
>   // At this stage stripe write is successful.
>   currentStreamEntry.updateBlockGroupToAckedPosition(
>       currentStreamEntry.getCurrentPosition());
> }
> {code}
> {code:java}
> private StripeWriteStatus rewriteStripeToNewBlockGroup(
>     int failedStripeDataSize, boolean allocateBlockIfFull, boolean close)
>     throws IOException {
>   long[] failedDataStripeChunkLens = new long[numDataBlks];
>   long[] failedParityStripeChunkLens = new long[numParityBlks];
>   final ByteBuffer[] dataBuffers = ecChunkBufferCache.getDataBuffers();
>   for (int i = 0; i < numDataBlks; i++) {
>     failedDataStripeChunkLens[i] = dataBuffers[i].limit();
>   }
>   final ByteBuffer[] parityBuffers = ecChunkBufferCache.getParityBuffers();
>   for (int i = 0; i < numParityBlks; i++) {
>     failedParityStripeChunkLens[i] = parityBuffers[i].limit();
>   }
>   blockOutputStreamEntryPool.getCurrentStreamEntry().resetToFirstEntry();
>   // Rollback the length/offset updated as part of this failed stripe write.
>   offset -= failedStripeDataSize;
>   blockOutputStreamEntryPool.getCurrentStreamEntry()
>       .resetToAckedPosition(); // <-- wrong position detected
>   ...
> }
> {code}
>
>
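The rollback described above can be illustrated with a small standalone sketch. It does not reproduce the Ozone client code or the actual fix: `AckedPositionSketch`, `BlockGroup`, and `simulate` are hypothetical names invented here, and the sketch only models the invariant the description implies, namely that after a failed stripe the block group's position must equal the total length of the stripes already acknowledged. The figures assume the `rs-10-4-1024k` scheme from the test command (10 data blocks, 1024 KiB chunks).

```java
public class AckedPositionSketch {

    /** Hypothetical stand-in for a block group's write bookkeeping (not Ozone code). */
    static final class BlockGroup {
        private long currentPosition; // bytes written so far, acked or not
        private long ackedPosition;   // bytes belonging to fully successful stripes

        void write(long len) {
            currentPosition += len;
        }

        /** Called after data and parity of a stripe were both written successfully. */
        void ackStripe() {
            ackedPosition = currentPosition;
        }

        /** Discards unacked bytes and returns how many were discarded. */
        long rollbackToAcked() {
            long discarded = currentPosition - ackedPosition;
            currentPosition = ackedPosition;
            return discarded;
        }

        long getCurrentPosition() {
            return currentPosition;
        }

        long getAckedPosition() {
            return ackedPosition;
        }
    }

    /**
     * Writes ackedStripes successful stripes, then one stripe that fails
     * during its parity write, and rolls the group back.
     * Returns {ackedPosition, currentPosition, discardedBytes}.
     */
    static long[] simulate(int numDataBlks, int chunkSize, int ackedStripes) {
        long stripeDataSize = (long) numDataBlks * chunkSize;
        BlockGroup group = new BlockGroup();
        for (int i = 0; i < ackedStripes; i++) {
            group.write(stripeDataSize);
            group.ackStripe();
        }
        group.write(stripeDataSize); // data written, parity write then fails
        long discarded = group.rollbackToAcked();
        return new long[] {
            group.getAckedPosition(), group.getCurrentPosition(), discarded};
    }

    public static void main(String[] args) {
        // 10 data blocks of 1024 KiB each, two stripes acked before the failure.
        long[] r = simulate(10, 1024 * 1024, 2);
        System.out.println("acked=" + r[0] + " current=" + r[1]
            + " discarded=" + r[2]);
    }
}
```

With two acked stripes the group should end at 20 MiB of acked data and discard exactly one stripe (10 MiB). The issue does not state how the real calculation goes wrong, so this only shows the expected end state.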