Stephen O'Donnell created HDDS-10682:
----------------------------------------
Summary: EC Reconstruction creates empty chunks at the end of blocks with partial stripes
Key: HDDS-10682
URL: https://issues.apache.org/jira/browse/HDDS-10682
Project: Apache Ozone
Issue Type: Bug
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell
Consider an EC block that is larger than one full stripe, but whose last stripe
is partial and therefore does not use all of the data indexes.
If a replica that has no data in that final stripe is reconstructed, an empty
chunk is written to the end of the block's chunk list.
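To make the layout concrete, here is a minimal sketch of the chunk lengths each data replica would be expected to hold. The class `EcChunkLayout` and its method are hypothetical and not part of Ozone; the sketch covers data indexes only and assumes data is striped cell by cell across the data replicas in index order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Ozone code): models per-replica chunk lengths of an EC block. */
public final class EcChunkLayout {
    private EcChunkLayout() {}

    /**
     * Returns the chunk lengths a data replica should hold.
     *
     * @param blockLength total bytes of user data in the block
     * @param chunkSize   EC chunk (cell) size in bytes
     * @param dataCount   number of data replicas, e.g. 3 for a 3+2 scheme
     * @param dataIndex   1-based index of the data replica
     */
    public static List<Long> chunkLengths(long blockLength, long chunkSize,
                                          int dataCount, int dataIndex) {
        long stripeSize = chunkSize * dataCount;
        long fullStripes = blockLength / stripeSize;
        long remainder = blockLength % stripeSize;

        List<Long> lengths = new ArrayList<>();
        // Every full stripe contributes one full chunk to each data replica.
        for (long i = 0; i < fullStripes; i++) {
            lengths.add(chunkSize);
        }
        // Bytes of the partial stripe that land in this replica's cell.
        long cellOffset = (long) (dataIndex - 1) * chunkSize;
        long last = Math.max(0L, Math.min(chunkSize, remainder - cellOffset));
        // Assumption: a replica with no data in the final stripe gets no entry,
        // rather than a zero-length chunk.
        if (last > 0) {
            lengths.add(last);
        }
        return lengths;
    }
}
```

For example, with a chunk size of 4, three data replicas and a block length of 30, indexes 1 and 2 hold three chunks each (the last one partial for index 2), while index 3 holds only two. Appending a third, zero-length chunk for index 3 is the situation described above.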
While the empty chunk does not cause any immediate problem, it can prevent
further reconstructions that attempt to read from this block; they fail with an
error like:
{code}
2024-04-09 01:06:21,855 ERROR [ec-reconstruct-reader-TID-4]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command GetBlock on the pipeline Pipeline[ Id: 7f6f1fc9-ed26-4e19-86b6-47435b027f6a, Nodes: 7f6f1fc9-ed26-4e19-86b6-47435b027f6a(ccycloud-4.quasar-jyswng.root.comops.site/10.140.150.0), ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:, CreationTimestamp2024-04-09T01:06:21.724509Z[UTC]].
2024-04-09 01:06:21,859 INFO [ContainerReplicationThread-1]-org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream: ECBlockReconstructedStripeInputStream{conID: 10007 locID: 113750153625610009}@756a3998: error reading [1], marked as failed
org.apache.hadoop.ozone.client.io.BadDataLocationException: java.io.IOException: Failed to get chunkInfo[77]: len == 0
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:644)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.lambda$loadDataBuffersFromStream$2(ECBlockReconstructedStripeInputStream.java:577)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.IOException: Failed to get chunkInfo[77]: len == 0
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.validate(BlockInputStream.java:278)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.lambda$static$0(BlockInputStream.java:265)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:407)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:347)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:342)
    at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:323)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:208)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$getBlock$0(ContainerProtocolCalls.java:186)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:146)
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:185)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:255)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:146)
    at org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:308)
    at org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:66)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readFromCurrentLocation(ECBlockReconstructedStripeInputStream.java:655)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readIntoBuffer(ECBlockReconstructedStripeInputStream.java:631)
    ... 5 more
{code}
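The trace suggests that the reader validates the chunk list returned by GetBlock and rejects any entry with a length of zero. The following is only a rough model of such a check, written to show why a trailing empty chunk makes the whole replica unreadable. The class `ChunkListCheck` is hypothetical and is not the actual `BlockInputStream.validate` code, and using the list position as the reported index is an assumption.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical model of the check implied by the "len == 0" message; not Ozone's code. */
public final class ChunkListCheck {
    private ChunkListCheck() {}

    /**
     * Rejects a chunk list that contains an empty (or negative-length) chunk.
     * Assumption: the index in the message is the position in the list.
     */
    public static void validate(List<Long> chunkLengths) throws IOException {
        for (int i = 0; i < chunkLengths.size(); i++) {
            long len = chunkLengths.get(i);
            if (len <= 0) {
                throw new IOException(
                    "Failed to get chunkInfo[" + i + "]: len == " + len);
            }
        }
    }
}
```

Under this model, a chunk list such as `[4, 4]` passes, while `[4, 4, 0]` fails at position 2 even though every byte of real data is intact.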
If other spare replicas are available, reconstruction will continue; otherwise
it will not be able to complete.
At this stage, I am not sure if this can affect reading a block via the normal
read path.