sodonnel opened a new pull request #3048:
URL: https://github.com/apache/ozone/pull/3048
## What changes were proposed in this pull request?
When reading a key smaller than 1 chunk with 3 of the 5 nodes stopped
(whether or not they have been marked stale yet), the read hangs for some time and then fails with:
```
$ ozone sh key get /vol1/bucket/ec1 /tmp/3_down
java.lang.IllegalStateException
    at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:33)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.selectParityIndexes(ECBlockReconstructedStripeInputStream.java:432)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.init(ECBlockReconstructedStripeInputStream.java:179)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedStripeInputStream.readStripe(ECBlockReconstructedStripeInputStream.java:285)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedInputStream.readStripe(ECBlockReconstructedInputStream.java:192)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedInputStream.selectNextBuffer(ECBlockReconstructedInputStream.java:109)
    at org.apache.hadoop.ozone.client.io.ECBlockReconstructedInputStream.read(ECBlockReconstructedInputStream.java:83)
    at org.apache.hadoop.ozone.client.io.ECBlockInputStreamProxy.read(ECBlockInputStreamProxy.java:156)
    at org.apache.hadoop.ozone.client.io.ECBlockInputStreamProxy.read(ECBlockInputStreamProxy.java:171)
    at org.apache.hadoop.ozone.client.io.ECBlockInputStreamProxy.read(ECBlockInputStreamProxy.java:141)
    at org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:57)
    at org.apache.hadoop.ozone.client.io.KeyInputStream.readWithStrategy(KeyInputStream.java:268)
    at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:235)
    at org.apache.hadoop.ozone.client.io.OzoneInputStream.read(OzoneInputStream.java:56)
    at java.base/java.io.InputStream.read(InputStream.java:205)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
    at org.apache.hadoop.ozone.shell.keys.GetKeyHandler.execute(GetKeyHandler.java:88)
    at org.apache.hadoop.ozone.shell.Handler.call(Handler.java:98)
    at org.apache.hadoop.ozone.shell.Handler.call(Handler.java:44)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
    at org.apache.hadoop.ozone.shell.OzoneShell.lambda$execute$0(OzoneShell.java:55)
    at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:159)
    at org.apache.hadoop.ozone.shell.OzoneShell.execute(OzoneShell.java:53)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
    at org.apache.hadoop.ozone.shell.OzoneShell.main(OzoneShell.java:47)
```
After the nodes are marked dead and their replicas are no longer present in SCM,
we get the expected error immediately:
```
ozone sh key get /vol1/bucket/ec1 /tmp/3_down_dead
There are insufficient datanodes to read the EC block
```
This issue is caused by a bug in the sufficientLocations check. A container
holding a small key (under 1 chunk) may also hold many other, larger keys, so
locations are reported for all of the data and parity blocks in the container.
However, the sufficientLocations method counts every available data block and
then adds the "padding only" blocks on top. The padding-only blocks are
therefore counted twice, and the sufficientLocations check passes when it
should fail.
This change ensures that when counting the available data blocks for a key,
we limit the count based on the number of expected blocks.
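
To make the double counting concrete, here is a minimal standalone sketch of the check before and after the change. It is not the actual Ozone code: the class, the method names, and the use of a plain set of available replica indexes are assumptions made for illustration. It models an EC 3+2 key of under 1 chunk (so 1 expected data block and 2 padding-only blocks) where the stopped nodes hold replica indexes 1, 4 and 5.

```java
import java.util.Set;

// Hypothetical illustration only; none of these names come from the Ozone code base.
public class SufficientLocationsSketch {

  // Assumed formula: number of data blocks that actually hold data for the key.
  static int expectedDataBlocks(long blockLength, int chunkSize, int dataNum) {
    return (int) Math.min(dataNum, (blockLength + chunkSize - 1) / chunkSize);
  }

  // Counts available replica indexes in the inclusive range [from, to].
  static int countAvailable(Set<Integer> available, int from, int to) {
    int count = 0;
    for (int i = from; i <= to; i++) {
      if (available.contains(i)) {
        count++;
      }
    }
    return count;
  }

  // Before the fix: every available data index is counted, and the
  // padding-only blocks are then added again on top.
  static boolean sufficientBeforeFix(Set<Integer> available, int dataNum,
      int parityNum, int expected) {
    int data = countAvailable(available, 1, dataNum);
    int parity = countAvailable(available, dataNum + 1, dataNum + parityNum);
    int padding = dataNum - expected;
    return data + parity + padding >= dataNum;
  }

  // After the fix: only the first `expected` data indexes are counted,
  // so padding-only blocks contribute exactly once.
  static boolean sufficientAfterFix(Set<Integer> available, int dataNum,
      int parityNum, int expected) {
    int data = countAvailable(available, 1, expected);
    int parity = countAvailable(available, dataNum + 1, dataNum + parityNum);
    int padding = dataNum - expected;
    return data + parity + padding >= dataNum;
  }

  public static void main(String[] args) {
    int dataNum = 3;
    int parityNum = 2;
    int expected = expectedDataBlocks(1000L, 1048576, dataNum); // 1
    Set<Integer> available = Set.of(2, 3); // indexes 1, 4 and 5 are down

    System.out.println(sufficientBeforeFix(available, dataNum, parityNum, expected)); // true
    System.out.println(sufficientAfterFix(available, dataNum, parityNum, expected));  // false
  }
}
```

In this sketch the pre-fix check sees 2 available data indexes plus 2 padding blocks (4 >= 3) and passes, although the only block holding real data and both parity blocks are unreachable. Limiting the data count to the expected blocks gives 0 + 0 + 2 = 2, which correctly fails.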
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6258
## How was this patch tested?
Reproduced the issue in a unit test and then fixed it.
Also confirmed on Docker Compose that the original error no longer occurs:
```
// The 3 DNs holding data and parity are stale
Datanode: 9175f19e-184f-4592-9588-d356fd8efb4e
(/default-rack/172.19.0.10/ozone_datanode_2.ozone_default/1 pipelines)
Operational State: IN_SERVICE
Health State: STALE
Related pipelines:
be8ae2ce-64c0-4e6c-b007-1527361e7663/EC/ECReplicationConfig{data=3,
parity=2, ecChunkSize=1048576, codec=rs}/EC/CLOSED/Follower
Datanode: 9bcae82f-6aba-4325-8af1-6d2746af6232
(/default-rack/172.19.0.2/ozone_datanode_4.ozone_default/1 pipelines)
Operational State: IN_SERVICE
Health State: STALE
Related pipelines:
be8ae2ce-64c0-4e6c-b007-1527361e7663/EC/ECReplicationConfig{data=3,
parity=2, ecChunkSize=1048576, codec=rs}/EC/CLOSED/Follower
Datanode: 61112c1b-78a7-4689-a1d2-8fc7aa324e64
(/default-rack/172.19.0.8/ozone_datanode_1.ozone_default/1 pipelines)
Operational State: IN_SERVICE
Health State: STALE
Related pipelines:
be8ae2ce-64c0-4e6c-b007-1527361e7663/EC/ECReplicationConfig{data=3,
parity=2, ecChunkSize=1048576, codec=rs}/EC/CLOSED/Follower
bash-4.2$ ozone admin container info 1
Container id: 1
Pipeline id: be8ae2ce-64c0-4e6c-b007-1527361e7663
Container State: CLOSING
Datanodes:
[9bcae82f-6aba-4325-8af1-6d2746af6232/ozone_datanode_4.ozone_default,
79c581c2-0339-481a-82a6-6908d00f0af2/ozone_datanode_3.ozone_default,
75244723-cb13-484b-a256-b0f04411c249/ozone_datanode_5.ozone_default,
61112c1b-78a7-4689-a1d2-8fc7aa324e64/ozone_datanode_1.ozone_default,
9175f19e-184f-4592-9588-d356fd8efb4e/ozone_datanode_2.ozone_default]
Replicas: [State: OPEN; ReplicaIndex: 5; Origin:
9175f19e-184f-4592-9588-d356fd8efb4e; Location:
9175f19e-184f-4592-9588-d356fd8efb4e/ozone_datanode_2.ozone_default,
State: OPEN; ReplicaIndex: 4; Origin: 61112c1b-78a7-4689-a1d2-8fc7aa324e64;
Location: 61112c1b-78a7-4689-a1d2-8fc7aa324e64/ozone_datanode_1.ozone_default,
State: OPEN; ReplicaIndex: 1; Origin: 9bcae82f-6aba-4325-8af1-6d2746af6232;
Location: 9bcae82f-6aba-4325-8af1-6d2746af6232/ozone_datanode_4.ozone_default]
bash-4.2$ ozone sh key get /vol1/bucket/key1 /tmp/test2
There are insufficient datanodes to read the EC block
```