[
https://issues.apache.org/jira/browse/HDDS-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Gui updated HDDS-6890:
---------------------------
Description:
We've encountered a potential issue on the read path when we have
over-replicated EC container replicas after Offline Recovery.
Here is the scenario:
# 20 DNs, EC 10+4
# A background script keep writing and reading keys.(freon ockg & ockv)
# A background script that randomly kills DNs (< 4 for each round), wait for
all containers to recover to the expected number of replicas(14). Then go for
next round.
# After there are 8 killed DNs, restart all dead DNs before.
# Then we have potentially over-replicated containers.
# At that time, there are concurrent ockv and container delete commands in the
cluster.
# And the ockv reports a digest mismatch during the short period of several
seconds.
# But after all the deletions are done and all containers return to a expected
number of replicas(14), the ockv command returns successfully.
Outputs for the ockv command below:
{code:java}
2022-06-10 13:21:27,923 [main] INFO storage.BlockInputStream: Unable to read
information for block conID: 12 locID: 109611004723202307 bcsId: 0 from
pipeline PipelineID=35db70e6-f362-4b3a-ab98-a89509547455: ContainerID 12 does
not exist
2022-06-10 13:21:28,613 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:29,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:30,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:31,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:32,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:33,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:33,978 [pool-2-thread-1] ERROR scm.XceiverClientGrpc: Failed
to execute command cmdType: GetBlock
traceID: ""
containerID: 12
datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
getBlock {
blockID {
containerID: 12
localID: 109611004723202307
blockCommitSequenceId: 0
}
}
on the pipeline Pipeline[ Id: 8151bbc7-672c-42d6-9c40-c80ce69d4b3f, Nodes:
5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports:
[REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
CreationTimestamp2022-06-10T13:21:33.921+08:00[Asia/Shanghai]].
2022-06-10 13:21:34,014 [pool-2-thread-1] INFO storage.BlockInputStream: Unable
to read information for block conID: 12 locID: 109611004723202307 bcsId: 0 from
pipeline PipelineID=8151bbc7-672c-42d6-9c40-c80ce69d4b3f: ContainerID 12 does
not exist
2022-06-10 13:21:34,142 [pool-2-thread-5] ERROR scm.XceiverClientGrpc: Failed
to execute command cmdType: GetBlock
traceID: ""
containerID: 12
datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
getBlock {
blockID {
containerID: 12
localID: 109611004723202316
blockCommitSequenceId: 0
}
}
on the pipeline Pipeline[ Id: f421fc76-eebb-4474-8c77-c57944144df6, Nodes:
5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports:
[REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
CreationTimestamp2022-06-10T13:21:34.074+08:00[Asia/Shanghai]].
2022-06-10 13:21:34,143 [pool-2-thread-5] INFO storage.BlockInputStream: Unable
to read information for block conID: 12 locID: 109611004723202316 bcsId: 0 from
pipeline PipelineID=f421fc76-eebb-4474-8c77-c57944144df6: ContainerID 12 does
not exist
2022-06-10 13:21:34,315 [pool-2-thread-7] ERROR scm.XceiverClientGrpc: Failed
to execute command cmdType: GetBlock
traceID: ""
containerID: 12
datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
getBlock {
blockID {
containerID: 12
localID: 109611004723202310
blockCommitSequenceId: 0
}
}
on the pipeline Pipeline[ Id: 21b59767-0ea9-4214-9f56-61fd31d795f2, Nodes:
5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97, ports:
[REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
CreationTimestamp2022-06-10T13:21:34.215+08:00[Asia/Shanghai]].
2022-06-10 13:21:34,315 [pool-2-thread-7] INFO storage.BlockInputStream: Unable
to read information for block conID: 12 locID: 109611004723202310 bcsId: 0 from
pipeline PipelineID=21b59767-0ea9-4214-9f56-61fd31d795f2: ContainerID 12 does
not exist
2022-06-10 13:21:35,402 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:36,418 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:37,419 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 % (0
out of 10)
2022-06-10 13:21:38,001 [pool-2-thread-5] ERROR freon.BaseFreonGenerator: Error
on executing task 4
java.lang.IllegalStateException: Reference (=first) message digest doesn't
match with digest of stripe_align_1654804614/4
at
org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateDigest(OzoneClientKeyValidator.java:169)
at
org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateKey(OzoneClientKeyValidator.java:133)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:184)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:164)
at
org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:147)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-06-10 13:21:38,419 [Thread-4] INFO freon.ProgressBar: Progress: 10.00 % (1
out of 10)
2022-06-10 13:21:39,419 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 %
(10 out of 10)
One ore more freon test is failed.
2022-06-10 13:21:39,823 [shutdown-hook-0] INFO metrics: type=TIMER,
name=key-validate, count=10, min=5918.240321, max=7710.826968,
mean=6891.752193143025, stddev=481.39096683740087, median=6798.957973,
p75=7302.544457, p95=7710.826968, p98=7710.826968, p99=7710.826968,
p999=7710.826968, mean_rate=1.1586365637687632, m1=0.0, m5=0.0, m15=0.0,
rate_unit=events/second, duration_unit=milliseconds
2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total
execution time (sec): 22
2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator:
Failures: 1
2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator:
Successful executions: 9 {code}
> EC: Potential read digest mismatch with over-replicated container.
> ------------------------------------------------------------------
>
> Key: HDDS-6890
> URL: https://issues.apache.org/jira/browse/HDDS-6890
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Mark Gui
> Priority: Major
>
> We've encountered a potential issue on the read path when we have
> over-replicated EC container replicas after Offline Recovery.
> Here is the scenario:
> # 20 DNs, EC 10+4
> # A background script keep writing and reading keys.(freon ockg & ockv)
> # A background script that randomly kills DNs (< 4 for each round), wait for
> all containers to recover to the expected number of replicas(14). Then go for
> next round.
> # After there are 8 killed DNs, restart all dead DNs before.
> # Then we have potentially over-replicated containers.
> # At that time, there are concurrent ockv and container delete commands in
> the cluster.
> # And the ockv reports a digest mismatch during the short period of several
> seconds.
> # But after all the deletions are done and all containers return to a
> expected number of replicas(14), the ockv command returns successfully.
> Outputs for the ockv command below:
> {code:java}
> 2022-06-10 13:21:27,923 [main] INFO storage.BlockInputStream: Unable to read
> information for block conID: 12 locID: 109611004723202307 bcsId: 0 from
> pipeline PipelineID=35db70e6-f362-4b3a-ab98-a89509547455: ContainerID 12 does
> not exist
> 2022-06-10 13:21:28,613 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:29,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:30,614 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:31,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:32,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:33,922 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:33,978 [pool-2-thread-1] ERROR scm.XceiverClientGrpc: Failed
> to execute command cmdType: GetBlock
> traceID: ""
> containerID: 12
> datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
> getBlock {
> blockID {
> containerID: 12
> localID: 109611004723202307
> blockCommitSequenceId: 0
> }
> }
> on the pipeline Pipeline[ Id: 8151bbc7-672c-42d6-9c40-c80ce69d4b3f, Nodes:
> 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97,
> ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
> STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
> ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
> CreationTimestamp2022-06-10T13:21:33.921+08:00[Asia/Shanghai]].
> 2022-06-10 13:21:34,014 [pool-2-thread-1] INFO storage.BlockInputStream:
> Unable to read information for block conID: 12 locID: 109611004723202307
> bcsId: 0 from pipeline PipelineID=8151bbc7-672c-42d6-9c40-c80ce69d4b3f:
> ContainerID 12 does not exist
> 2022-06-10 13:21:34,142 [pool-2-thread-5] ERROR scm.XceiverClientGrpc: Failed
> to execute command cmdType: GetBlock
> traceID: ""
> containerID: 12
> datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
> getBlock {
> blockID {
> containerID: 12
> localID: 109611004723202316
> blockCommitSequenceId: 0
> }
> }
> on the pipeline Pipeline[ Id: f421fc76-eebb-4474-8c77-c57944144df6, Nodes:
> 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97,
> ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
> STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
> ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
> CreationTimestamp2022-06-10T13:21:34.074+08:00[Asia/Shanghai]].
> 2022-06-10 13:21:34,143 [pool-2-thread-5] INFO storage.BlockInputStream:
> Unable to read information for block conID: 12 locID: 109611004723202316
> bcsId: 0 from pipeline PipelineID=f421fc76-eebb-4474-8c77-c57944144df6:
> ContainerID 12 does not exist
> 2022-06-10 13:21:34,315 [pool-2-thread-7] ERROR scm.XceiverClientGrpc: Failed
> to execute command cmdType: GetBlock
> traceID: ""
> containerID: 12
> datanodeUuid: "5b542d9f-5e91-4cc2-af48-af3c33b8c472"
> getBlock {
> blockID {
> containerID: 12
> localID: 109611004723202310
> blockCommitSequenceId: 0
> }
> }
> on the pipeline Pipeline[ Id: 21b59767-0ea9-4214-9f56-61fd31d795f2, Nodes:
> 5b542d9f-5e91-4cc2-af48-af3c33b8c472{ip: 9.27.142.97, host: 9.27.142.97,
> ports: [REPLICATION=3096, RATIS=3092, RATIS_ADMIN=3095, RATIS_SERVER=3094,
> STANDALONE=3093], networkLocation: /default-rack, certSerialId: null,
> persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0},
> ReplicationConfig: STANDALONE/ONE, State:CLOSED, leaderId:,
> CreationTimestamp2022-06-10T13:21:34.215+08:00[Asia/Shanghai]].
> 2022-06-10 13:21:34,315 [pool-2-thread-7] INFO storage.BlockInputStream:
> Unable to read information for block conID: 12 locID: 109611004723202310
> bcsId: 0 from pipeline PipelineID=21b59767-0ea9-4214-9f56-61fd31d795f2:
> ContainerID 12 does not exist
> 2022-06-10 13:21:35,402 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:36,418 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:37,419 [Thread-4] INFO freon.ProgressBar: Progress: 0.00 %
> (0 out of 10)
> 2022-06-10 13:21:38,001 [pool-2-thread-5] ERROR freon.BaseFreonGenerator:
> Error on executing task 4
> java.lang.IllegalStateException: Reference (=first) message digest doesn't
> match with digest of stripe_align_1654804614/4
> at
> org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateDigest(OzoneClientKeyValidator.java:169)
> at
> org.apache.hadoop.ozone.freon.OzoneClientKeyValidator.validateKey(OzoneClientKeyValidator.java:133)
> at
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:184)
> at
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:164)
> at
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:147)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2022-06-10 13:21:38,419 [Thread-4] INFO freon.ProgressBar: Progress: 10.00 %
> (1 out of 10)
> 2022-06-10 13:21:39,419 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 %
> (10 out of 10)
> One ore more freon test is failed.
> 2022-06-10 13:21:39,823 [shutdown-hook-0] INFO metrics: type=TIMER,
> name=key-validate, count=10, min=5918.240321, max=7710.826968,
> mean=6891.752193143025, stddev=481.39096683740087, median=6798.957973,
> p75=7302.544457, p95=7710.826968, p98=7710.826968, p99=7710.826968,
> p999=7710.826968, mean_rate=1.1586365637687632, m1=0.0, m5=0.0, m15=0.0,
> rate_unit=events/second, duration_unit=milliseconds
> 2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator:
> Total execution time (sec): 22
> 2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator:
> Failures: 1
> 2022-06-10 13:21:39,824 [shutdown-hook-0] INFO freon.BaseFreonGenerator:
> Successful executions: 9 {code}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]