[
https://issues.apache.org/jira/browse/HDDS-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Gui updated HDDS-6373:
---------------------------
Description:
When a container is closed because it is full, the DN replies to the client
with a ContainerNotOpenException. This does not mean that the DN has failed,
so the DN should not be excluded from new block group allocation. Otherwise
many HEALTHY DNs may end up excluded, and allocation of a new block group may
fail in a small cluster.
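The intended behaviour can be sketched roughly as follows. This is a minimal, self-contained model and not the actual Ozone client code: the class ExcludePolicySketch, the enum FailureKind, the holder ExcludeSet and the method recordFailure are all hypothetical names chosen for illustration. It only shows the idea from the title: a "container not open" reply excludes the pipeline, while a genuine node failure excludes the DN.
```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model; names are illustrative, not Ozone's client classes. */
public class ExcludePolicySketch {

  /** Why a write to a block group failed (illustrative categories only). */
  enum FailureKind { CONTAINER_NOT_OPEN, NODE_UNREACHABLE }

  /** What should be avoided when the next block group is allocated. */
  static final class ExcludeSet {
    final Set<String> datanodes = new HashSet<>();
    final Set<String> pipelines = new HashSet<>();
  }

  /**
   * A closed (for example full) container says nothing about DN health,
   * so only the pipeline is excluded. Other failures exclude the DN itself.
   */
  static void recordFailure(ExcludeSet exclude, FailureKind kind,
                            String pipelineId, String datanodeId) {
    if (kind == FailureKind.CONTAINER_NOT_OPEN) {
      exclude.pipelines.add(pipelineId);
    } else {
      exclude.datanodes.add(datanodeId);
    }
  }

  public static void main(String[] args) {
    ExcludeSet exclude = new ExcludeSet();
    recordFailure(exclude, FailureKind.CONTAINER_NOT_OPEN, "pipeline-1", "dn-7");
    recordFailure(exclude, FailureKind.NODE_UNREACHABLE, "pipeline-2", "dn-9");
    System.out.println("excluded pipelines: " + exclude.pipelines);
    System.out.println("excluded datanodes: " + exclude.datanodes);
  }
}
```
In this sketch dn-7 stays eligible for the next block group even though its container was closed; only dn-9, which was actually unreachable, is excluded.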
E.g.
45 DNs (docker simulated), ozone-site.xml:
<property>
<name>ozone.scm.container.size</name>
<value>256MB</value>
</property>
<property>
<name>ozone.scm.block.size</name>
<value>16MB</value>
</property>
Test with Freon ockg:
./bin/ozone freon ockg --type=EC --replication=rs-10-4-1024k -p test -n 10 -t 10 -s $((4 * 1024 * 1024 * 1024))
This results in 5-8 failures with HDDS-6364 patched.
{code:java}
INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Allocated 0 blocks. Requested 1 blocks
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:660)
    at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:695)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:309)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:371)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.rewriteStripeToNewBlockGroup(ECKeyOutputStream.java:244)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.handleStripeFailure(ECKeyOutputStream.java:586)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.checkAndWriteParityCells(ECKeyOutputStream.java:306)
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:192)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
    at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:146)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:143)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.lang.IllegalArgumentException: Expected writeOffset= 1069543424 Expected offset=1059061760
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
        at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:564)
        at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:61)
        at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:151)
        ... 8 more
One ore more freon test is failed.
2022-02-24 08:41:44,272 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=313491.661668, max=577254.304029, mean=563762.9508485134, stddev=44787.24799551536, median=575542.093982, p75=577254.304029, p95=577254.304029, p98=577254.304029, p99=577254.304029, p999=577254.304029, mean_rate=0.017322637056902915, m1=0.029562618662863496, m5=0.014855802773079099, m15=0.007191674083204336, rate_unit=events/second, duration_unit=milliseconds
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 578
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 6
2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 4
{code}
With this fix and HDDS-6364 applied together, all 10 executions succeed over
many rounds.
{code:java}
2022-02-24 10:56:45,013 [Thread-4] INFO freon.ProgressBar: Progress: 90.00 % (9 out of 10)
2022-02-24 10:56:46,013 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 % (10 out of 10)
2022-02-24 10:56:46,257 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=958022.893372, max=1038271.448129, mean=1018238.201558835, stddev=22083.604143242464, median=1029968.020144, p75=1034239.403617, p95=1038271.448129, p98=1038271.448129, p99=1038271.448129, p999=1038271.448129, mean_rate=0.009623163938983789, m1=0.09995782091693355, m5=0.02731461121892791, m15=0.009684867189776935, rate_unit=events/second, duration_unit=milliseconds
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 1040
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 0
2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 10
{code}
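To make the scale of the test setup concrete, the sizes in the configuration and Freon command can be worked through as below. This is only a back-of-the-envelope sketch: the class EcSizingSketch and its methods are hypothetical, and it assumes that an rs-10-4 block group stores 10 data blocks of ozone.scm.block.size each and needs 10 + 4 distinct DNs.
```java
/** Illustrative arithmetic only; the class and method names are hypothetical. */
public class EcSizingSketch {
  static final long MB = 1024L * 1024L;

  /** Number of full blocks that fit in one container of the given size. */
  static long blocksPerContainer(long containerSize, long blockSize) {
    return containerSize / blockSize;
  }

  /** Block groups needed for a key, assuming each group holds dataBlocks * blockSize of user data. */
  static long blockGroupsPerKey(long keySize, int dataBlocks, long blockSize) {
    long dataPerGroup = dataBlocks * blockSize;
    return (keySize + dataPerGroup - 1) / dataPerGroup; // ceiling division
  }

  /** DNs that can be excluded before a group needing (data + parity) distinct DNs no longer fits. */
  static int maxExcludableDatanodes(int totalDatanodes, int dataBlocks, int parityBlocks) {
    return totalDatanodes - (dataBlocks + parityBlocks);
  }

  public static void main(String[] args) {
    long containerSize = 256 * MB;           // ozone.scm.container.size
    long blockSize = 16 * MB;                // ozone.scm.block.size
    long keySize = 4L * 1024 * 1024 * 1024;  // -s $((4 * 1024 * 1024 * 1024))

    System.out.println("blocks per container: "
        + blocksPerContainer(containerSize, blockSize));
    System.out.println("block groups per key: "
        + blockGroupsPerKey(keySize, 10, blockSize));
    System.out.println("max excludable DNs:   "
        + maxExcludableDatanodes(45, 10, 4));
  }
}
```
Under these assumptions a container holds 16 blocks, each 4 GiB key needs 26 block groups, and a new block group can no longer be placed once more than 31 of the 45 DNs are excluded, which is consistent with the allocation failures described above.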
was:
When a container is closed because it is full, the DN replies to the client
with a ContainerNotOpenException. This does not mean that the DN has failed,
so the DN should not be excluded from new block group allocation. Otherwise
many HEALTHY DNs may end up excluded, and allocation of a new block group may
fail in a small cluster.
E.g.
45 DNs (docker simulated), ozone-site.xml:
<property>
<name>ozone.scm.container.size</name>
<value>256MB</value>
</property>
<property>
<name>ozone.scm.block.size</name>
<value>16MB</value>
</property>
Test with Freon ockg:
./bin/ozone freon ockg --type=EC --replication=rs-10-4-1024k -p test -n 10 -t 10 -s $((4 * 1024 * 1024 * 1024))
This results in 5-8 failures with HDDS-6364 patched.
With this fix and HDDS-6364 applied together, all 10 executions succeed over
many rounds.
> EC: Exclude pipeline upon container close instead of exclude DNs.
> -----------------------------------------------------------------
>
> Key: HDDS-6373
> URL: https://issues.apache.org/jira/browse/HDDS-6373
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Mark Gui
> Assignee: Mark Gui
> Priority: Major