[
https://issues.apache.org/jira/browse/HDDS-5032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen updated HDDS-5032:
-----------------------------
Description:
We have hit two cases of container loading exceptions. One, which is fixed by
HDDS-4722, throws a runtime exception. In the other, I backed up a container
directory under the name ContainerID-Backup, which triggers a badly formatted
container directory name exception.
The consequence in both cases is that the many remaining containers on the
same volume are not loaded. While the DN starts and runs healthily, SCM
treats all these container replicas as missing and starts to schedule many
replica replication tasks.
This task is to fix the issue: if loading a specific container throws an
exception, log it and go on to load the next container.
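The intended behaviour can be sketched roughly as below. This is only a
minimal illustration, not the actual ContainerReader code: the class
TolerantContainerLoader and the methods loadContainer and loadVolume are
hypothetical names, and the per-container "load" is reduced to parsing the
directory name so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; not the real Ozone ContainerReader.
class TolerantContainerLoader {
  private static final Logger LOG =
      Logger.getLogger(TolerantContainerLoader.class.getName());

  // Stand-in for the real per-container load. It only parses the directory
  // name as a container ID, so a name such as "1823-raw" throws
  // NumberFormatException, as in Case 1.
  static long loadContainer(String dirName) {
    return Long.parseLong(dirName);
  }

  // Loads every container directory of one volume. A failure on a single
  // container is logged and skipped, so the remaining containers on the
  // volume are still loaded.
  static List<Long> loadVolume(List<String> containerDirNames) {
    List<Long> loaded = new ArrayList<>();
    for (String dirName : containerDirNames) {
      try {
        loaded.add(loadContainer(dirName));
      } catch (RuntimeException e) {
        LOG.log(Level.SEVERE,
            "Failed to load container from " + dirName + ", skipping", e);
      }
    }
    return loaded;
  }

  public static void main(String[] args) {
    // The bad directory is logged; the other two containers are loaded.
    System.out.println(
        loadVolume(Arrays.asList("1822", "1823-raw", "1824")));
  }
}
```

The point of the sketch is only the placement of the try/catch: it sits
inside the per-container loop rather than around the whole volume scan.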
Case 1:
2021-03-12 20:46:16,420 [Thread-8] ERROR org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader: Caught a Run time exception during reading container files from Volume /data3/hdds/hdds {}
java.lang.NumberFormatException: For input string: "1823-raw"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.ozone.container.common.helpers.ContainerUtils.getContainerID(ContainerUtils.java:242)
    at org.apache.hadoop.ozone.container.common.helpers.ContainerUtils.getContainerFile(ContainerUtils.java:234)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.readVolume(ContainerReader.java:132)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.run(ContainerReader.java:91)
    at java.lang.Thread.run(Thread.java:748)
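The trace above shows Long.parseLong rejecting the directory name
"1823-raw". One possible way to recognise such names up front is sketched
below. It is an assumption for illustration only: the issue does not say
whether the fix validates names or simply catches the exception, and the
class ContainerDirNames and method parseContainerId are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical helper; not part of Ozone's ContainerUtils.
class ContainerDirNames {

  // Returns the container ID if dirName consists only of decimal digits
  // and fits in a long; otherwise returns an empty OptionalLong instead
  // of throwing.
  static OptionalLong parseContainerId(String dirName) {
    if (dirName == null || dirName.isEmpty()) {
      return OptionalLong.empty();
    }
    for (int i = 0; i < dirName.length(); i++) {
      char c = dirName.charAt(i);
      if (c < '0' || c > '9') {
        return OptionalLong.empty();
      }
    }
    try {
      return OptionalLong.of(Long.parseLong(dirName));
    } catch (NumberFormatException e) {
      // All digits, but too large for a long.
      return OptionalLong.empty();
    }
  }
}
```

With a check like this, a caller could skip and log a directory such as
"1823-raw" or "1823-Backup" without relying on an exception.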
Case 2:
2021-03-25 10:15:47,502 [Thread-15] ERROR org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader: Caught a Run time exception during reading container files from Volume /data5/hdds/hdds {}
org.apache.hadoop.metrics2.MetricsException: Metrics source RDBMetrics already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.hdds.utils.db.RDBMetrics.create(RDBMetrics.java:47)
    at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:152)
    at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:191)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(AbstractDatanodeStore.java:128)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:103)
    at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaOneImpl.<init>(DatanodeStoreSchemaOneImpl.java:40)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(BlockUtils.java:68)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(BlockUtils.java:93)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerUtil.parseKVContainerData(KeyValueContainerUtil.java:195)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.verifyAndFixupContainerData(ContainerReader.java:181)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.verifyContainerFile(ContainerReader.java:158)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.readVolume(ContainerReader.java:136)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerReader.run(ContainerReader.java:91)
    at java.lang.Thread.run(Thread.java:748)
> DN stopped to load containers on volume after a container load exception
> ------------------------------------------------------------------------
>
> Key: HDDS-5032
> URL: https://issues.apache.org/jira/browse/HDDS-5032
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Critical