[
https://issues.apache.org/jira/browse/HDDS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose updated HDDS-7100:
-----------------------------
Parent Issue: HDDS-7364 (was: HDDS-6548)
> Scrubber incorrectly marks containers unhealthy when DN is shutdown
> -------------------------------------------------------------------
>
> Key: HDDS-7100
> URL: https://issues.apache.org/jira/browse/HDDS-7100
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: Ozone Datanode
> Reporter: Stephen O'Donnell
> Assignee: Szabolcs Gál
> Priority: Critical
>
> When the DN is shut down, the ContainerDataScanner.shutdown() method is called:
> {code}
> public synchronized void shutdown() {
>   this.stopping = true;
>   this.canceler.cancel(
>       String.format(NAME_FORMAT, volume) + " is shutting down");
>   this.interrupt();
>   try {
>     this.join();
>   } catch (InterruptedException ex) {
>     LOG.warn("Unexpected exception while stopping data scanner for volume "
>         + volume, ex);
>     Thread.currentThread().interrupt();
>   }
> }
> {code}
> This interrupts the scanner thread. The code to scan a container looks like:
> {code}
> public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
>   boolean valid;
>   try {
>     valid = fastCheck();
>     if (valid) {
>       scanData(throttler, canceler);
>     }
>   } catch (IOException e) {
>     handleCorruption(e);
>     valid = false;
>   }
>   return valid;
> }
> {code}
> The interrupt causes some method further down the stack to throw an
> exception, which is then caught by the IOException handler. Right now, the
> handler assumes any exception is due to the container being unhealthy and
> marks the container as such.
> Adding some debug code, we can see the real exception when this occurs is
> "java.nio.channels.ClosedByInterruptException":
> {code}
> datanode_1 | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
> datanode_1 | java.nio.channels.ClosedByInterruptException
> datanode_1 | at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
> datanode_1 | at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
> datanode_1 | at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
> datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
> datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
> datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
> datanode_1 | at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
> datanode_1 | at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
> datanode_1 | at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
> {code}
> I am not sure whether other types of exceptions could be raised, so simply
> catching ClosedByInterruptException is probably not a good solution. I feel
> we should raise specific container integrity exceptions if the container
> should be marked unhealthy, and the catch-all IOException handler probably
> should not be used.
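
The proposal in the description could be sketched roughly as follows. This is
a minimal, self-contained illustration and not Ozone code:
ContainerIntegrityException, ScanResult, and the one-byte "checksum" are
hypothetical names invented for the example. The only behaviour taken as given
is standard java.nio: a FileChannel operation on an interrupted thread throws
ClosedByInterruptException, which is a subclass of IOException.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ScanInterruptSketch {

  /** Hypothetical scan outcome; not an Ozone type. */
  enum ScanResult { HEALTHY, CORRUPT, INCONCLUSIVE }

  /** Hypothetical exception raised only for real integrity failures. */
  static class ContainerIntegrityException extends IOException {
    ContainerIntegrityException(String msg) {
      super(msg);
    }
  }

  /** Stand-in for scanData(): reads one byte and compares it to the expected value. */
  static void scanData(Path file, byte expected) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buf = ByteBuffer.allocate(1);
      // If the calling thread is interrupted, this read closes the channel
      // and throws ClosedByInterruptException.
      int n = ch.read(buf);
      if (n != 1 || buf.get(0) != expected) {
        throw new ContainerIntegrityException("checksum mismatch in " + file);
      }
    }
  }

  /** Only an explicit integrity failure marks the container corrupt. */
  static ScanResult fullCheck(Path file, byte expected) {
    try {
      scanData(file, expected);
      return ScanResult.HEALTHY;
    } catch (ContainerIntegrityException e) {
      return ScanResult.CORRUPT;
    } catch (IOException e) {
      // Includes ClosedByInterruptException on shutdown: the scan did not
      // finish, so nothing can be concluded about the container's health.
      return ScanResult.INCONCLUSIVE;
    }
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("container", ".data");
    Files.write(file, new byte[] {42});
    try {
      System.out.println(fullCheck(file, (byte) 42)); // data matches
      System.out.println(fullCheck(file, (byte) 7));  // genuine mismatch
      Thread.currentThread().interrupt();             // simulate shutdown()
      System.out.println(fullCheck(file, (byte) 42)); // data fine, scan interrupted
    } finally {
      Thread.interrupted(); // clear the flag before cleanup
      Files.deleteIfExists(file);
    }
  }
}
```

In this sketch an interrupted scan is reported as INCONCLUSIVE rather than
CORRUPT, and the thread's interrupt status stays set after the exception, so
the caller can still observe the shutdown request.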
--
This message was sent by Atlassian Jira
(v8.20.10#820010)