[ 
https://issues.apache.org/jira/browse/HDDS-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose updated HDDS-7100:
-----------------------------
    Parent Issue: HDDS-7364  (was: HDDS-6548)

> Scrubber incorrectly marks containers unhealthy when DN is shutdown
> -------------------------------------------------------------------
>
>                 Key: HDDS-7100
>                 URL: https://issues.apache.org/jira/browse/HDDS-7100
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Stephen O'Donnell
>            Assignee: Szabolcs Gál
>            Priority: Critical
>
> When the DN is shutdown, the ContainerDataScanner.shutdown() method is called:
> {code}
>   public synchronized void shutdown() {
>     this.stopping = true;
>     this.canceler.cancel(
>         String.format(NAME_FORMAT, volume) + " is shutting down");
>     this.interrupt();
>     try {
>       this.join();
>     } catch (InterruptedException ex) {
>       LOG.warn("Unexpected exception while stopping data scanner for volume "
>           + volume, ex);
>       Thread.currentThread().interrupt();
>     }
>   }
> {code}
> This interrupts the scanner thread. The code to scan a container looks like:
> {code}
>   public boolean fullCheck(DataTransferThrottler throttler, Canceler canceler) {
>     boolean valid;
>     try {
>       valid = fastCheck();
>       if (valid) {
>         scanData(throttler, canceler);
>       }
>     } catch (IOException e) {
>       handleCorruption(e);
>       valid = false;
>     }
>     return valid;
>   }
> {code}
> The interrupt causes some method further down the stack to throw an 
> exception, which is then caught by the IOException handler. Right now, the 
> handler assumes any exception is due to the container being unhealthy, and 
> marks the container as such.
> Adding some debug code, we can see the real exception when this occurs is 
> "java.nio.channels.ClosedByInterruptException":
> {code}
> datanode_1  | 2022-08-05 12:08:51,676 [ContainerDataScanner(/data/hdds/hdds)] INFO keyvalue.KeyValueContainerCheck: IO exception in checker
> datanode_1  | java.nio.channels.ClosedByInterruptException
> datanode_1  |         at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199)
> datanode_1  |         at java.base/sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:162)
> datanode_1  |         at java.base/sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:366)
> datanode_1  |         at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.verifyChecksum(KeyValueContainerCheck.java:295)
> datanode_1  |         at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.scanData(KeyValueContainerCheck.java:272)
> datanode_1  |         at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck.fullCheck(KeyValueContainerCheck.java:128)
> datanode_1  |         at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.scanData(KeyValueContainer.java:849)
> datanode_1  |         at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.runIteration(ContainerDataScanner.java:106)
> datanode_1  |         at org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.run(ContainerDataScanner.java:81)
> {code}
> I am not sure whether other types of exception could be raised, so simply 
> catching ClosedByInterruptException is probably not a good solution. I feel 
> we should raise specific container integrity exceptions when the container 
> should be marked unhealthy, and the catch-all IOException handler probably 
> should not be used to make that decision.
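
The proposal above can be illustrated with a minimal, self-contained Java sketch. It is not the Ozone code and not necessarily how the issue was fixed: ScanOutcomeSketch, ScanResult, ContainerIntegrityException, the byte-sum checksum, and the simplified scanData/fullCheck signatures are all hypothetical names introduced for illustration. The sketch relies only on documented JDK behaviour: a read on a FileChannel by a thread whose interrupt status is set closes the channel and throws ClosedByInterruptException, which is a subclass of IOException. Only the dedicated integrity exception marks the data unhealthy; any other IOException is reported as inconclusive.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ScanOutcomeSketch {

  /** Hypothetical result type; not an Ozone class. */
  enum ScanResult { HEALTHY, UNHEALTHY, INCONCLUSIVE }

  /** Hypothetical exception raised only when the data itself is bad. */
  static class ContainerIntegrityException extends IOException {
    ContainerIntegrityException(String message) {
      super(message);
    }
  }

  /** Toy checksum for the sketch: the sum of all bytes. */
  static int checksum(byte[] data) {
    int sum = 0;
    for (byte b : data) {
      sum += b;
    }
    return sum;
  }

  /** Stand-in for scanData(): reads the file and verifies the toy checksum. */
  static void scanData(Path file, int expected) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buffer = ByteBuffer.allocate(64);
      int sum = 0;
      // read() throws ClosedByInterruptException if this thread is interrupted.
      while (channel.read(buffer) != -1) {
        buffer.flip();
        while (buffer.hasRemaining()) {
          sum += buffer.get();
        }
        buffer.clear();
      }
      if (sum != expected) {
        throw new ContainerIntegrityException("checksum mismatch in " + file);
      }
    }
  }

  /** Only an integrity exception marks the data unhealthy. */
  static ScanResult fullCheck(Path file, int expected) {
    try {
      scanData(file, expected);
      return ScanResult.HEALTHY;
    } catch (ContainerIntegrityException e) {
      return ScanResult.UNHEALTHY;
    } catch (IOException e) {
      // Covers ClosedByInterruptException and any other I/O failure:
      // the scan did not complete, so no verdict is recorded.
      return ScanResult.INCONCLUSIVE;
    }
  }

  /** Runs a clean scan, a corrupt scan, and a scan on an interrupted thread. */
  static ScanResult[] runDemo() throws IOException {
    byte[] data = "ozone".getBytes(StandardCharsets.US_ASCII);
    Path file = Files.createTempFile("scan-sketch", ".bin");
    Files.write(file, data);
    try {
      ScanResult clean = fullCheck(file, checksum(data));
      ScanResult corrupt = fullCheck(file, checksum(data) + 1);
      // Simulate shutdown() interrupting the scanner thread.
      Thread.currentThread().interrupt();
      ScanResult interrupted = fullCheck(file, checksum(data));
      return new ScanResult[] {clean, corrupt, interrupted};
    } finally {
      Thread.interrupted(); // clear the simulated interrupt before cleanup
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws IOException {
    for (ScanResult result : runDemo()) {
      System.out.println(result);
    }
  }
}
```

With the catch-all handler from the issue description, the third scan would have been classified the same way as the second. In this sketch the interrupted scan is kept separate, so a datanode shutdown would not by itself cause a container to be marked unhealthy.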



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
