[
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598707#comment-17598707
]
Ethan Rose edited comment on HDDS-7103 at 9/1/22 2:15 AM:
----------------------------------------------------------
[~NeilJoshi] it looks like the datanode currently shuts down if this exception
is thrown. We could probably change the code to catch the exception, skip
loading the group, and queue a close pipeline action to send to SCM once the
datanode has registered. It looks like RATIS-1677 [will be
reverted|https://github.com/apache/ratis/pull/718#issuecomment-1231215723] on
the Ratis 2.4.0 release branch, so this won't be needed for Ozone 1.3.0. We will
also need to check the format options Ozone is using based on [this
comment|https://issues.apache.org/jira/browse/RATIS-1694?focusedCommentId=17598402&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17598402].
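A rough sketch of the catch-and-skip idea, only to make the proposal concrete.
All names here ({{GroupLoadSketch}}, {{GroupLoader}}, {{pendingCloseActions}}) are
hypothetical and are not Ozone or Ratis APIs; the real change would live in the
datanode's Ratis server startup path and would use the existing pipeline action
mechanism.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

class GroupLoadSketch {
  /** Hypothetical stand-in for the call that loads one Ratis group from disk. */
  interface GroupLoader {
    void load(UUID groupId) throws IOException;
  }

  private final List<UUID> loadedGroups = new ArrayList<>();
  // Groups that failed to load; their pipelines should be closed
  // once the datanode has registered with SCM.
  private final List<UUID> pendingCloseActions = new ArrayList<>();

  /** Loads every group, skipping (instead of aborting on) groups that fail. */
  void loadAll(List<UUID> groupIds, GroupLoader loader) {
    for (UUID groupId : groupIds) {
      try {
        loader.load(groupId);
        loadedGroups.add(groupId);
      } catch (IOException e) {
        // Assumption: a failed group is logged, skipped, and queued for a
        // close pipeline action rather than shutting the datanode down.
        System.err.println("Skipping group " + groupId + ": " + e.getMessage());
        pendingCloseActions.add(groupId);
      }
    }
  }

  List<UUID> getLoadedGroups() {
    return new ArrayList<>(loadedGroups);
  }

  /** Returns the queued group IDs and clears the queue; call after registration. */
  List<UUID> drainPendingCloseActions() {
    List<UUID> drained = new ArrayList<>(pendingCloseActions);
    pendingCloseActions.clear();
    return drained;
  }
}
```

The queue is drained only after registration because SCM cannot act on a close
pipeline action from a datanode it does not know about yet.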
> Ratis log storage directories unchecked causing unhandled exception on
> datanode restart
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-7103
> URL: https://issues.apache.org/jira/browse/HDDS-7103
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Neil Joshi
> Priority: Major
>
> When the Ratis log storage is configured to be on multiple disks and a
> corruption causes the same directory to be found on each disk, Ratis throws
> an unhandled exception. The unhandled exception prevents the datanode from
> creating pipelines. The datanode remains up, and the user can only detect the
> failure through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property
> _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, i.e.
> _dn1,dn2_, with the same directories present on both disks. On datanode start,
> the error is logged when bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1 | 2022-08-03 22:05:54 INFO XceiverServerRatis:481 -
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1 | 2022-08-03 22:05:54 WARN EndpointStateMachine:236 -
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1 | java.io.IOException: More than one directories found for
> 01a173a0-6bd2-478a-8598-05df3a6f318a:
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a,
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track the issue and to resolve it. The issue was
> identified and discussed in a previous PR for the hdds volume diskchecker, PR
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the directories with the problem and
> continue. This could be done either
> i.) with a checker that runs before the XceiverServerRatis starts; if such a
> checker already exists in current Ozone, we need to find out how to configure
> it to resolve this issue, or
> ii.) by modifying the Ratis code to remove the affected directories and
> continue instead of throwing an unhandled IOException, see
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
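For option i.) in the description, a pre-start check might look roughly like
the sketch below. It assumes, based on the paths in the logged error, that each
group has a directory named after its ID directly under a configured storage
location. {{DuplicateGroupDirCheck}} and {{findDuplicates}} are hypothetical names,
not existing Ozone or Ratis code, and what to do with the reported directories
(omit, move aside, or close the pipeline) is left open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DuplicateGroupDirCheck {
  /**
   * Scans the given storage locations and returns every child directory name
   * that appears under more than one of them, mapped to all paths found.
   */
  static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing or failed volume is a separate problem
      }
      // Only immediate child directories are considered.
      try (DirectoryStream<Path> children =
               Files.newDirectoryStream(root, Files::isDirectory)) {
        for (Path child : children) {
          byName.computeIfAbsent(child.getFileName().toString(),
              name -> new ArrayList<>()).add(child);
        }
      }
    }
    // Keep only names found in two or more storage locations.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }
}
```

Running this over the configured storage locations before XceiverServerRatis
starts would report the same condition Ratis currently rejects with the
IOException, giving the datanode a chance to handle it first.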