[
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608673#comment-17608673
]
Lokesh Jain commented on HDDS-7103:
-----------------------------------
[~NeilJoshi] [~szetszwo] [~erose] Thanks for working on this! I think we
should also alert SCM to the silent failure on the Datanode. I am not sure
whether any of these disk errors (data and Ratis) are reported to SCM today,
or whether any follow-up action is taken on them. It would probably be good
to show them in the SCM UI.
> Ratis log storage directories unchecked causing unhandled exception on
> datanode restart
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-7103
> URL: https://issues.apache.org/jira/browse/HDDS-7103
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Neil Joshi
> Priority: Major
>
> When the Ratis storage logs are configured to be on multiple disks and a
> corruption causes the same directory to be found on each disk, Ratis throws
> an unhandled exception. The exception prevents the datanode from creating
> pipelines. The datanode remains up, and the user can only detect the failure
> through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property
> _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, e.g.
> _dn1,dn2_, with the same directories present on both disks. On datanode
> start, the error is logged when bringing up the XceiverServerRatis.
> Snippet of the logged error:
> {code:java}
> ozone-datanode-1 | 2022-08-03 22:05:54 INFO XceiverServerRatis:481 -
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1 | 2022-08-03 22:05:54 WARN EndpointStateMachine:236 -
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1 | java.io.IOException: More than one directories found for
> 01a173a0-6bd2-478a-8598-05df3a6f318a:
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a,
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track and resolve the issue. The issue was identified
> and discussed in a previous PR for the hdds volume diskchecker, PR
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue. This
> could be done either
> i.) with a checker that runs before the XceiverServerRatis starts; if such a
> checker already exists in current Ozone, the question is how to configure it
> to resolve this issue, or
> ii.) by modifying the Ratis code to remove the affected directories and
> continue instead of throwing an unhandled IOException, see
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
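
For illustration only, option i.) from the description could look roughly like the sketch below. It is not existing Ozone or Ratis code: the class name DuplicateRatisDirCheck and the method findDuplicates are hypothetical, and it assumes that each configured storage location directly contains one subdirectory per Raft group, as the paths in the logged error suggest.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
public class DuplicateRatisDirCheck {

  /**
   * Maps each subdirectory name to every storage root that contains it,
   * keeping only the names found under more than one root.
   */
  public static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing root is a separate problem, not handled here
      }
      try (DirectoryStream<Path> children = Files.newDirectoryStream(root)) {
        for (Path child : children) {
          if (Files.isDirectory(child)) {
            byName.computeIfAbsent(child.getFileName().toString(),
                k -> new ArrayList<>()).add(child);
          }
        }
      }
    }
    // Drop every name that appears under a single root only.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }

  /** Takes the storage roots as arguments, e.g. /mydata/dn1 /mydata/dn2. */
  public static void main(String[] args) throws IOException {
    List<Path> roots = new ArrayList<>();
    for (String arg : args) {
      roots.add(Paths.get(arg));
    }
    findDuplicates(roots).forEach((name, paths) ->
        System.err.println("Duplicate Ratis directory " + name + ": " + paths));
  }
}
```

A check like this only detects the condition. Whether the datanode should then skip the affected directories, fail fast, or report the volume to SCM is the open question in this jira.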