[
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579462#comment-17579462
]
Neil Joshi commented on HDDS-7103:
----------------------------------
[~szetszwo] ,
{quote}In your case, Ratis won't create the second directory after the change.
It will throw an exception when it fails to read the existing directory.
{quote}
If a write to a Ratis group directory on one disk fails, Ratis will _now_ fail
by throwing an exception and will _not retry_ by creating a new directory on
another volume. Given that, how does Ozone handle the thrown exception? Does it
mark the volume as unhealthy and shut down the datanode so that the admin can
replace the failed volume? Or is the volume marked unhealthy while the system
remains in service, with Ratis using a different, unique Ratis group directory
(not the one that failed) on another volume for the Ratis logs?
> Ratis log storage directories unchecked causing unhandled exception on
> datanode restart
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-7103
> URL: https://issues.apache.org/jira/browse/HDDS-7103
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Neil Joshi
> Priority: Major
>
> When the Ratis storage logs are configured to span multiple disks and a
> corruption leaves the same directory present on each disk, Ratis throws an
> unhandled exception. The unhandled exception prevents the datanode from
> creating pipelines. The datanode remains up, and the user can detect the
> failure only through the datanode logs.
> The error can be reproduced on an Ozone cluster with the configuration
> property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume
> locations, for example _dn1,dn2_, with the same directory present on both
> disks. On datanode start, the error is logged when bringing up the
> XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1 | 2022-08-03 22:05:54 INFO XceiverServerRatis:481 -
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1 | 2022-08-03 22:05:54 WARN EndpointStateMachine:236 -
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1 | java.io.IOException: More than one directories found for
> 01a173a0-6bd2-478a-8598-05df3a6f318a:
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a,
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1 | at
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track and resolve the issue. The issue was
> identified and discussed in a previous PR for the hdds volume diskchecker, PR
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue. This
> was to be done in one of two ways:
> i.) with a checker that runs prior to the XceiverServerRatis; if such a
> checker already exists in the current Ozone, how can it be configured to
> resolve this issue?
> ii.) modify the Ratis code to remove the affected directories and continue
> instead of throwing an unhandled IOException, see
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
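For illustration only, option i.) could look roughly like the sketch below: a
standalone check that scans the configured storage locations and reports any
directory name that appears under more than one of them. The class
RatisDirDuplicateChecker and its method findDuplicates are hypothetical names,
not part of Ozone or Ratis, and the sketch assumes that each immediate
subdirectory of a storage location is named after a Ratis group id, as in the
logged paths above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
public final class RatisDirDuplicateChecker {
  private RatisDirDuplicateChecker() {}

  /**
   * Maps each subdirectory name (assumed to be a Ratis group id) to the
   * directories carrying that name, keeping only names that were found
   * under more than one storage root.
   */
  public static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing or unreadable root is skipped in this sketch
      }
      try (Stream<Path> children = Files.list(root)) {
        children.filter(Files::isDirectory).forEach(dir ->
            byName.computeIfAbsent(dir.getFileName().toString(),
                k -> new ArrayList<>()).add(dir));
      }
    }
    // Names present under a single root are not a problem; drop them.
    byName.values().removeIf(dirs -> dirs.size() < 2);
    return byName;
  }

  /** Usage: java RatisDirDuplicateChecker /mydata/dn1 /mydata/dn2 */
  public static void main(String[] args) throws IOException {
    List<Path> roots = new ArrayList<>();
    for (String arg : args) {
      roots.add(Paths.get(arg));
    }
    findDuplicates(roots).forEach((groupId, dirs) ->
        System.out.println("Duplicate directories for " + groupId + ": " + dirs));
  }
}
```

The sketch only reports duplicates. Whether the datanode should then omit the
affected directories, mark a volume unhealthy, or refuse to start is the open
question in this jira and is deliberately left out.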