[jira] [Commented] (HDDS-7103) Ratis log storage directories unchecked causing unhandled exception on datanode restart

Ethan Rose (Jira) Mon, 15 Aug 2022 14:17:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579919#comment-17579919
 ]


Ethan Rose commented on HDDS-7103:
----------------------------------

Thanks for the analysis [~szetszwo]
{quote}For Ozone and any other applications, it should start with 
RaftStorage.StartupOption.FORMAT only at the first time. When it restarts, it 
should use RaftStorage.StartupOption.RECOVER.
{quote}
For Ozone datanodes, I think this translates to passing FORMAT when the group 
is created on the datanode as part of the pipeline create command, and RECOVER 
on datanode startup. If Ratis throws an exception when loading a pipeline, the 
datanode needs to handle this by indicating to SCM that the pipeline is 
unhealthy and should be force closed, initially quasi-closing its containers 
since Ratis is not available. That Ratis storage volume should be marked as 
failed on the datanode so it is not used for future pipelines.
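
As a rough illustration of that proposal (not actual Ozone or Ratis code), the option choice and the failure path could be sketched as below. StartupOption here is a local stand-in for RaftStorage.StartupOption, and GroupLoader, Result, and loadGroup are hypothetical names; the follow-up actions are recorded as plain strings rather than real SCM or volume calls.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are real Ozone or Ratis classes.
final class PipelineStartupSketch {

  // Local stand-in mirroring the two RaftStorage.StartupOption values named in the quote.
  enum StartupOption { FORMAT, RECOVER }

  // Hypothetical hook; a real datanode would hand the option to Ratis here.
  interface GroupLoader {
    void load(String groupId, StartupOption option) throws IOException;
  }

  // Records which option was used and which follow-up actions were triggered.
  static final class Result {
    final StartupOption option;
    final List<String> actions = new ArrayList<>();

    Result(StartupOption option) {
      this.option = option;
    }
  }

  // FORMAT only when the group is created by a pipeline create command,
  // RECOVER when the datanode is starting up with existing groups.
  static StartupOption chooseOption(boolean createdByPipelineCommand) {
    return createdByPipelineCommand ? StartupOption.FORMAT : StartupOption.RECOVER;
  }

  // Loads one group; on failure, records the handling steps proposed in the comment.
  static Result loadGroup(String groupId, String volume,
      boolean createdByPipelineCommand, GroupLoader loader) {
    Result result = new Result(chooseOption(createdByPipelineCommand));
    try {
      loader.load(groupId, result.option);
    } catch (IOException e) {
      result.actions.add("report pipeline " + groupId + " unhealthy to SCM");
      result.actions.add("quasi-close containers of pipeline " + groupId);
      result.actions.add("mark Ratis volume " + volume + " failed");
    }
    return result;
  }
}
```

The point of the sketch is only that the exception is caught per pipeline, so one bad storage directory does not stop the datanode from serving or creating other pipelines.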

> Ratis log storage directories unchecked causing unhandled exception on 
> datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis storage logs are configured to be on multiple disks and a 
> corruption causes the same directory to be found on each disk, Ratis throws 
> an unhandled exception.  The unhandled exception prevents the datanode from 
> creating pipelines.  The datanode remains up, and the user can only detect 
> the failure through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property 
> _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, 
> i.e. _dn1,dn2_, with the same directories present on both disks.  On datanode 
> start, the error is logged when bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - 
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - 
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 
> 01a173a0-6bd2-478a-8598-05df3a6f318a: 
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, 
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track the issue and to resolve it.  This issue had been 
> identified and discussed in a previous PR for the hdds volume diskchecker, PR 
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue.  This 
> could be done either
> i.) with a checker that runs prior to the XceiverServerRatis; if such a 
> checker exists in current Ozone, how should it be configured to resolve this 
> issue?
> ii.) by modifying the Ratis code to remove the affected directories and 
> continue instead of throwing an unhandled IOException, see 
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
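
A minimal sketch of option i.), assuming the checker is given the listing of group directories under each configured storage location. DuplicateGroupDirCheck and findDuplicates are hypothetical names, not existing Ozone or Ratis APIs, and what to do with the reported duplicates (omit them, or fail the volume) is left open.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-start checker; not part of Ozone or Ratis.
final class DuplicateGroupDirCheck {

  // Input: storage location -> names of the group directories found under it.
  // Output: group directory name -> full paths, only for names that appear
  // under more than one storage location.
  static Map<String, List<String>> findDuplicates(
      Map<String, List<String>> groupDirsByStorageDir) {
    Map<String, List<String>> locations = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : groupDirsByStorageDir.entrySet()) {
      for (String group : entry.getValue()) {
        locations.computeIfAbsent(group, k -> new ArrayList<>())
            .add(entry.getKey() + "/" + group);
      }
    }
    // Keep only the groups that exist in at least two storage locations.
    locations.values().removeIf(paths -> paths.size() < 2);
    return locations;
  }
}
```

With the two locations from the log above, the result would hold a single entry for 01a173a0-6bd2-478a-8598-05df3a6f318a listing both the dn1 and dn2 paths.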



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

