[jira] [Commented] (HDDS-7103) Ratis log storage directories unchecked causing unhandled exception on datanode restart

Lokesh Jain (Jira) Fri, 23 Sep 2022 03:08:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608673#comment-17608673
 ]


Lokesh Jain commented on HDDS-7103:
-----------------------------------

[~NeilJoshi] [~szetszwo] [~erose] Thanks for working on this! I think we 
should also alert SCM about this silent failure in the Datanode. I am not 
sure whether any of these disk errors (data + Ratis) are reported to SCM 
today, or whether any follow-up action is taken on them. It would probably 
be good to show them in the SCM UI.

> Ratis log storage directories unchecked causing unhandled exception on 
> datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis log storage is configured across multiple disks and a 
> corruption leaves the same directory on each disk, Ratis throws an 
> unhandled exception.  The exception prevents the datanode from creating 
> pipelines.  The datanode stays up, and the user can detect the failure 
> only through the datanode logs.
> The error can be reproduced on an Ozone cluster with the configuration 
> property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume 
> locations, i.e. _dn1,dn2_, with the same directory present on both disks.  
> On datanode start, the error is logged when bringing up the 
> XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 01a173a0-6bd2-478a-8598-05df3a6f318a: [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201)
> {code}
> This jira is filed to track the issue and to resolve it.  The issue was 
> identified and discussed in an earlier PR for the hdds volume diskchecker, 
> PR #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue.  
> This could be done in either of two ways:
> i.) with a checker that runs before the XceiverServerRatis starts; if such 
> a checker already exists in Ozone, the question is how to configure it to 
> resolve this issue.
> ii.) modify the Ratis code to remove the affected directories and continue 
> instead of throwing an unhandled IOException, see 
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
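
Option i.) in the description could be sketched roughly as follows. This is 
only an illustration, not code from Ozone or Ratis: the class 
RatisStorageDirCheck and its methods are hypothetical names, and it assumes 
that each immediate sub-directory of a configured storage location is named 
after a Raft group ID, as the paths in the logged error suggest. It scans 
the configured locations and reports every directory name that appears under 
more than one of them, so a caller could omit those and continue.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
final class RatisStorageDirCheck {

  private RatisStorageDirCheck() { }

  /**
   * Maps each sub-directory name (assumed to be a Raft group ID) to every
   * directory with that name found under the given storage roots.
   */
  static Map<String, List<Path>> groupDirsByName(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a root that is not an existing directory is skipped
      }
      // List only the immediate children that are directories.
      try (DirectoryStream<Path> children =
               Files.newDirectoryStream(root, Files::isDirectory)) {
        for (Path child : children) {
          byName.computeIfAbsent(child.getFileName().toString(),
              k -> new ArrayList<>()).add(child);
        }
      }
    }
    return byName;
  }

  /** Returns only the names found under more than one storage root. */
  static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> duplicates = new TreeMap<>();
    groupDirsByName(storageRoots).forEach((name, dirs) -> {
      if (dirs.size() > 1) {
        duplicates.put(name, dirs);
      }
    });
    return duplicates;
  }
}
```

With the layout from the logged error, calling findDuplicates on 
/mydata/dn1 and /mydata/dn2 would return a single entry for 
01a173a0-6bd2-478a-8598-05df3a6f318a listing both paths. What to do with such 
an entry (skip the group, pick one copy, or report it to SCM) is a policy 
decision this sketch leaves open.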


