[ 
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607924#comment-17607924
 ] 

Ethan Rose commented on HDDS-7103:
----------------------------------

Hi [~szetszwo] [~NeilJoshi], I have marked this as blocked until Ozone starts 
using Ratis 2.4.0. My understanding is that the Ratis fix in that release is 
needed before we can work on the Ozone fix. Please correct me if I have 
misunderstood.

> Ratis log storage directories unchecked causing unhandled exception on 
> datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis log storage is configured to span multiple disks and 
> corruption causes the same directory to appear on more than one disk, Ratis 
> throws an unhandled exception.  The exception prevents the datanode from 
> creating pipelines.  The datanode stays up, so the failure is visible only 
> in the datanode logs.
> The error can be reproduced on an Ozone cluster with the configuration 
> property _*dfs.container.ratis.datanode.storage.dir*_ set to two volume 
> locations, e.g. _dn1,dn2_, with the same directory present on both disks.  
> On datanode start, the error is logged while bringing up XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - 
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - 
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 
> 01a173a0-6bd2-478a-8598-05df3a6f318a: 
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, 
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track the issue and to resolve it.  The issue was 
> identified and discussed in a previous PR for the hdds volume diskchecker, PR 
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue.  
> This could be done in either of two ways:
> i.) with a checker that runs before XceiverServerRatis starts; if such a 
> checker already exists in current Ozone, the question is how to configure it 
> to resolve this issue.
> ii.) modify the Ratis code to remove the affected directories and continue 
> instead of throwing an unhandled IOException, see 
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
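
For illustration only, option i.) above could look roughly like the sketch 
below. It is not existing Ozone or Ratis code: the class and method names are 
hypothetical, and it assumes (based on the paths in the logged error) that 
each Ratis group directory is an immediate child of a configured storage 
directory and is named after the group ID. It only reports duplicates; what to 
do with them (skip, fail the volume, etc.) is left open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
final class DuplicateGroupDirCheck {

  private DuplicateGroupDirCheck() { }

  /**
   * Returns every subdirectory name found under more than one storage root,
   * mapped to all of the locations where it was found.
   */
  static Map<String, List<Path>> findDuplicates(List<Path> storageRoots) {
    Map<String, List<Path>> byName = new TreeMap<>();
    // De-duplicate the roots so a root listed twice is not reported as a clash.
    for (Path root : new LinkedHashSet<>(storageRoots)) {
      if (!Files.isDirectory(root)) {
        continue; // a missing root is a different failure, not handled here
      }
      try (DirectoryStream<Path> children =
               Files.newDirectoryStream(root, Files::isDirectory)) {
        for (Path child : children) {
          byName.computeIfAbsent(child.getFileName().toString(),
              k -> new ArrayList<>()).add(child);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    // Keep only the names that appear under two or more roots.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }
}
```

Called with the two storage directories from the report (/mydata/dn1 and 
/mydata/dn2), such a check would be expected to return a single entry for 
01a173a0-6bd2-478a-8598-05df3a6f318a listing both paths, which the datanode 
could then log or act on before starting XceiverServerRatis.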



