[ 
https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598120#comment-17598120
 ] 

Neil Joshi commented on HDDS-7103:
----------------------------------

Thanks [~szetszwo] for the Ratis patches that handle this ticket's failure
condition.  [~erose], with the new Ratis behavior of throwing an exception
when it cannot write to the Ratis directory, rather than retrying on another
volume, how does this affect the datanode?  When the Ratis server throws an
IOException on such a write failure, how does Ozone handle it?  Does the
datanode report to the SCM that the pipeline is unhealthy, so that the
pipeline is closed without further problems?  How should this error condition
be handled?

> Ratis log storage directories unchecked causing unhandled exception on 
> datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis storage logs are configured to be on multiple disks and
> corruption causes the same directory to be found on each disk, Ratis throws
> an unhandled exception.  The unhandled exception prevents the datanode from
> creating pipelines.  The datanode remains up, and the user can only detect
> the failure through the datanode logs.
> The error can be seen on an Ozone cluster with the configuration property
> _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations,
> e.g. _dn1,dn2_, with the same directory present on both disks.  On datanode
> start, the error is logged when bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - 
> Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - 
> Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 
> 01a173a0-6bd2-478a-8598-05df3a6f318a: 
> [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, 
> /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201){code}
> This jira is filed to track the issue and to resolve it.  The issue was
> identified and discussed in a previous PR for the hdds volume diskchecker, PR
> #2158, https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the problem directories and continue.  This
> could be done in either of two ways:
> i.) with a checker that runs before the XceiverServerRatis starts; if such a
> checker already exists in the current Ozone, how should it be configured to
> resolve this issue?
> ii.) modify the Ratis code to remove the affected directories and continue
> instead of throwing an unhandled IOException, see
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
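
For idea i.) above, a pre-start check could look roughly like the sketch
below.  This is only an illustration, not existing Ozone or Ratis code: the
class name DuplicateGroupDirCheck and the method findDuplicates are
hypothetical, and the sketch assumes that each configured storage location
holds one subdirectory per Ratis group, as the paths in the logged error
suggest.  It reports every subdirectory name that appears under more than one
storage location, so that the caller can decide whether to omit those
directories or fail the volume before XceiverServerRatis is started.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
final class DuplicateGroupDirCheck {

  private DuplicateGroupDirCheck() {
  }

  /**
   * Returns every subdirectory name that exists under more than one of the
   * given storage roots, mapped to all of the locations where it was found.
   */
  static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        // A missing root is a different failure mode; skipped in this sketch.
        continue;
      }
      // List only the immediate child directories of this storage root.
      try (DirectoryStream<Path> children =
               Files.newDirectoryStream(root, p -> Files.isDirectory(p))) {
        for (Path child : children) {
          String name = child.getFileName().toString();
          byName.computeIfAbsent(name, k -> new ArrayList<>()).add(child);
        }
      }
    }
    // Keep only the names that were found in two or more roots.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }
}
```

With the layout from the logged error, calling findDuplicates on /mydata/dn1
and /mydata/dn2 would be expected to return a single entry for
01a173a0-6bd2-478a-8598-05df3a6f318a listing both paths.  What to do with the
result (skip the group, mark a volume as failed, or refuse to start) is the
open question in this ticket and is deliberately left to the caller.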



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
