[jira] [Commented] (HDDS-7103) Ratis log storage directories unchecked causing unhandled exception on datanode restart

Neil Joshi (Jira) Sun, 14 Aug 2022 16:15:29 -0700


    [ https://issues.apache.org/jira/browse/HDDS-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579462#comment-17579462 ]


Neil Joshi commented on HDDS-7103:
----------------------------------

[~szetszwo] ,
{quote}In your case, Ratis won't create the second directory after the change. 
It will throw an exception when it fails to read the existing directory. 
{quote}
So if writing to a Ratis group directory on one disk fails, Ratis will _now_ 
fail by throwing an exception and will _not retry_ by creating a new directory 
on another volume.  Given that, how does Ozone handle the thrown exception?  
Does it mark the volume as unhealthy and shut down the datanode so that the 
admin can replace the failed volume?  Or is the volume marked unhealthy while 
the datanode stays in service, with Ratis using a different, unique group 
directory (not the one that failed) on another volume for its logs?

> Ratis log storage directories unchecked causing unhandled exception on 
> datanode restart
> ---------------------------------------------------------------------------------------
>
>                 Key: HDDS-7103
>                 URL: https://issues.apache.org/jira/browse/HDDS-7103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> When the Ratis log storage is configured across multiple disks and a 
> corruption leaves the same directory present on each disk, Ratis throws an 
> unhandled exception.  The exception prevents the datanode from creating 
> pipelines.  The datanode stays up, so the only sign of the failure is in the 
> datanode logs.
> The error can be seen on an Ozone cluster with the configuration property 
> _*dfs.container.ratis.datanode.storage.dir*_ set to two volume locations, i.e. 
> _dn1,dn2_, where both disks contain the same directories.  On datanode start, 
> the error is logged while bringing up the XceiverServerRatis.
> Snippet of logged error:
> {code:java}
> ozone-datanode-1  | 2022-08-03 22:05:54 INFO  XceiverServerRatis:481 - Starting XceiverServerRatis feb90744-e0e7-4b2e-8d57-02213ce29693
> ozone-datanode-1  | 2022-08-03 22:05:54 WARN  EndpointStateMachine:236 - Unable to communicate to SCM server at scm:9861 for past 0 seconds.
> ozone-datanode-1  | java.io.IOException: More than one directories found for 01a173a0-6bd2-478a-8598-05df3a6f318a: [/mydata/dn1/01a173a0-6bd2-478a-8598-05df3a6f318a, /mydata/dn2/01a173a0-6bd2-478a-8598-05df3a6f318a]
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.chooseStorageDir(ServerState.java:177)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:113)
> ozone-datanode-1  |     at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201)
> {code}
> This jira is filed to track and resolve the issue.  The issue was identified 
> and discussed in an earlier PR for the hdds volume diskchecker, PR #2158, 
> https://github.com/apache/ozone/pull/2158#issuecomment-836580999.
> The idea from the PR was to omit the affected directories and continue.  This 
> could be done in either of two ways:
> i.) with a checker that runs before the XceiverServerRatis starts; if such a 
> checker already exists in current Ozone, how should it be configured to 
> resolve this issue?
> ii.) modify the Ratis code to remove the affected directories and continue 
> instead of throwing an unhandled IOException, see 
> https://github.com/apache/ratis/blob/040bc52e19a5e36f5710ccd4fc1981e862e691e8/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L107-L117.
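
Option i.) in the description, a check that runs before the XceiverServerRatis starts, could be sketched roughly as below. This is an illustration only and not existing Ozone or Ratis code: the class name DuplicateGroupDirChecker and the method findDuplicates are hypothetical, and the sketch assumes each Ratis group lives in a subdirectory named after the group ID directly under a configured storage root, as the paths in the logged error suggest. It only detects the condition; whether to omit the affected directories, mark a volume unhealthy, or stop the datanode is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical pre-start check; not part of Ozone or Ratis. */
public final class DuplicateGroupDirChecker {
  private DuplicateGroupDirChecker() { }

  /**
   * Returns every subdirectory name that exists under more than one storage
   * root, mapped to all the locations where it was found.
   */
  public static Map<String, List<Path>> findDuplicates(List<Path> storageRoots)
      throws IOException {
    Map<String, List<Path>> byName = new TreeMap<>();
    for (Path root : storageRoots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing root is a separate volume-check concern
      }
      // Files.list returns a lazily populated stream that must be closed.
      try (Stream<Path> children = Files.list(root)) {
        children.filter(Files::isDirectory).forEach(dir ->
            byName.computeIfAbsent(dir.getFileName().toString(),
                k -> new ArrayList<>()).add(dir));
      }
    }
    // Keep only names seen under two or more roots.
    byName.values().removeIf(paths -> paths.size() < 2);
    return byName;
  }
}
```

A caller could run this over the locations listed in dfs.container.ratis.datanode.storage.dir and log or act on any non-empty result before the Ratis server is constructed, rather than discovering the problem through the IOException from ServerState.chooseStorageDir.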



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
