[ 
https://issues.apache.org/jira/browse/RATIS-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1332:
------------------------------
    Fix Version/s: 2.0.0

> Ratis server couldn't be recovered from failed initialization state
> -------------------------------------------------------------------
>
>                 Key: RATIS-1332
>                 URL: https://issues.apache.org/jira/browse/RATIS-1332
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Marton Elek
>            Assignee: Janus Chow
>            Priority: Blocker
>             Fix For: 2.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I found this problem while testing Ratis 2.0.0-rc3 and earlier versions.
> I noticed that in some cases the Ozone Manager (with Ratis enabled) could
> no longer be started (see HDDS-4703 for details).
> After some investigation I found the following sequence:
>  1. The Ratis server is initialized BEFORE the OM RPC server
> (OzoneManager.startRpcServer).
>  2. If the RPC server fails to start (due to missing DNS, for example), the
> Ratis server is stopped during initialization.
>  3. AtomicOutputStream can leave tmp files behind (such as raft-meta.tmp,
> if it has not yet been renamed).
>  4. After the DNS problem is fixed, the OM still cannot be started, because
> RaftStorageImpl.analyzeAndRecoverStorage requires a FORMATTED or empty (!!!)
> directory. A directory with a leftover tmp file is not empty.
> {code}
>   private StorageState analyzeAndRecoverStorage(boolean toLock)
>       throws IOException {
>     StorageState storageState = storageDir.analyzeStorage(toLock);
>     if (storageState == StorageState.NORMAL) {
>       // ...
>     } else if (storageState == StorageState.NOT_FORMATTED &&
>         storageDir.isCurrentEmpty()) {
>       // never reached if a .tmp file is left over from a previous attempt
>       format();
>       return StorageState.NORMAL;
>     } else {
>       return storageState;
>     }
>   }
> {code}
> The problem is that `cleanMetaTmpFile();` is called only in the first branch,
> rather than before the check of whether the directory is empty.
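
The behaviour described above can be reproduced outside Ratis with a small standalone sketch. It is only an illustration and not necessarily the fix that was committed: the class `TmpCleanupSketch` and the methods `isEmpty`, `cleanMetaTmpFile` and `canFormat` are hypothetical stand-ins written for this example, not the Ratis API. Only the file name `raft-meta.tmp` comes from the report. The sketch shows that a leftover tmp file makes the emptiness check fail, and that removing it before the check lets formatting proceed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the storage directory checks; not the Ratis API. */
public class TmpCleanupSketch {
  // File name taken from the report; everything else here is assumed.
  static final String META_TMP = "raft-meta.tmp";

  /** Returns true if the directory contains no entries. */
  static boolean isEmpty(Path dir) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      return !entries.iterator().hasNext();
    }
  }

  /** Deletes a leftover meta tmp file; returns true if one was removed. */
  static boolean cleanMetaTmpFile(Path dir) throws IOException {
    return Files.deleteIfExists(dir.resolve(META_TMP));
  }

  /** Decides whether an unformatted directory may be formatted. */
  static boolean canFormat(Path dir, boolean cleanFirst) throws IOException {
    if (cleanFirst) {
      cleanMetaTmpFile(dir); // remove the leftover before the emptiness check
    }
    return isEmpty(dir);
  }

  public static void main(String[] args) throws IOException {
    Path current = Files.createTempDirectory("current");
    Files.createFile(current.resolve(META_TMP)); // simulate an interrupted write

    System.out.println("format allowed without cleanup: " + canFormat(current, false));
    System.out.println("format allowed with cleanup: " + canFormat(current, true));

    Files.delete(current); // the directory is empty again at this point
  }
}
```

Running `main` prints `false` for the first check and `true` for the second. Whether the cleanup belongs in `analyzeAndRecoverStorage` itself or elsewhere in the real code is a design decision this sketch does not settle.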



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
