[ https://issues.apache.org/jira/browse/RATIS-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz-wo Sze updated RATIS-1332:
------------------------------
Fix Version/s: 2.0.0
> Ratis server couldn't be recovered from failed initialization state
> -------------------------------------------------------------------
>
> Key: RATIS-1332
> URL: https://issues.apache.org/jira/browse/RATIS-1332
> Project: Ratis
> Issue Type: Bug
> Components: server
> Reporter: Marton Elek
> Assignee: Janus Chow
> Priority: Blocker
> Fix For: 2.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> I found this problem while testing Ratis 2.0.0-rc3 and earlier versions.
> I noticed that in some cases the Ozone Manager (with Ratis enabled)
> could no longer be started (see HDDS-4703 for details).
> After some investigation I found the following sequence:
> 1. The Ratis server is initialized BEFORE the OM RPC server
> (OzoneManager.startRpcServer).
> 2. If the RPC server fails to start (due to missing DNS, for example), the
> Ratis server is stopped during initialization.
> 3. AtomicOutputStream can leave tmp files behind (such as raft-meta.tmp,
> if it has not yet been renamed).
> 4. After the DNS problem is fixed, the OM still cannot be started, because
> RaftStorageImpl.analyzeAndRecoverStorage requires a FORMATTED or empty (!!!)
> directory. A directory with a leftover tmp file is not empty.
> {code}
> private StorageState analyzeAndRecoverStorage(boolean toLock) throws IOException {
>   StorageState storageState = storageDir.analyzeStorage(toLock);
>   if (storageState == StorageState.NORMAL) {
>     // ...
>   } else if (storageState == StorageState.NOT_FORMATTED
>       && storageDir.isCurrentEmpty()) {
>     // never reached if a .tmp file is left over from a previous attempt
>     format();
>     return StorageState.NORMAL;
>   } else {
>     return storageState;
>   }
> }
> {code}
> The problem is that `cleanMetaTmpFile();` is called only in the first branch,
> and not before checking whether the directory is empty...
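
One possible direction, sketched outside of Ratis: remove leftover tmp files first, and only then test whether the directory is empty. The class and method names below (TmpCleanupSketch, cleanTmpFiles, isEmpty) are hypothetical and are not the Ratis API; the sketch also assumes that every *.tmp file directly under the directory is safe to delete, which the real fix would have to confirm.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-alone sketch; not the Ratis implementation.
final class TmpCleanupSketch {

  /** Deletes regular files matching *.tmp directly under dir and returns how many were deleted. */
  static int cleanTmpFiles(Path dir) throws IOException {
    int deleted = 0;
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.tmp")) {
      for (Path p : stream) {
        if (Files.isRegularFile(p) && Files.deleteIfExists(p)) {
          deleted++;
        }
      }
    }
    return deleted;
  }

  /** Returns true if dir has no entries at all. */
  static boolean isEmpty(Path dir) throws IOException {
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      return !stream.iterator().hasNext();
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a failed first start that left raft-meta.tmp behind.
    Path current = Files.createTempDirectory("current");
    Files.createFile(current.resolve("raft-meta.tmp"));

    System.out.println("empty before cleanup: " + isEmpty(current));
    cleanTmpFiles(current);
    System.out.println("empty after cleanup: " + isEmpty(current));
  }
}
```

With this ordering, the emptiness check no longer sees the leftover tmp file, so a branch guarded by an "is empty" condition could be taken again after a failed first start.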