[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414249#comment-16414249 ]

Andrew Wong commented on KUDU-2359:
-----------------------------------

This should be doable by extending the architecture in place for the `kudu fs update_dirs` tool.

The caveat here, as with the update tool, is that any tablets with data on the missing data directory are (for the update tool) or should be (here) started up in a failed state so they can be evicted and re-replicated elsewhere. For the update tool, we have operators confront this tradeoff by requiring them to specify the `--force` flag. Ideally a similar flag could be used here, so that the mean time to recovery is at least gated by the time it takes to update a flag, rather than the time it takes to run `kudu fs update_dirs`.

It also raises the question: would operators even care about those failed tablets? If our re-replication story is robust enough to handle everything on its own, the flag could be seen as a pointless configuration. I suppose exposing it as a flag initially would give us that sort of info.

> tserver should allow starting with a small number of missing data dirs
> ----------------------------------------------------------------------
>
>                 Key: KUDU-2359
>                 URL: https://issues.apache.org/jira/browse/KUDU-2359
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the
> server is restarted. Currently, Kudu will respond to this by failing to
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok()
> Bad status: Already present: FS layout already exists; not overwriting
> existing layout. See
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html:
> unable to create file system roots: FSManager roots already exist:
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk
> failure" work. One could use the update_data_dirs tool to remove the missing
> disk, but you'd also need to persistently change the configuration of the
> daemon, which is hard to do with a consistent configuration management.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)