[ https://issues.apache.org/jira/browse/KUDU-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans resolved KUDU-1840.
--------------------------------------
Resolution: Duplicate
Fix Version/s: n/a
This is really a dupe of KUDU-616, which I wasn't able to find until Mike sent
it to me.
> Tolerate disk failures on single tablet servers
> -----------------------------------------------
>
> Key: KUDU-1840
> URL: https://issues.apache.org/jira/browse/KUDU-1840
> Project: Kudu
> Issue Type: Improvement
> Components: fs
> Reporter: Jean-Daniel Cryans
> Fix For: n/a
>
>
> The way we store data on disk is akin to striping (RAID 0): losing one disk
> means that the data remaining on the other disks isn't recoverable.
> After replacing a bad disk, users would see something like:
> {noformat}
> Jan 18, 10:20:55.693 AM INFO server_base.cc:179
> Could not load existing FS layout: Not found: /data/4/kudu/instance: No such
> file or directory (error 2)
> Jan 18, 10:20:55.693 AM INFO server_base.cc:180
> Creating new FS layout
> Jan 18, 10:20:55.693 AM FATAL tablet_server_main.cc:64
> Check failed: _s.ok() Bad status: Already present: Could not create new FS
> layout: FSManager root is not empty: /data/1/kudu-wal
> {noformat}
> The above shows a tablet server finding that one folder is empty and trying to
> create a new FS layout, then crashing because the other folders still have
> data. Currently the workaround is to manually delete the data in all the
> remaining Kudu folders.
> As we fix this, one thing to keep in mind is that WALs can only be stored on
> one disk, so even if we tolerate data disk failures it would still not help
> if the WALs' disk dies.
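
For readers following the log in the quoted description, the all-or-nothing
startup check it implies can be sketched roughly as below. This is only an
illustration in Python, not Kudu's actual C++ code: the function
open_or_create_layout, the FsLayoutError class, and the placeholder file
contents are hypothetical, and the "instance" file name is assumed from the
path /data/4/kudu/instance in the log.

```python
import os

# Assumed marker file name, taken from the path in the quoted log.
INSTANCE_FILE = "instance"


class FsLayoutError(Exception):
    """Hypothetical error type standing in for the 'Already present' status."""


def open_or_create_layout(roots):
    """Open an existing layout, or create a new one if any root lacks its marker.

    Creating a new layout requires every root to be empty, so one replaced
    (empty) disk next to surviving non-empty disks makes startup fail.
    """
    missing = [
        root for root in roots
        if not os.path.isfile(os.path.join(root, INSTANCE_FILE))
    ]
    if not missing:
        return "opened"

    # At least one root has no instance file: fall back to creating a new
    # layout, which is refused if any root already holds data.
    for root in roots:
        if os.path.isdir(root) and os.listdir(root):
            raise FsLayoutError(f"FSManager root is not empty: {root}")

    for root in roots:
        os.makedirs(root, exist_ok=True)
        with open(os.path.join(root, INSTANCE_FILE), "w") as marker:
            marker.write("placeholder\n")
    return "created"
```

Calling this on fresh roots returns "created". Emptying one root afterwards
and calling it again raises FsLayoutError naming the first surviving non-empty
root, which mirrors the FATAL line in the log. Tolerating a disk failure would
mean relaxing that second check for data roots, while the WAL disk would
remain a single point of failure as noted above.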