[ 
https://issues.apache.org/jira/browse/KUDU-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-1840.
--------------------------------------
       Resolution: Duplicate
    Fix Version/s: n/a

This is really a dupe of KUDU-616, which I wasn't able to find until Mike sent 
it to me.

> Tolerate disk failures on single tablet servers
> -----------------------------------------------
>
>                 Key: KUDU-1840
>                 URL: https://issues.apache.org/jira/browse/KUDU-1840
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Jean-Daniel Cryans
>             Fix For: n/a
>
>
> The way we store data on disk is akin to striping or RAID 0: losing one disk 
> means that the data on the other disks isn't recoverable either.
> After replacing a bad disk, users would see something like:
> {noformat}
> Jan 18, 10:20:55.693 AM  INFO  server_base.cc:179  
> Could not load existing FS layout: Not found: /data/4/kudu/instance: No such 
> file or directory (error 2)
> Jan 18, 10:20:55.693 AM  INFO  server_base.cc:180  
> Creating new FS layout
> Jan 18, 10:20:55.693 AM  FATAL  tablet_server_main.cc:64  
> Check failed: _s.ok() Bad status: Already present: Could not create new FS 
> layout: FSManager root is not empty: /data/1/kudu-wal
> {noformat}
> The above shows a tablet server finding that one data folder is empty and 
> trying to create a new FS layout, then crashing because the other folders 
> still contain data. Currently the workaround is to manually delete the data 
> in all the remaining Kudu folders.
> As we fix this, one thing to keep in mind is that WALs can only be stored on 
> one disk, so even if we tolerate data disk failures it would still not help 
> if the WALs' disk dies.
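
The startup sequence described in the issue can be illustrated with a small
toy model. This is only a sketch under stated assumptions, not Kudu's actual
code: the function and exception names are hypothetical, and the only detail
taken from the log is that each root is expected to hold a file named
"instance". It shows why one replaced (empty) disk leads to a fatal error:
a missing instance file is treated as "no layout exists", but creating a new
layout is refused while any other root still has data.

```python
import os
import tempfile


class LayoutError(Exception):
    """Raised when a new layout cannot be created (hypothetical name)."""


def open_or_create_layout(roots):
    """Toy model of the open-then-create sequence seen in the log.

    Assumption: a root counts as initialized if it contains a file
    named "instance". This is NOT Kudu's real implementation.
    """
    missing = [r for r in roots
               if not os.path.isfile(os.path.join(r, "instance"))]
    if not missing:
        return "opened"

    # Any missing instance file is treated as "no layout exists", so a
    # new layout is attempted. That requires every root to be empty.
    for root in roots:
        if os.path.isdir(root) and os.listdir(root):
            raise LayoutError("root is not empty: " + root)

    for root in roots:
        os.makedirs(root, exist_ok=True)
        with open(os.path.join(root, "instance"), "w") as f:
            f.write("toy instance file\n")
    return "created"


def demo():
    """Create a layout, reopen it, then simulate a replaced disk."""
    with tempfile.TemporaryDirectory() as base:
        roots = [os.path.join(base, "data%d" % i) for i in range(3)]
        first = open_or_create_layout(roots)
        second = open_or_create_layout(roots)

        # A replaced disk comes back empty: its instance file is gone.
        os.remove(os.path.join(roots[2], "instance"))
        try:
            open_or_create_layout(roots)
            third = "no error"
        except LayoutError:
            third = "failed"
        return first, second, third


if __name__ == "__main__":
    print(demo())
```

In this model the only way to get past the error is to empty every remaining
root, which mirrors the manual workaround mentioned in the description.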



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
