Andrew Wong commented on KUDU-2359:

I spent some time looking at a test cluster that had a few bad disks and saw 
the following behavior in its logs. One of the servers had failed while running 
Kudu 1.5 (pre-disk-failure handling). For some time following the failures, the 
server would attempt to start up and fail immediately with:

{{Fatal I/O error, context: /data/6/kudu/instance}}

After a few months of this (the server remaining down), the error changed:

{{Check failed: _s.ok() Bad status: Already present: Could not create new FS 
layout: FSManager root is not empty: /data/1/kudu}}

This message indicates that Kudu couldn't find an instance file for a data 
directory. Upon examining the FS a bit more, I noticed that 
/data/6/kudu/instance was indeed missing, but seemingly not because the disk 
had been removed and replaced. Rather, it seems that the instance file vanished 
after some time on the failed disk, and this is something that we need to 
consider.

{{cat: /data/6/kudu/instance: No such file or directory}}

{{ls: cannot access /data/6/kudu: No such file or directory}}

{{ls: reading directory /data/6: Input/output error}}
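
To make the distinction concrete, here is a minimal Python sketch of how a 
missing instance file on a readable directory could be told apart from one that 
is missing because the underlying disk is failing. This is not Kudu's actual 
startup logic; the function name {{probe_data_dir}} and the returned labels are 
hypothetical, and it assumes a failed disk surfaces as EIO, as in the {{ls}} 
output above.

```python
import errno
import os


def probe_data_dir(root, instance_name="instance"):
    """Classify a data directory root (hypothetical helper, not a Kudu API).

    Returns one of: "ok", "missing_instance", "missing_root", "io_error".
    """
    try:
        os.stat(os.path.join(root, instance_name))
        return "ok"
    except OSError as e:
        if e.errno == errno.EIO:
            return "io_error"
        if e.errno not in (errno.ENOENT, errno.ENOTDIR):
            raise

    # The instance file is absent. Check the root and then its parent (the
    # mount point) to see whether the disk is unreadable or the directory is
    # simply not there.
    parent = os.path.dirname(root.rstrip(os.sep))
    for candidate in (root, parent):
        try:
            os.listdir(candidate)
        except (FileNotFoundError, NotADirectoryError):
            continue
        except OSError as e:
            if e.errno == errno.EIO:
                return "io_error"
            raise
        return "missing_instance" if candidate == root else "missing_root"
    return "missing_root"
```

On the server above, probing /data/6/kudu this way would be expected to report 
an I/O error rather than a fresh, empty directory, since reading /data/6 itself 
fails with EIO.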

> tserver should allow starting with a small number of missing data dirs
> ----------------------------------------------------------------------
>                 Key: KUDU-2359
>                 URL: https://issues.apache.org/jira/browse/KUDU-2359
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Todd Lipcon
>            Priority: Major
> Often when a disk fails, its mount point will not come back up when the 
> server is restarted. Currently, Kudu will respond to this by failing to 
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() 
> Bad status: Already present: FS layout already exists; not overwriting 
> existing layout. See 
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: 
> unable to create file system roots: FSManager roots already exist: 
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk 
> failure" work. One could use the update_data_dirs tool to remove the missing 
> disk, but you'd also need to persistently change the configuration of the 
> daemon, which is hard to do with consistent configuration management.
