Andrew Wong created KUDU-3228:
---------------------------------

             Summary: Background process that checks for corrupted block and 
failed disks
                 Key: KUDU-3228
                 URL: https://issues.apache.org/jira/browse/KUDU-3228
             Project: Kudu
          Issue Type: Improvement
          Components: cfile, fs, tserver
            Reporter: Andrew Wong


Currently, CFile corruption and failed disks result in the affected tablets 
being marked as failed and re-replicated elsewhere, with any scans that were 
in progress on them being retried at other servers.

Rather than waiting for the first bad access to do this, we may want to 
implement a background task that checks for corruption and proactively 
re-replicates such tablets. That way, especially when there are long periods of 
client inactivity, we can get the faulty-hardware-related re-replication out of 
the way.

The task should probably only run when the tserver isn't serving many scans or 
writes. When checking for CFile corruption, it should also avoid polluting the 
block cache.
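
A minimal sketch of one pass of such a task is below, written in Python for 
brevity rather than Kudu's C++. Every name in it (BlockRef, read_uncached, 
is_busy, scan_for_corruption) is hypothetical and not part of Kudu, and 
zlib.crc32 is only a stand-in for whatever checksum the CFile format actually 
uses. It assumes the caller supplies a read function that bypasses the block 
cache and a predicate that reports whether the server is busy.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Iterable, Set, Tuple


@dataclass(frozen=True)
class BlockRef:
    """Hypothetical handle for one on-disk block and its recorded checksum."""
    tablet_id: str
    block_id: str
    expected_crc: int


def scan_for_corruption(
    blocks: Iterable[BlockRef],
    read_uncached: Callable[[str], bytes],
    is_busy: Callable[[], bool],
    max_blocks_per_pass: int = 100,
) -> Tuple[Set[str], int]:
    """Run one checking pass and return (bad tablet ids, blocks checked).

    read_uncached is assumed to read a block straight from disk without
    inserting it into the block cache. is_busy is assumed to report whether
    the server is currently handling many scans or writes.
    """
    bad_tablets: Set[str] = set()
    checked = 0
    for ref in blocks:
        # Yield to foreground work and bound the cost of a single pass.
        if is_busy() or checked >= max_blocks_per_pass:
            break
        checked += 1
        try:
            data = read_uncached(ref.block_id)
        except OSError:
            # Treat an I/O error as a failed disk for this tablet.
            bad_tablets.add(ref.tablet_id)
            continue
        if zlib.crc32(data) != ref.expected_crc:
            # Checksum mismatch: the block is considered corrupted.
            bad_tablets.add(ref.tablet_id)
    return bad_tablets, checked
```

The returned tablet ids would then be handed to whatever existing path marks a 
tablet as failed and triggers re-replication. Stopping as soon as is_busy() 
returns true is one simple way to keep the check out of the way of client 
traffic; a real implementation would likely also remember where it stopped.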

HDFS has a "disk checker" task that may be worth drawing inspiration from.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
