Adar Dembo created KUDU-2904:
--------------------------------

             Summary: Master shouldn't allow master tablet operations after a disk failure
                 Key: KUDU-2904
                 URL: https://issues.apache.org/jira/browse/KUDU-2904
             Project: Kudu
          Issue Type: Bug
          Components: fs, master
    Affects Versions: 1.11.0
            Reporter: Adar Dembo


The master doesn't register any FS error handlers. As a result, when a disk 
failure doesn't intrinsically crash the server (i.e. a failure in one of 
several directories), the master tablet is not failed and may undergo 
additional maintenance manager (MM) ops. This is forbidden: the invariant is 
that a tablet with a failed disk should itself fail. In the master, perhaps the 
behavior should be more severe (i.e. the master should crash itself).
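
To make the invariant concrete, here is a minimal, self-contained C++ sketch. 
It is not Kudu code: FsErrorNotifier, Tablet, and DemoScenario are 
hypothetical names invented for illustration, and the real FS error handling 
interfaces in Kudu may look quite different. The sketch only shows the general 
shape: a tablet registers a handler with the filesystem layer, the handler 
marks the tablet as failed when one of its directories fails, and no further 
maintenance ops are admitted afterwards.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a filesystem layer that reports disk failures
// to whoever registered a handler. Not a real Kudu class.
class FsErrorNotifier {
 public:
  using Callback = std::function<void(const std::string& failed_dir)>;

  void RegisterHandler(Callback cb) { handlers_.push_back(std::move(cb)); }

  // Invoked by I/O code when a directory-level failure is detected.
  void NotifyDiskFailure(const std::string& failed_dir) const {
    for (const auto& cb : handlers_) cb(failed_dir);
  }

 private:
  std::vector<Callback> handlers_;
};

// Hypothetical tablet that enforces the invariant: once any of its
// directories has failed, the tablet is failed and rejects maintenance ops.
class Tablet {
 public:
  explicit Tablet(std::vector<std::string> dirs) : dirs_(std::move(dirs)) {}

  void OnDiskFailure(const std::string& failed_dir) {
    for (const auto& dir : dirs_) {
      if (dir == failed_dir) failed_ = true;
    }
  }

  // Returns false, without doing any work, if the tablet has failed.
  bool TryRunMaintenanceOp() {
    if (failed_) return false;
    ++ops_run_;
    return true;
  }

  bool failed() const { return failed_; }
  int ops_run() const { return ops_run_; }

 private:
  std::vector<std::string> dirs_;
  bool failed_ = false;
  int ops_run_ = 0;
};

// Walks through the scenario: one op succeeds, a disk fails, the next op
// is rejected. Returns true if the invariant held throughout.
bool DemoScenario() {
  FsErrorNotifier notifier;
  Tablet master_tablet({"/data/1", "/data/2"});  // illustrative paths

  // The step the report says is missing in the master: registering a handler.
  notifier.RegisterHandler([&master_tablet](const std::string& dir) {
    master_tablet.OnDiskFailure(dir);
  });

  const bool before = master_tablet.TryRunMaintenanceOp();
  notifier.NotifyDiskFailure("/data/2");
  const bool after = master_tablet.TryRunMaintenanceOp();

  assert(master_tablet.failed());
  return before && !after && master_tablet.ops_run() == 1;
}
```

Without the RegisterHandler call, the notification would reach nobody and the 
second op would be admitted, which mirrors the behavior described in this 
report. A stricter variant for the master could abort the process inside the 
handler instead of only marking the tablet as failed.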

This surfaced in a user report of multiple minor delta compactions running on 
a master even after one of them had failed during a SyncDir() call in its 
superblock flush. The metadata was corrupt: the blocks added to the superblock 
by the compaction were marked as deleted in the log block manager (LBM). It's 
unclear whether the in-memory state of the superblock was corrupted by the 
failure and subsequent compactions, or whether the corruption was caused by 
something else. Either way, no operations should have been permitted following 
the initial failure.



