[ 
https://issues.apache.org/jira/browse/KUDU-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bankim Bhavsar updated KUDU-2904:
---------------------------------
    Code Review: https://gerrit.cloudera.org/c/14642/

> Master shouldn't allow master tablet operations after a disk failure
> --------------------------------------------------------------------
>
>                 Key: KUDU-2904
>                 URL: https://issues.apache.org/jira/browse/KUDU-2904
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, master
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Assignee: Bankim Bhavsar
>            Priority: Critical
>              Labels: newbie
>
> The master doesn't register any FS error handlers, which means that in the
> event of a disk failure that doesn't intrinsically crash the server (i.e. a
> disk failure affecting one of several directories), the master tablet is not
> failed and may undergo additional maintenance manager (MM) ops. This is
> forbidden: the invariant is that a tablet with a failed disk should itself
> fail. In the master the behavior should perhaps be more severe (i.e. perhaps
> the master should crash itself).
>
> This surfaced with a user report of multiple minor delta compactions on a
> master even after one of them had failed during a SyncDir() call on its
> superblock flush. The metadata was corrupt: the blocks added to the
> superblock by the compaction were marked as deleted in the log block manager
> (LBM). It's unclear whether the in-memory state of the superblock was
> corrupted by the failure and subsequent compactions, or whether the
> corruption was caused by something else. Either way, no operations should
> have been permitted following the initial failure.
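
The invariant described in the issue (a tablet with a failed disk should itself fail, and no further operations should run on it) can be illustrated with a minimal, single-threaded C++ sketch. Every name below (FsErrorNotifier, Tablet, FailTabletOnDiskError) is hypothetical and chosen for illustration only; this is not Kudu's actual interface, nor the change under review in Gerrit.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a filesystem layer that reports disk failures
// to registered handlers. Not Kudu's real API.
class FsErrorNotifier {
 public:
  using Handler = std::function<void(const std::string& failed_dir)>;

  // Registers a callback to run whenever a disk failure is reported.
  void RegisterHandler(Handler handler) {
    assert(handler);  // An empty std::function would throw when invoked.
    handlers_.push_back(std::move(handler));
  }

  // Would be called by the I/O path when an operation on 'dir' fails,
  // for example a failed directory sync.
  void NotifyDiskFailure(const std::string& dir) const {
    for (const auto& handler : handlers_) {
      handler(dir);
    }
  }

 private:
  std::vector<Handler> handlers_;
};

// Hypothetical tablet that refuses further work once it has been failed.
class Tablet {
 public:
  void MarkFailed() { failed_ = true; }
  bool failed() const { return failed_; }

  // Returns false without doing any work if the tablet has failed.
  bool RunMaintenanceOp(const std::string& /*op_name*/) {
    if (failed_) {
      return false;
    }
    ++ops_run_;
    return true;
  }

  int ops_run() const { return ops_run_; }

 private:
  bool failed_ = false;  // A real server would need this to be thread-safe.
  int ops_run_ = 0;
};

// Wires the tablet to the notifier so that any reported disk failure fails
// the tablet. The tablet must outlive the notifier's use of the handler.
inline void FailTabletOnDiskError(FsErrorNotifier* notifier, Tablet* tablet) {
  notifier->RegisterHandler(
      [tablet](const std::string& /*failed_dir*/) { tablet->MarkFailed(); });
}
```

In this sketch, a maintenance op attempted after NotifyDiskFailure() is rejected instead of running against possibly inconsistent metadata. A stricter handler, along the lines the issue suggests for the master, could terminate the process instead of only marking the tablet failed.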



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
