[
https://issues.apache.org/jira/browse/KUDU-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bankim Bhavsar updated KUDU-2904:
---------------------------------
Status: In Review (was: In Progress)
> Master shouldn't allow master tablet operations after a disk failure
> --------------------------------------------------------------------
>
> Key: KUDU-2904
> URL: https://issues.apache.org/jira/browse/KUDU-2904
> Project: Kudu
> Issue Type: Bug
> Components: fs, master
> Affects Versions: 1.11.0
> Reporter: Adar Dembo
> Assignee: Bankim Bhavsar
> Priority: Critical
> Labels: newbie
>
> The master doesn't register any FS error handlers, which means that in the
> event of a disk failure that doesn't intrinsically crash the server (e.g. a
> failure of the disk backing one of several directories), the master tablet is
> not failed and may undergo additional maintenance manager (MM) ops. This is
> forbidden: the invariant is that a tablet with a failed disk should itself
> fail. In the master perhaps the behavior should be more severe (i.e. perhaps
> the master should crash itself).
> This surfaced with a user report in which a master kept running minor delta
> compactions even after one of them had failed during a SyncDir() call in its
> superblock flush. The metadata was corrupt: the blocks added to the
> superblock by the compaction were marked as deleted in the log block manager
> (LBM). It's unclear whether the in-memory state of the superblock was
> corrupted by the failure and subsequent compactions, or whether the
> corruption was caused by something else. Either way, no operations should
> have been permitted following the initial failure.
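
The invariant described above (a disk failure must fail the tablet, and a failed tablet must refuse further maintenance ops) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Kudu's actual code: ErrorNotifier, MasterTablet, RegisterDiskFailureHandler and every method on them are hypothetical names invented for this example. A master might reasonably choose to crash in the handler instead of marking the tablet failed, as the description suggests.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical error categories; only disk failure matters for this sketch.
enum class ErrorType { kDiskFailure };

// Hypothetical stand-in for a filesystem-level error notifier.
class ErrorNotifier {
 public:
  using Callback = std::function<void(const std::string& dir)>;

  void SetHandler(ErrorType type, Callback cb) {
    assert(cb != nullptr);
    handlers_[type] = std::move(cb);
  }

  // Called by the storage layer when an I/O call (e.g. a directory sync)
  // fails. Returns false if no handler was registered, meaning the failure
  // went unobserved, which is the situation the bug report describes.
  bool Notify(ErrorType type, const std::string& dir) const {
    auto it = handlers_.find(type);
    if (it == handlers_.end()) return false;
    it->second(dir);
    return true;
  }

 private:
  std::map<ErrorType, Callback> handlers_;
};

// Hypothetical model of the master's single tablet.
class MasterTablet {
 public:
  void MarkFailed(const std::string& dir) {
    failed_ = true;
    failed_dir_ = dir;
  }
  bool failed() const { return failed_; }

  // Gate for maintenance ops such as compactions: once the tablet has
  // failed, nothing further is allowed to run.
  bool TryRunMaintenanceOp() {
    if (failed_) return false;
    ++ops_run_;
    return true;
  }
  int ops_run() const { return ops_run_; }

 private:
  bool failed_ = false;
  std::string failed_dir_;
  int ops_run_ = 0;
};

// Wires the tablet to the notifier so that a disk failure fails the tablet.
// A stricter variant could call std::abort() here to crash the master.
void RegisterDiskFailureHandler(ErrorNotifier* notifier, MasterTablet* tablet) {
  notifier->SetHandler(
      ErrorType::kDiskFailure,
      [tablet](const std::string& dir) { tablet->MarkFailed(dir); });
}
```

Without the call to RegisterDiskFailureHandler(), Notify() returns false and the tablet keeps accepting maintenance ops, which mirrors the reported behavior. With it, the first disk failure marks the tablet failed and every later TryRunMaintenanceOp() is rejected.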