Adar Dembo created KUDU-2904:
--------------------------------
Summary: Master shouldn't allow master tablet operations after a
disk failure
Key: KUDU-2904
URL: https://issues.apache.org/jira/browse/KUDU-2904
Project: Kudu
Issue Type: Bug
Components: fs, master
Affects Versions: 1.11.0
Reporter: Adar Dembo
The master doesn't register any FS error handlers. As a result, when a disk
failure doesn't intrinsically crash the server (e.g. a failure of one of
several directories), the master tablet is not failed and may undergo
additional maintenance manager (MM) ops. This is forbidden: the invariant is
that a tablet with a failed disk should itself fail. In the master, the
behavior should perhaps be more severe (e.g. the master should crash itself).
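
To make the invariant concrete, here is a minimal, self-contained sketch of the
intended behavior. It does not use Kudu's actual classes: FsErrorNotifier,
MasterTablet, and RegisterMasterErrorHandler are hypothetical names invented
for illustration, and the directory path passed in is a placeholder. The only
point is that once a disk failure is reported, the tablet is marked failed and
every later maintenance op is refused.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the FS layer's error notification hook.
// This is NOT Kudu's real API; it only models "call handlers on disk failure".
class FsErrorNotifier {
 public:
  using Handler = std::function<void(const std::string& dir)>;

  void Register(Handler handler) {
    assert(handler);  // an empty handler would silently drop failures
    handlers_.push_back(std::move(handler));
  }

  // Invoked by the (hypothetical) FS layer when a directory fails.
  void NotifyDiskFailure(const std::string& dir) const {
    for (const auto& handler : handlers_) handler(dir);
  }

 private:
  std::vector<Handler> handlers_;
};

// Hypothetical model of the master tablet: once failed, it refuses all ops.
class MasterTablet {
 public:
  void MarkFailed() { failed_.store(true); }
  bool failed() const { return failed_.load(); }

  // Returns true if the op ran, false if it was refused because the tablet
  // has already failed.
  bool RunMaintenanceOp(const std::string& op_name) {
    if (failed()) return false;
    completed_ops_.push_back(op_name);
    return true;
  }

  std::size_t num_completed_ops() const { return completed_ops_.size(); }

 private:
  std::atomic<bool> failed_{false};
  std::vector<std::string> completed_ops_;
};

// The missing step described above: register a handler so that a disk
// failure fails the tablet. A stricter variant could abort the process here.
inline void RegisterMasterErrorHandler(FsErrorNotifier* notifier,
                                       MasterTablet* tablet) {
  notifier->Register(
      [tablet](const std::string& /*dir*/) { tablet->MarkFailed(); });
}
```

With the handler registered, a compaction attempted after NotifyDiskFailure()
returns false instead of running. Without the registration call, the failure
goes unnoticed and later ops proceed, which is the situation reported here.
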
This surfaced in a user report: a master ran multiple minor delta compactions
even after one of them had failed during a SyncDir() call in its superblock
flush. The metadata was corrupt: the blocks added to the superblock by the
compaction were marked as deleted in the log block manager (LBM). It's unclear whether the
in-memory state of the superblock was corrupted by the failure and subsequent
compactions, or whether the corruption was caused by something else. Either
way, no operations should have been permitted following the initial failure.