Mike Percy created KUDU-2795:
--------------------------------
Summary: Prevent cascading failures by detecting that disks are
full and rejecting attempts to add additional replicas to a tablet server
Key: KUDU-2795
URL: https://issues.apache.org/jira/browse/KUDU-2795
Project: Kudu
Issue Type: Task
Components: master, tserver
Affects Versions: 1.8.0
Reporter: Mike Percy

Over the weekend, a case was reported in which tablet server disks were
nearly full across a Kudu cluster. One tablet server finally reached the
tipping point and crashed because its WAL disk ran out of space and a write
failed. This caused a cascading failure: the replicas on that tablet server
were re-replicated to the remaining cluster nodes, pushing them past the
tipping point as well, and eventually the whole cluster crashed.

We could potentially prevent the cascading failure by detecting that a tablet
server is nearly full and rejecting or preventing attempts to move additional
replicas to that server while it is in the "yellow zone" of disk space
availability. In other words, we would prefer under-replicated tablets over an
unavailable cluster.
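
A minimal sketch of the "yellow zone" check described above, written in Python
for illustration only. Everything here is an assumption and not Kudu's actual
implementation or API: the 10% free-space threshold, the names DiskStatus,
in_yellow_zone and eligible_servers, and the idea that placement sees one
free-space figure per tablet server.

```python
import shutil
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical threshold: a server whose free space drops below this
# fraction of total capacity is considered to be in the "yellow zone".
YELLOW_ZONE_FREE_FRACTION = 0.10


@dataclass
class DiskStatus:
    """Free-space snapshot for one tablet server (illustrative type)."""
    total_bytes: int
    free_bytes: int

    @property
    def free_fraction(self) -> float:
        # Treat a zero-sized or unknown disk as having no free space.
        if self.total_bytes <= 0:
            return 0.0
        return self.free_bytes / self.total_bytes


def probe_disk(path: str) -> DiskStatus:
    """Read current usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return DiskStatus(total_bytes=usage.total, free_bytes=usage.free)


def in_yellow_zone(status: DiskStatus,
                   threshold: float = YELLOW_ZONE_FREE_FRACTION) -> bool:
    """Return True if the server should refuse additional replicas."""
    return status.free_fraction < threshold


def eligible_servers(servers: Dict[str, DiskStatus],
                     threshold: float = YELLOW_ZONE_FREE_FRACTION) -> List[str]:
    """Return the servers that may still receive a new replica.

    An empty result means the tablet stays under-replicated instead of
    pushing a nearly full server over the edge.
    """
    return sorted(name for name, status in servers.items()
                  if not in_yellow_zone(status, threshold))
```

With this filter, re-replication after a crash would skip any server that is
already nearly full. If no server qualifies, the tablet remains
under-replicated, which is the trade-off proposed in this issue.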
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)