Mike Percy created KUDU-2795:
--------------------------------

             Summary: Prevent cascading failures by detecting that disks are 
full and rejecting attempts to add additional replicas to a tablet server
                 Key: KUDU-2795
                 URL: https://issues.apache.org/jira/browse/KUDU-2795
             Project: Kudu
          Issue Type: Task
          Components: master, tserver
    Affects Versions: 1.8.0
            Reporter: Mike Percy


Over the weekend, a case was reported in which the tablet server disks were 
near-full across a Kudu cluster. One tablet server finally reached the tipping 
point and crashed because its WAL disk ran out of space and a write failed. 
This caused a cascading failure: the replicas on that tablet server were 
re-replicated to the rest of the cluster nodes, pushing them beyond the 
tipping point as well, and eventually the whole cluster crashed.

We could potentially prevent the cascading failure by detecting that a tablet 
server is nearly full and rejecting or preventing attempts to move additional 
replicas to that server while it is in the "yellow zone" of disk space 
availability. In other words, we would prefer under-replicated tablets over 
an unavailable cluster.
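
As a rough illustration of the idea, the check could look something like the 
sketch below. This is not Kudu code: the names (DiskStats, in_yellow_zone, 
eligible_targets) and the 10% free-space threshold are assumptions made up for 
this example, and a real implementation would live in the master's replica 
placement logic and/or the tablet server's handling of new replica requests.

```python
import shutil
from dataclasses import dataclass

# Hypothetical threshold: a server with less than 10% free space is treated
# as being in the "yellow zone". The right value is an open question.
YELLOW_ZONE_FREE_FRACTION = 0.10


@dataclass
class DiskStats:
    """Free and total bytes for the disk backing a tablet server (hypothetical type)."""
    free_bytes: int
    total_bytes: int


def in_yellow_zone(stats, min_free_fraction=YELLOW_ZONE_FREE_FRACTION):
    """Return True if the free fraction of the disk is below the threshold."""
    if stats.total_bytes <= 0:
        # Unknown or invalid capacity: be conservative and refuse new replicas.
        return True
    return stats.free_bytes / stats.total_bytes < min_free_fraction


def local_disk_stats(path):
    """Read free/total bytes for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return DiskStats(free_bytes=usage.free, total_bytes=usage.total)


def eligible_targets(servers):
    """Filter a {server_name: DiskStats} mapping down to servers that may
    receive additional replicas. An empty result means the tablet stays
    under-replicated rather than pushing another server over the edge."""
    return [name for name, stats in servers.items() if not in_yellow_zone(stats)]
```

With this sketch, a server reporting 5 bytes free out of 100 would be skipped 
as a re-replication target, while one reporting 40 free out of 100 would still 
be eligible. If every server is in the yellow zone, no target is returned and 
the tablet remains under-replicated.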



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
