Alexey Serbin created KUDU-3025:
-----------------------------------

             Summary: Add metric for the open file descriptors usage vs the limit
                 Key: KUDU-3025
                 URL: https://issues.apache.org/jira/browse/KUDU-3025
             Project: Kudu
          Issue Type: Improvement
          Components: master, tserver
            Reporter: Alexey Serbin


When replicas are distributed evenly across all available nodes, once one 
tablet server hits the maximum number of open file descriptors and goes down 
(e.g., upon hosting another tablet replica), the system will automatically 
re-replicate that server's tablet replicas onto the remaining tablet servers, 
most likely pushing them over the limit and bringing them down as well.  That's 
a cascading failure scenario that nobody wants to experience.

Monitoring the number of open file descriptors vs the limit can help to prevent 
a full Kudu cluster outage in such a case, if operators are given a chance to 
handle those situations proactively.  Once some threshold is reached (e.g., 
90%), an operator could raise the limit via the corresponding {{ulimit}} 
setting, preventing an outage.
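
To illustrate the kind of check such a metric would enable, here is a minimal 
Python sketch, not Kudu code.  It assumes a Linux host (it counts entries under 
/proc/self/fd) and compares the count against the soft RLIMIT_NOFILE limit, 
which is the value {{ulimit -n}} reports.  The function names and the 0.9 
default threshold are hypothetical, chosen only to mirror the 90% example 
above.

```python
import os
import resource


def open_fd_count(fd_dir="/proc/self/fd"):
    """Count this process's open file descriptors (Linux-only assumption).

    Listing the directory briefly opens a descriptor of its own, so the
    result may be one higher than the steady-state count.
    """
    return len(os.listdir(fd_dir))


def fd_soft_limit():
    """Return the soft RLIMIT_NOFILE limit for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def fd_usage_ratio(open_fds, limit):
    """Return open_fds / limit, or 0.0 if the limit is unlimited or invalid."""
    if limit == resource.RLIM_INFINITY or limit <= 0:
        return 0.0
    return open_fds / limit


def over_threshold(open_fds, limit, threshold=0.9):
    """True once usage reaches the (hypothetical) alerting threshold."""
    return fd_usage_ratio(open_fds, limit) >= threshold


if __name__ == "__main__":
    used, limit = open_fd_count(), fd_soft_limit()
    print(f"open fds: {used}, soft limit: {limit}, "
          f"alert: {over_threshold(used, limit)}")
```

A real metric would be exported by the master and tablet server processes 
themselves; this sketch only shows the usage-vs-limit comparison an operator 
or monitoring system would act on.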



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
