Alexey Serbin created KUDU-3026:
-----------------------------------

             Summary: tserver: refuse to host another tablet replica if the 
number of open file descriptors is close to the limit
                 Key: KUDU-3026
                 URL: https://issues.apache.org/jira/browse/KUDU-3026
             Project: Kudu
          Issue Type: Improvement
          Components: master, tserver
            Reporter: Alexey Serbin


In the case of even replica distribution across all available nodes, once one 
tablet server hits the maximum number of open file descriptors and goes down 
(e.g., upon hosting another tablet replica), the system will automatically 
re-replicate tablet replicas from that tablet server. With an even 
distribution, the other tablet servers are likely just as close to their own 
limits, so the extra replicas will most likely bring them down as well. That's 
a cascading failure scenario that nobody wants.

It would be great to change the behavior of tablet servers so they refuse to 
host another tablet replica if they sense that their resources are almost 
exhausted.  The number of open file descriptors is a very good first concrete 
step towards that goal.  This is similar to the memory-pressure-induced 
rejection behavior, but for a different kind of resource.
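
A minimal sketch of the tablet-server-side check described above, written in 
Python for brevity rather than as Kudu code.  The names `fd_usage`, 
`can_host_another_replica` and the 0.9 threshold are assumptions made for 
illustration; the issue does not specify a threshold or an API.  The 
descriptor count is read from /proc/self/fd, so that part is Linux-only.

```python
import os
import resource

# Hypothetical threshold: refuse new replicas once usage reaches 90% of the
# soft limit. The issue does not name a value; this one is illustrative.
FD_USAGE_REJECT_RATIO = 0.9


def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in /proc/self/fd corresponds to one open descriptor. Listing
    # the directory briefly opens one more, so the count is approximate.
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft_limit


def can_host_another_replica(open_fds, soft_limit,
                             ratio=FD_USAGE_REJECT_RATIO):
    """Return True while descriptor usage is below the rejection threshold."""
    if soft_limit == resource.RLIM_INFINITY:
        # No limit configured, so descriptors cannot be the bottleneck.
        return True
    return open_fds < ratio * soft_limit
```

A server would call `can_host_another_replica(*fd_usage())` before accepting a 
new replica and reply with a rejection when it returns False.  Keeping the 
decision in a pure function separate from the measurement makes the threshold 
easy to test and tune.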

The system catalog (master) and other related components should be updated to 
react appropriately upon receiving a rejection to host an additional tablet 
replica.  Also, extra provisions for monitoring the number of open file 
descriptors vs the limit (KUDU-3025) should be implemented to help detect and 
prevent such issues proactively.
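
One possible reaction on the master side can be sketched as follows: when a 
tablet server rejects a replica, try the next candidate instead of failing the 
placement.  This is only an illustration in Python; `place_replica` and the 
`try_host` callback are hypothetical names, and the actual placement logic in 
Kudu is not described in this issue.

```python
def place_replica(candidates, try_host):
    """Try candidates in order until one accepts the replica.

    `candidates` is an ordered list of tablet server identifiers and
    `try_host` is a callable returning True if the server accepts the
    replica. Returns (chosen_server_or_None, servers_that_rejected).
    """
    rejected = []
    for server in candidates:
        if try_host(server):
            return server, rejected
        # Remember the rejection so the caller can avoid this server for a
        # while instead of retrying it immediately.
        rejected.append(server)
    # Every candidate refused: leave the tablet under-replicated rather than
    # forcing the replica onto a server that is close to its limit.
    return None, rejected


# Usage example with two saturated servers out of three.
saturated = {"ts1", "ts2"}
chosen, rejected = place_replica(
    ["ts1", "ts2", "ts3"], lambda server: server not in saturated)
print(chosen, rejected)
```

Returning the list of rejecting servers lets the caller back off from them, 
which avoids repeatedly asking servers that are already near exhaustion.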



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
