[
https://issues.apache.org/jira/browse/KUDU-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-3025:
------------------------------
Component/s: metrics
> Add metric for the open file descriptors usage vs the limit
> -----------------------------------------------------------
>
> Key: KUDU-3025
> URL: https://issues.apache.org/jira/browse/KUDU-3025
> Project: Kudu
> Issue Type: Improvement
> Components: master, metrics, tserver
> Reporter: Alexey Serbin
> Priority: Major
> Labels: Availability, observability, scalability
>
> In the case of even replica distribution across all available nodes, once one
> tablet server hits the maximum number of open file descriptors and goes down
> (e.g., upon hosting another tablet replica), the system will automatically
> re-replicate the tablet replicas from that tablet server, most likely bringing
> other tablet servers down as well. That's a cascading failure scenario that
> nobody wants to experience.
> Monitoring the number of open file descriptors vs the limit can help to
> prevent a full Kudu cluster outage in such a case, if operators are given a
> chance to handle those situations proactively. Once some threshold is
> reached (e.g., 90%), an operator could raise the limit via the corresponding
> {{ulimit}} setting, preventing an outage.
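
The check described above can be sketched roughly as follows. This is only an illustration of comparing open file descriptors against the limit, not Kudu's metric implementation: it inspects the current process on Linux via /proc/self/fd and the soft RLIMIT_NOFILE value, and the function names and the 0.9 default threshold (taken from the 90% example) are hypothetical.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count open descriptors by listing /proc/<pid>/fd (Linux only).

    Listing the directory briefly opens one descriptor itself, so for the
    current process the result can be one higher than the steady state.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage_ratio(open_fds, soft_limit):
    """Return open_fds / soft_limit, or 0.0 if the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def should_alert(open_fds, soft_limit, threshold=0.9):
    """True once usage reaches the threshold (hypothetical default: 90%)."""
    return fd_usage_ratio(open_fds, soft_limit) >= threshold


if __name__ == "__main__":
    # The soft limit is what `ulimit -n` reports for this process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = open_fd_count()
    print(f"open fds: {used}, soft limit: {soft}, "
          f"usage: {fd_usage_ratio(used, soft):.1%}")
    if should_alert(used, soft):
        print("WARNING: open file descriptors are near the limit")
```

Exposing the ratio rather than only the raw count lets an operator alert on one threshold across servers that are configured with different limits.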