[
https://issues.apache.org/jira/browse/CASSANDRA-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436259#comment-15436259
]
Stefania commented on CASSANDRA-11594:
--------------------------------------
I've reproduced a leak of directory descriptors with a
[test|https://github.com/stef1927/cassandra-dtest/commit/6ca5ae864676589132e656105da4c204621692be#diff-60812631a43b8e1f0c9fb53d9f7487ebR209].
The bug mentioned above would leak file descriptors for directories if:
# A transaction log file is present, which means there is an ongoing
transaction such as flushing or compaction (note that repair uses many
validation compactions)
# A request to list the files in a table directory is issued in parallel,
which would occur in the following cases:
## when loading sstables at startup or via nodetool
## when calculating the snapshots size via nodetool listsnapshots or reading
the SnapshotSize metric
## when adding new tables to the schema or updating the keyspace
## when creating indexes or materialized views, or when they get rebuilt
Of all the points above, I think the most likely is the SnapshotSize metric
being monitored, which is how I reproduced the leak in my test: by inserting
data in parallel with a repair and with a nodetool listsnapshots (which
exercises the same code as reading SnapshotSize does).
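The general failure mode is easy to demonstrate outside Cassandra. The sketch below is only an illustration of the mechanism, not Cassandra's actual code: a {{java.nio.file.Files.list}} stream that is never closed keeps the underlying directory descriptor open, so repeated listings of the same table directory accumulate open DIR entries under lsof. The class and file names here are made up for the demo.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class DirStreamLeakDemo {

    // Leaky variant: the Stream returned by Files.list() wraps an open
    // DirectoryStream; without close() the directory descriptor stays
    // open until the stream is garbage collected, if ever.
    static long leakyCount(Path dir) throws IOException {
        return Files.list(dir).count();
    }

    // Safe variant: try-with-resources closes the underlying
    // DirectoryStream and releases the descriptor immediately.
    static long safeCount(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("fd-leak-demo");
        Files.createFile(dir.resolve("ma-1-big-Data.db"));
        // Each leaky call pins one more directory descriptor for the same
        // path; under lsof the process shows one DIR entry per call.
        for (int i = 0; i < 5; i++) {
            leakyCount(dir);
        }
        System.out.println(safeCount(dir));
    }
}
{code}

If the listing races with an ongoing transaction as described above, each such unclosed listing shows up in lsof as another DIR entry against the same table directory, which is the pattern in the output attached to this ticket.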
So [~n0rad], do you monitor SnapshotSize by any chance, or can you think of
any of the other points above that might apply? If you want to test the
patch, let us know and I can apply it to your release version.
> Too many open files on directories
> ----------------------------------
>
> Key: CASSANDRA-11594
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11594
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: n0rad
> Assignee: Stefania
> Priority: Critical
> Attachments: Grafana Cassandra Cluster.png, openfiles.zip,
> screenshot.png
>
>
> I have a 6-node cluster in prod across 3 racks.
> Each node has:
> - 4Gb of commitlogs in 343 files
> - 275Gb of data in 504 files
> On Saturday, 1 node in each rack crashed with too many open files (it seems
> to be the same node in each rack).
> {code}
> lsof -n -p $PID gives me 66899 out of 65826 max
> {code}
> It contains 64527 open directories (2371 unique).
> Part of the list:
> {code}
> java 19076 root 2140r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2141r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2142r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2143r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2144r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2145r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2146r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2147r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2148r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2149r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2150r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2151r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2152r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2153r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2154r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java 19076 root 2155r DIR 8,17 143360 4386718705
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> {code}
> The 3 other nodes crashed 4 hours later.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)