[ 
https://issues.apache.org/jira/browse/CASSANDRA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631559#comment-14631559
 ] 

Wade Simmons commented on CASSANDRA-9382:
-----------------------------------------

I just wanted to add that we ran into this issue with our 2.0.16 cluster. We 
were running range-based repairs after a few nodes had been down, and after a 
short time we had accumulated 75GB of these files left open. We had to restart 
the Cassandra node to recover the leaked disk space.

I looked at Yuki's patch (thanks Yuki!) and it seems to make sense but I don't 
know enough about how this part of Cassandra works to weigh in. Just wanted to 
note that this is an issue that people run into in the wild.

> Snapshot file descriptors not getting purged (possible fd leak)
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-9382
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9382
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Mark Curtis
>            Assignee: Yuki Morishita
>             Fix For: 2.1.x, 2.0.x
>
>         Attachments: yjp-heapdump.png
>
>
> OpsCenter's repair service runs many small range repairs, each of which 
> generates a snapshot as usual. The cluster showed a steady increase in disk 
> usage over the course of a couple of days, and the only way to work around 
> the issue was to restart the node.
> On further inspection, lsof output for the Cassandra process still showed 
> file descriptors for snapshots that no longer existed on the file system. 
> For example:
> {code}
> ava    5822 cassandra  DEL    REG             202,32                 7359833 
> /media/ephemeral1/cassandra/data/somekeyspace/table1/snapshots/669a3a30-f3d3-11e4-bec6-3f6c4fb06498/somekeyspace-table1-jb-897689-Data.db
> {code}
> We also took a heap dump, which showed essentially the same thing: many 
> references to these file handles. We checked the logs for errors, 
> especially any relating to repairs that might have failed, but found 
> nothing.
> The repair service logs in OpsCenter also showed that all repairs (1000s of 
> them) had completed successfully, again indicating that there was no repair 
> issue.
> I have not yet been able to reproduce the issue locally on a test box. The 
> cluster that this original issue appeared on was a production cluster with 
> the following spec:
> cassandra_versions: 2.0.14.352
> cluster_cores: 8
> cluster_instance_types: i2.2xlarge
> cluster_os: Amazon Linux amd64
> node count: 4
> node java version: Oracle Java 1.7.0_51
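
A rough diagnostic sketch, not part of the original report: on Linux, lsof marks a deleted file that is still memory-mapped with DEL in the FD column, and the same mappings appear in /proc/<pid>/maps with a " (deleted)" suffix. Assuming that is what the lsof line in the description shows, the Python below lists such mappings under a snapshots directory. The helper names (deleted_snapshot_mappings, read_maps) are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict

# One line of /proc/<pid>/maps whose backing file has been deleted.
# Fields: address-range perms offset dev inode pathname
MAPS_LINE = re.compile(
    r"^\S+\s+\S+\s+\S+\s+\S+\s+(?P<inode>\d+)\s+(?P<path>/.*) \(deleted\)$"
)


def deleted_snapshot_mappings(maps_text: str) -> Dict[str, str]:
    """Return {inode: path} for deleted snapshot files that are still mapped."""
    found: Dict[str, str] = {}
    for line in maps_text.splitlines():
        match = MAPS_LINE.match(line.strip())
        # Only keep files that live under a snapshots directory.
        if match and "/snapshots/" in match.group("path"):
            found[match.group("inode")] = match.group("path")
    return found


def read_maps(pid: int) -> str:
    """Read the memory map listing of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as handle:
        return handle.read()
```

For example, deleted_snapshot_mappings(read_maps(5822)) would check the process with PID 5822, the PID shown in the lsof line above. It has to run as a user allowed to read that process's /proc entry. A count that keeps growing between repairs would be consistent with the leak described here.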



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
