[ 
https://issues.apache.org/jira/browse/CASSANDRA-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826912#comment-13826912
 ] 

graham sanderson commented on CASSANDRA-6275:
---------------------------------------------

Yes, I believe we can mitigate the problem in the OpsCenter case; however, it 
is a good test bed since it makes the problem easy to spot. Note that it seems 
to be worse under high read/write activity on tracked keyspaces/CFs, which 
makes sense.

Note I was poking (somewhat blindly) through the (2.0.2) code (partly out of 
interest) looking for what might be leaking these file handles, and I also 
found a heap dump. I discovered what turned out to be 
https://issues.apache.org/jira/browse/CASSANDRA-6358, which leaks 
FileDescriptors, though their refCounts all seemed to be 0. In any case there 
weren't enough of them (total FileDescriptors in the heap dump) to account for 
the problem. They were also for mem-mapped files (the ifile in SSTableReader), 
and none of the leaked deleted file handles were mem-mapped (since they were 
compressed data files).
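
For anyone wanting to cross-check the handle counts outside of lsof, here is a 
minimal sketch, assuming a Linux /proc filesystem, where the link target of a 
descriptor for an unlinked file ends in " (deleted)". The class and method 
names (DeletedFdSketch, countDeleted) are made up for illustration and are not 
part of Cassandra.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class DeletedFdSketch {
    /**
     * Scans a directory of descriptor symlinks (e.g. /proc/PID/fd) and counts,
     * per link target, the entries whose target ends with " (deleted)".
     */
    static Map<String, Integer> countDeleted(Path fdDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(fdDir)) {
            for (Path entry : entries) {
                String target;
                try {
                    target = Files.readSymbolicLink(entry).toString();
                } catch (IOException | UnsupportedOperationException e) {
                    // Descriptor was closed while scanning, or entry is not a symlink.
                    continue;
                }
                if (target.endsWith(" (deleted)")) {
                    counts.merge(target, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to this JVM's own descriptors; pass /proc/PID/fd for another
        // process (requires permission to read that directory).
        Path fdDir = Paths.get(args.length > 0 ? args[0] : "/proc/self/fd");
        for (Map.Entry<String, Integer> e : countDeleted(fdDir).entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Many descriptors pointing at the same deleted Data.db file would match the 
pattern seen in the lsof output for this issue.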

That said, CASSANDRA-6358 was pinning the SSTableReaders in memory (since the 
Runnable was an anonymous inner class), so someone with more knowledge of the 
code might have a better idea of whether this might be a problem (other than 
the memory leak).
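
To illustrate the pinning mechanism in general terms (this is not the actual 
Cassandra code), here is a minimal sketch. Reader is a hypothetical stand-in 
for a heavyweight object such as an SSTableReader, and holdsReader is a made-up 
helper that uses reflection to show whether a task object carries a reference 
to it.

```java
import java.lang.reflect.Field;

public class PinningSketch {
    /** Hypothetical stand-in for a heavyweight object such as an SSTableReader. */
    static class Reader {
        final String name;
        final byte[] payload = new byte[1024];

        Reader(String name) { this.name = name; }

        /** Anonymous inner class: reads 'name' via Reader.this, so it keeps the Reader alive. */
        Runnable pinningTask() {
            return new Runnable() {
                @Override public void run() {
                    System.out.println("tidying " + name);
                }
            };
        }

        /** Static nested class: copies only the data it needs, no reference to the Reader. */
        Runnable detachedTask() {
            return new DetachedTask(name);
        }
    }

    static final class DetachedTask implements Runnable {
        private final String name;
        DetachedTask(String name) { this.name = name; }
        @Override public void run() { System.out.println("tidying " + name); }
    }

    /** True if the task's class declares a field (including synthetic ones) of type Reader. */
    static boolean holdsReader(Runnable task) {
        for (Field f : task.getClass().getDeclaredFields()) {
            if (f.getType() == Reader.class) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Reader reader = new Reader("example");
        System.out.println("anonymous task holds Reader: " + holdsReader(reader.pinningTask()));
        System.out.println("detached task holds Reader:  " + holdsReader(reader.detachedTask()));
    }
}
```

As long as a task like pinningTask() stays reachable (for example in an 
executor queue), the Reader it was created from cannot be collected.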

I don't have an environment yet where I can easily build and install code 
changes, though we could downgrade our system test environment to 2.0.0 to see 
if we can reproduce the problem there. I am unsure whether we can downgrade to 
1.2.X easily given our current testing.

Note that while I was looking at the code I came across CASSANDRA-5555. What 
caught my eye was the interaction between FileCacheService and RAR.deallocate, 
and more specifically the fact that this change added a concurrent structure 
inside another, separate concurrent structure. It seemed like there might be a 
case where a RAR was recycled into a concurrent queue which had already been 
completely removed and deallocated, in which case it would get GCed without 
close, presumably causing a file handle leak on the native side. However, I 
couldn't come up with any convincing interactions that would cause this to 
happen without some very, very unlucky things happening (and my knowledge of 
the Google cache implementation was even more limited!), so this is unlikely 
to be the cause of this issue (especially if the issue doesn't happen in the 
1.2.7+ branch).
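
To make the suspected interleaving concrete, here is a minimal sketch. It is 
not the FileCacheService code: the names (OrphanedQueueSketch, Handle, evict) 
are hypothetical, a plain ConcurrentHashMap stands in for the cache, and the 
"unlucky" ordering is replayed deterministically on a single thread rather 
than raced.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class OrphanedQueueSketch {
    /** Hypothetical stand-in for a pooled reader that owns a file handle. */
    static final class Handle {
        final AtomicBoolean closed = new AtomicBoolean(false);
        void close() { closed.set(true); }
    }

    /** Outer concurrent structure (path -> pool) holding inner concurrent structures (the pools). */
    static final ConcurrentMap<String, Queue<Handle>> cache = new ConcurrentHashMap<>();

    /** Eviction: remove the pool from the cache and close whatever it contains right now. */
    static void evict(String path) {
        Queue<Handle> pool = cache.remove(path);
        if (pool == null) return;
        for (Handle h; (h = pool.poll()) != null; ) {
            h.close();
        }
    }

    /** Replays the unlucky ordering step by step and returns the recycled handle. */
    static Handle replayUnluckyInterleaving() {
        String path = "example-Data.db";
        cache.put(path, new ConcurrentLinkedQueue<Handle>());
        Handle handle = new Handle();

        Queue<Handle> pool = cache.get(path); // 1: recycler looks up the pool
        evict(path);                          // 2: eviction removes and drains the pool
        pool.offer(handle);                   // 3: recycler adds to the orphaned pool
        return handle;                        // nothing will ever close this handle
    }

    public static void main(String[] args) {
        Handle handle = replayUnluckyInterleaving();
        System.out.println("closed=" + handle.closed.get()
                + ", poolStillCached=" + cache.containsKey("example-Data.db"));
    }
}
```

Each structure is thread-safe on its own, but the lookup-then-offer pair is 
not atomic with respect to eviction, which is the kind of gap described above. 
Whether the real code can actually hit this ordering is a separate question.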

> 2.0.x leaks file handles
> ------------------------
>
>                 Key: CASSANDRA-6275
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6275
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: java version "1.7.0_25"
> Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
> Linux cassandra-test1 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 15:00:18 EDT 
> 2012 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Mikhail Mazursky
>            Assignee: Michael Shuler
>         Attachments: c_file-descriptors_strace.tbz, cassandra_jstack.txt, 
> leak.log, position_hints.tgz, slog.gz
>
>
> Looks like C* is leaking file descriptors when doing lots of CAS operations.
> {noformat}
> $ sudo cat /proc/15455/limits
> Limit                     Soft Limit           Hard Limit           Units    
> Max cpu time              unlimited            unlimited            seconds  
> Max file size             unlimited            unlimited            bytes    
> Max data size             unlimited            unlimited            bytes    
> Max stack size            10485760             unlimited            bytes    
> Max core file size        0                    0                    bytes    
> Max resident set          unlimited            unlimited            bytes    
> Max processes             1024                 unlimited            processes
> Max open files            4096                 4096                 files    
> Max locked memory         unlimited            unlimited            bytes    
> Max address space         unlimited            unlimited            bytes    
> Max file locks            unlimited            unlimited            locks    
> Max pending signals       14633                14633                signals  
> Max msgqueue size         819200               819200               bytes    
> Max nice priority         0                    0                   
> Max realtime priority     0                    0                   
> Max realtime timeout      unlimited            unlimited            us 
> {noformat}
> Looks like the problem is not in limits.
> Before load test:
> {noformat}
> cassandra-test0 ~]$ lsof -n | grep java | wc -l
> 166
> cassandra-test1 ~]$ lsof -n | grep java | wc -l
> 164
> cassandra-test2 ~]$ lsof -n | grep java | wc -l
> 180
> {noformat}
> After load test:
> {noformat}
> cassandra-test0 ~]$ lsof -n | grep java | wc -l
> 967
> cassandra-test1 ~]$ lsof -n | grep java | wc -l
> 1766
> cassandra-test2 ~]$ lsof -n | grep java | wc -l
> 2578
> {noformat}
> Most opened files have names like:
> {noformat}
> java      16890 cassandra 1636r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1637r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1638r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1639r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1640r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1641r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1642r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1643r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1644r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1645r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1646r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1647r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1648r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1649r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1650r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1651r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1652r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1653r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1654r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1655r      REG             202,17 161158485     
> 655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1656r      REG             202,17  88724987     
> 655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> {noformat}
> Also, when that happens it's not always possible to shutdown server process 
> via SIGTERM. Have to use SIGKILL.
> p.s. See mailing thread for more context information 
> https://www.mail-archive.com/user@cassandra.apache.org/msg33035.html



--
This message was sent by Atlassian JIRA
(v6.1#6144)
