[ 
https://issues.apache.org/jira/browse/CASSANDRA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020954#comment-14020954
 ] 

Jacek Furmankiewicz commented on CASSANDRA-7361:
------------------------------------------------

Thanks Robert.

The counter-intuitive part is turning off row caching for the large CFs; those 
are precisely the ones we want to optimize for fast reads (because we cannot 
fit them into our L1 caches within our own heap). We will probably have to 
change them to "keys only" caching and see how that works for us.
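
In case it helps to be concrete, this is the sort of change we have in mind 
(keyspace/CF names below are just placeholders, and it assumes the 2.0-style 
string form of the caching option):

{code}
-- switch a large CF from row caching to key caching only (Cassandra 2.0.x CQL)
ALTER TABLE our_keyspace.large_cf WITH caching = 'keys_only';

-- cqlsh "DESCRIBE TABLE our_keyspace.large_cf" can be used to confirm it took
{code}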

On our end, we've had nothing but good experience with G1. We replaced tons of 
highly tuned CMS settings with a simple "use G1 with an expected pause of 10 ms" 
and it performed better, or at least just as well. We have never seen a 
problematic stop-the-world (STW) pause with it, even under very heavy load 
during performance testing.
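
For anyone else who wants to try the same thing, our settings boil down to 
roughly the following in conf/cassandra-env.sh (a sketch of our own tuning, not 
an endorsed configuration; the 10 ms pause target is a goal G1 tries to meet, 
not a guarantee):

{code}
# replace the stock CMS flags with G1
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
# pause-time goal; G1 adjusts young gen sizing and concurrent work to try to meet it
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=10"
# keep GC logging on so any full GC shows up immediately
JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
{code}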

The parts that still bother me are:

a) we did not have this row cache issue with 1.1
b) Cassandra "hangs on" to the keys and does not release them, long after 
they've been read

Something about both of these does not seem right, but if the C* team is not 
bothered by either of them, we will have to live with it and work around it.
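
If we do end up working around it, it will probably be something blunt such as 
watching the cache sizes and invalidating the caches between batch runs with 
stock nodetool commands (just a sketch, not something we have settled on):

{code}
# show key/row cache sizes and hit rates for the node
nodetool info

# drop the row cache (and, if needed, the key cache) between batch loads
nodetool invalidaterowcache
nodetool invalidatekeycache
{code}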


> Cassandra locks up in full GC when you assign the entire heap to row cache
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7361
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7361
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu, RedHat, JDK 1.7
>            Reporter: Jacek Furmankiewicz
>            Priority: Minor
>         Attachments: histogram.png, leaks_report.png, top_consumers.png
>
>
> We have a long running batch load process, which runs for many hours.
> It performs a massive amount of writes in large mutation batches (we increase 
> the thrift frame size to 45 MB).
> Everything goes well, but after about 3 hrs of processing everything locks 
> up. We start getting NoHostsAvailable exceptions on the Java application side 
> (with Astyanax as our driver), eventually socket timeouts.
> Looking at Cassandra, we can see that it is using nearly the full 8 GB of heap 
> and is unable to free it. It spends most of its time in full GC, but the amount 
> of memory in use does not go down.
> Here is a long sample from jstat showing this over an extended time period:
> http://aep.appspot.com/display/NqqEagzGRLO_pCP2q8hZtitnuVU/
> This continues even after we shut down our app. Nothing is connected to 
> Cassandra any more, yet it is still stuck in full GC and cannot free up 
> memory.
> Running nodetool tpstats shows that nothing is pending, all seems OK:
> {quote}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0       69555935         0                 0
> RequestResponseStage              0         0              0         0                 0
> MutationStage                     0         0       73123690         0                 0
> ReadRepairStage                   0         0              0         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0              0         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0             46         0                 0
> MemoryMeter                       0         0           1125         0                 0
> FlushWriter                       0         0            824         0                30
> ValidationExecutor                0         0              0         0                 0
> InternalResponseStage             0         0             23         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MemtablePostFlusher               0         0           1783         0                 0
> MiscStage                         0         0              0         0                 0
> PendingRangeCalculator            0         0              1         0                 0
> CompactionExecutor                0         0          74330         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0              0         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> PAGED_RANGE                  0
> BINARY                       0
> READ                       585
> MUTATION                 75775
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
> {quote}
> We had this happen on 2 separate boxes, one with 2.0.6, the other with 2.0.8.
> Right now this is a total blocker for us. We are unable to process the 
> customer's data and have to abort in the middle of a large processing run.
> This is a new customer, so we did not have a chance to see if this occurred 
> with 1.1 or 1.2 in the past (we moved to 2.0 recently).
> We still have the Cassandra process running; please let us know if there is 
> anything else we could run to give you more insight.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
