[ 
https://issues.apache.org/jira/browse/CASSANDRA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020905#comment-14020905
 ] 

Robert Stupp commented on CASSANDRA-7361:
-----------------------------------------

Regarding Java GCs and heap size (not just C* related): 
JVMs should never have a heap size that exceeds 30GB, or you get stuck with 
very long GC phases (even with the concurrent ParNew/CMS collectors). And using 
such a large heap requires you to check and optimize GC settings - not just 
once, it is an iteration of trial and error. Larger heaps (16-30GB) might be 
good - but it really depends on the characteristics of the application, the 
current workload, the characteristics of the data, the performance (CPU, RAM) 
of your hardware, etc. Generally I would not increase the C* heap size above 
8GB. I tried C* with larger heaps (12 and 16 GB per node) but there was no 
measurable improvement.
Heap sizes above 30GB also increase the probability of full GCs - possibly 
just because CMS is not able to work as fast as necessary.
G1 GC is afaik an STW (stop-the-world) collector. I tried it with Java 6 and 
Java 7 and could not find any benefit compared to ParNew/CMS (in a custom 
application, not C*) - in fact G1 resulted in lower application throughput and 
large response-time spikes (caused by G1's STW characteristic).
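To make the ParNew/CMS starting point concrete, here is a sketch of the kind of flags involved, in the style of the cassandra-env.sh defaults. The exact sizes are illustrative assumptions per node, not recommendations - as said above, tuning them is an iteration of trial and error:

```shell
# Sketch of a conservative ParNew/CMS setup (values illustrative, per-node):
JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"               # fixed 8GB heap; min == max avoids resize pauses
JVM_OPTS="$JVM_OPTS -Xmn800M"                    # young generation size
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"            # parallel young-gen collector
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"     # CMS for the old generation
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
# Start CMS early enough that it finishes before the old gen fills up,
# otherwise you fall back to a stop-the-world full GC:
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
```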

Regarding your current situation:
It might be worth trying a newer Java version. There are a lot of GC 
(ParNew/CMS/G1) improvements in Java 8, and even minor releases of Java 7 
regularly get improvements.
If you are really familiar with GCs and their behavior, you may try G1 and 
different GC settings - but only do this if you know what you are doing and 
have time left to spend on GC tuning. A single "good shot" cannot be 
generalized.
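If you do experiment with G1, the minimal switch is only a couple of flags - everything else is best left at its default until you have measurements. A sketch, not a recommendation:

```shell
# Illustrative G1 setup. G1 tunes itself around a pause-time goal, so do not
# also set -Xmn/SurvivorRatio etc. when using it:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"   # pause-time target (a goal, not a guarantee)
```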

Regarding the row cache and cache expiries (I am not a C* "core" developer): 
try turning off the row cache for "big" CFs. The operating system's block 
cache might be sufficient on its own (the OS block cache can work very 
efficiently because of C*'s append-only write characteristics).
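Turning the row cache off for a single CF is a per-table property in 2.0. A sketch via cqlsh - the keyspace and table names here are placeholders:

```shell
# Disable row caching (keep key caching) for one CF.
# "my_keyspace" and "big_cf" are placeholder names.
cqlsh -e "ALTER TABLE my_keyspace.big_cf WITH caching = 'keys_only';"
```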

And it might be worth using DataStax OpsCenter to view the relevant 
node/cluster/CF performance numbers. I like the graphs for ParNew/CMS GC 
time/count, OS memory/load/CPU, key/row cache hit ratio/count, and disk 
latency/throughput/IOPS. It also lets you view the bloom filter false 
positive ratio (which is very interesting if you know the access counts). 
Altogether these graphs can tell you a lot about your cluster and your 
application - but it requires that you spend time understanding what the 
numbers really mean.
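Most of these numbers are also available without OpsCenter, straight from nodetool on any node (hostname is a placeholder):

```shell
# Per-CF statistics, including the "Bloom filter false ratio" line
# and memtable/SSTable counts:
nodetool -h cassandra-node1 cfstats

# Heap usage plus key cache / row cache sizes and hit rates:
nodetool -h cassandra-node1 info
```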

> Cassandra locks up in full GC when you assign the entire heap to row cache
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7361
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7361
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu, RedHat, JDK 1.7
>            Reporter: Jacek Furmankiewicz
>            Priority: Minor
>         Attachments: histogram.png, leaks_report.png, top_consumers.png
>
>
> We have a long running batch load process, which runs for many hours.
> Massive amount of writes, in large mutation batches (we increase the thrift 
> frame size to 45 MB).
> Everything goes well, but after about 3 hrs of processing everything locks 
> up. We start getting NoHostsAvailable exceptions on the Java application side 
> (with Astyanax as our driver), eventually socket timeouts.
> Looking at Cassandra, we can see that it is using nearly the full 8GB of heap 
> and unable to free it. It spends most of its time in full GC, but the amount 
> of memory does not go down.
> Here is a long sample from jstat to show this over an extended time period
> e.g.
> http://aep.appspot.com/display/NqqEagzGRLO_pCP2q8hZtitnuVU/
> This continues even after we shut down our app. Nothing is connected to 
> Cassandra any more, yet it is still stuck in full GC and cannot free up 
> memory.
> Running nodetool tpstats shows that nothing is pending, all seems OK:
> {quote}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0       69555935         0                 0
> RequestResponseStage              0         0              0         0                 0
> MutationStage                     0         0       73123690         0                 0
> ReadRepairStage                   0         0              0         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0              0         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0             46         0                 0
> MemoryMeter                       0         0           1125         0                 0
> FlushWriter                       0         0            824         0                30
> ValidationExecutor                0         0              0         0                 0
> InternalResponseStage             0         0             23         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MemtablePostFlusher               0         0           1783         0                 0
> MiscStage                         0         0              0         0                 0
> PendingRangeCalculator            0         0              1         0                 0
> CompactionExecutor                0         0          74330         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0              0         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> PAGED_RANGE                  0
> BINARY                       0
> READ                       585
> MUTATION                 75775
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
> {quote}
> We had this happen on 2 separate boxes, one with 2.0.6, the other with 2.0.8.
> Right now this is a total blocker for us. We are unable to process the 
> customer data and have to abort in the middle of large processing.
> This is a new customer, so we did not have a chance to see if this occurred 
> with 1.1 or 1.2 in the past (we moved to 2.0 recently).
> We have the Cassandra process still running, pls let us know if there is 
> anything else we could run to give you more insight.



--
This message was sent by Atlassian JIRA
(v6.2#6252)