[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread Matt Stump (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219863#comment-14219863
 ] 

Matt Stump commented on CASSANDRA-8150:
---

I wanted to add more evidence and exposition for the above recommendations so 
that my argument can be better understood.

Young generation GC is pretty simple. The new heap is broken down into three 
segments: eden and two survivor spaces, s0 and s1. All new objects are 
allocated in eden, and once eden reaches a size threshold a minor GC is 
triggered. After the minor GC all surviving objects are moved to one of the 
survivor spaces; only one survivor space is active at a time. At the same time 
that GC for eden is triggered, a GC for the active survivor space is also 
triggered. All live objects from both eden and the active survivor space are 
copied to the other survivor space, and both eden and the previously active 
survivor space are wiped clean. Objects will bounce between the two 
survivor spaces until the MaxTenuringThreshold is hit (the default in C* is 1). 
Once an object survives MaxTenuringThreshold collections it's copied 
to the tenured space, which is governed by a different collector, in our 
instance CMS, but it could just as easily be G1. This act of copying is called 
promotion. The promotion from young generation to tenured space is what takes a 
long time, so if you see long ParNew GC pauses it's because many objects are 
being promoted. You decrease ParNew collection times by decreasing promotion.
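To watch this promotion behavior directly, GC logging can be turned on in cassandra-env.sh. A sketch, using the Java 7/8-era HotSpot flags (the log path below is an example, not a Cassandra default):

```shell
# Observe ParNew survivor aging and promotion (Java 7/8 HotSpot flags).
# The log path is illustrative; pick whatever suits your install.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"  # bytes at each tenuring age
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```

PrintTenuringDistribution reports, after every minor GC, how many bytes sit at each tenuring age, which is exactly where excessive promotion shows up.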

What can cause many objects to be promoted? Objects that have survived 
both the initial eden collection and MaxTenuringThreshold 
collections in the survivor space. The main tunables are the sizes of the 
various spaces in young gen and the MaxTenuringThreshold. Increasing the 
young generation decreases the frequency at which we have to run GC, 
because more objects can accumulate before we reach 75% capacity. Increasing 
both the young generation and the MaxTenuringThreshold gives short-lived 
objects more time to die, and dead objects don't get promoted. 
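As a concrete sketch of those tunables in cassandra-env.sh form (the sizes here are hypothetical and workload-dependent, not recommendations):

```shell
# Hypothetical example: a larger new gen plus a higher tenuring threshold.
JVM_OPTS="$JVM_OPTS -Xmn4G"                      # new gen, ~40-50% of an 8-10G heap
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"         # eden:s0:s1 sized 8:1:1
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=6"  # more chances to die young
```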

The vast majority of objects in C* are ephemeral, short-lived objects. The only 
things that should live in tenured space are the key cache and, in releases < 2.1, 
memtables. If most objects die in survivor space you've solved the long GC 
pauses for both the young gen and tenured spaces. 

As a data point, on the mixed cluster where we've been experimenting with 
these options most aggressively, the longest CMS pause in a 24 hour period went 
from over 10s to less than 900ms, and most nodes experienced a max of less than 
500ms. This is just the max CMS, which could include an outlier like 
defragmentation. Average CMS is significantly less: under 100ms. For ParNew 
collections we went from many pauses in excess of 200ms to a max of 15ms 
cluster-wide and an average of 5ms. ParNew collection frequency decreased from 
one per second to one every 10 seconds in the worst case, and one every 16 
seconds in the average case.

This also unlocks additional throughput on large machines. For 20-core 
machines I was able to increase throughput from 75k TPS to 110-120k TPS. For a 
40-core machine we more than doubled request throughput and significantly 
increased compaction throughput. 

I've asked a number of other large customers to help validate the new 
settings. I now view GC pauses as a mostly solvable issue.

 Simplify and enlarge new heap calculation
 -

 Key: CASSANDRA-8150
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8150
 Project: Cassandra
  Issue Type: Improvement
  Components: Config
Reporter: Matt Stump
Assignee: Brandon Williams

 It's been found that the old Twitter recommendations of 100m per core up to 
 800m are harmful and should no longer be used.
 Instead the formula used should be 1/3 or 1/4 max heap with a max of 2G. 1/3 
 or 1/4 is debatable and I'm open to suggestions. If I were to hazard a guess 
 1/3 is probably better for releases greater than 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219967#comment-14219967
 ] 

Jeremiah Jordan commented on CASSANDRA-8150:


While I agree all of that sounds nice for read-heavy workloads, have you used 
these settings with a write-heavy workload?

From my experience, when you have a write-heavy workload your young gen fills 
up with memtable data, which will, and should, be promoted to old gen.  So if 
you set your young gen size high, it takes forever to copy all that stuff to 
old gen.  If you increase the MaxTenuringThreshold it makes that even worse, 
as all of the memtable data has to get copied back and forth inside young gen 
X times, and then there is even more memtable stuff that builds up, so 
the copy to old gen takes that much longer.



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread Matt Stump (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220149#comment-14220149
 ] 

Matt Stump commented on CASSANDRA-8150:
---

I don't disagree with your experience, but I do disagree with the description of 
what is happening.  With the GC frequency that I described above, the memtable 
will be moved to tenured space after about 60-80 seconds. All of the individual 
requests will create ephemeral objects which would ideally be handled by 
ParNew. 

Where we went wrong was growing the heap but not also increasing 
MaxTenuringThreshold. By default we set MaxTenuringThreshold to 1, which means 
promote everything that survives 2 GCs to tenured. Coupled with a heap that is 
small for the workload, this results in a very high promotion rate, which is why 
we see the delays. The key is to always increase MaxTenuringThreshold and young 
gen more or less proportionally.  From the perspective of GC and the creation 
rate of ephemeral objects, reads and writes are more or less identical. One 
could even make the case that writes are better suited to the settings 
I've outlined above, because writes should put less pressure on eden due to the 
simpler request path. In my opinion, and I hope to have data to back this up 
soon, write-heavy vs. read-heavy GC tuning is mostly a red herring. 
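The 60-80 second figure above follows from simple arithmetic: an object is promoted only after surviving MaxTenuringThreshold minor collections. A sketch, assuming a steady collection interval (the function name is mine, for illustration):

```shell
# time_to_tenure THRESHOLD INTERVAL_SECONDS
# Seconds an object must stay live before promotion, assuming one minor GC
# every INTERVAL_SECONDS and promotion after THRESHOLD survived collections.
time_to_tenure() {
    echo $(( $1 * $2 ))
}

time_to_tenure 6 10   # threshold 6, ParNew every 10s -> 60
time_to_tenure 6 13   # threshold 6, ParNew every 13s -> 78
```

With the ParNew frequencies reported earlier (one collection every 10-16 seconds) and a threshold of 6, anything that lives roughly a minute or more, like a memtable, gets tenured; request-scoped objects die long before that.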



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220212#comment-14220212
 ] 

T Jake Luciani commented on CASSANDRA-8150:
---

Let's run some cstar tests with write and read workloads...



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220242#comment-14220242
 ] 

T Jake Luciani commented on CASSANDRA-8150:
---

I also learned we should not be using biased locking.  Here is a sample run 
showing 2.1 without and with biased locking disabled:

http://cstar.datastax.com/graph?stats=0f0ec9a6-710c-11e4-af11-bc764e04482c&metric=op_rate&operation=2_read&smoothing=1&show_aggregates=true&xmin=0&xmax=98.89&ymin=0&ymax=273028.8

{code}
-XX:-UseBiasedLocking
{code}



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-20 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220480#comment-14220480
 ] 

Liang Xie commented on CASSANDRA-8150:
--

IMHO, the MaxTenuringThreshold setting should be different per cluster, or 
rather, per read/write pattern. In the past, when I tuned another similar NoSQL 
system in our internal production clusters, I found I needed to set it to a 
different value per cluster to get an optimized state (e.g. one cluster was 3, 
but another cluster was 8 or something else).
[~tjake], totally agreed with your point! I had seen several long safepoints due 
to biased locking in Hadoop systems :)



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-11-19 Thread Matt Stump (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218353#comment-14218353
 ] 

Matt Stump commented on CASSANDRA-8150:
---

I'm going to advocate strongly for 40-50% of the heap for eden by default. 

Additionally I'm going to advocate that we change the following:
- Increase the ceiling on MAX_HEAP to 20G.
- Increase the MaxTenuringThreshold to 6 or 8.
- Include the Instagram CMS enhancements.
- Increase thread/core affinity for GC.
- Set -XX:ParallelGCThreads and -XX:ConcGCThreads to min(20, number of cores).
- Possibly set -XX:MaxGCPauseMillis to 20ms, but I haven't really tested this 
one.

Instagram CMS settings:
{code}
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
{code}

Thread/core affinity settings:
{code}
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
{code}

I've seen decreased GC pause frequency, decreased latency, and an increase in 
throughput using the above recommendations with customer installations. This 
was observed with both read-heavy and balanced workloads.

+[~rbranson] +[~tjake]



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-10-27 Thread Matt Stump (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185270#comment-14185270
 ] 

Matt Stump commented on CASSANDRA-8150:
---

That's not quite true:
https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh#L75-L77

We'll use min((100m * cores), (1/4 * max_heap)).

Please reopen.
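For reference, the linked sizing logic can be sketched as follows (a paraphrase in a standalone function, not the literal script; see cassandra-env.sh for the real calculation):

```shell
# Sketch of the current new-gen sizing: min(100MB * cores, max_heap / 4).
# young_gen_mb CORES MAX_HEAP_MB -> chosen new-gen size in MB
young_gen_mb() {
    local cores=$1 max_heap_mb=$2
    local per_core=$(( cores * 100 ))
    local quarter=$(( max_heap_mb / 4 ))
    if [ "$per_core" -lt "$quarter" ]; then
        echo "$per_core"
    else
        echo "$quarter"
    fi
}

young_gen_mb 8 8192    # min(800, 2048)  -> 800
young_gen_mb 40 8192   # min(4000, 2048) -> 2048
```

Note how the 1/4-of-heap cap dominates on high-core-count machines, which is the case this ticket is about.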



[jira] [Commented] (CASSANDRA-8150) Simplify and enlarge new heap calculation

2014-10-20 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177260#comment-14177260
 ] 

Brandon Williams commented on CASSANDRA-8150:
-

We already use 1/4 since 2.0.
