Re: constant CMS GC using CPU time

2012-10-26 Thread aaron morton
 How does compaction_throughput relate to memory usage?  
Lowering it reduces the rate of memory allocation. 
e.g. Say that normally ParNew can keep up with the rate of memory usage without 
stopping for too long: the rate of promotion is lowish and everything is 
allocated in Eden. If the allocation rate gets higher, ParNew runs become more 
frequent and objects may be promoted to tenured that don't really need to be 
there.  

  I assumed that was more for IO tuning.  I noticed that lowering 
 concurrent_compactors to 4 (from default of 8) lowered the memory used during 
 compactions.
Similar thing to above. This may reduce the number of rows held in memory at 
any instant for compaction. 

Only rows less than in_memory_compaction_limit are loaded into memory during 
compaction. So reducing that may reduce the memory usage.

  Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so 
 the number of rows and data dropped to a level it can handle.
Cool. Sorry it took so long to get there. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot btal...@aeriagames.com wrote:

 On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote:
 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.
 
 Yes, I think that was at least part of the issue.
 
 The rough numbers I've used to estimate working set are:
 
 * bloom filter size for 400M rows at 0.00074 fp, without java fudge (they are just a big array): 714 MB
 * memtable size 1024 MB 
 * index sampling:
   *  24 bytes + key (16 bytes for UUID) = 32 bytes 
   * 400M / 128 default sampling = 3,125,000
   *  3,125,000 * 32 = 95 MB
   * java fudge X5 or X10 = 475MB to 950MB
 * ignoring row cache and key cache
  
 So the high side number is 2,213 MB to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full. 
 
 On a 5120 MB heap, with 800MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload. 
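 
 (A quick back-of-the-envelope version of the estimate above, as a sketch; the constants are just the figures quoted in this thread, not authoritative sizing rules:)
 
     // Rough working-set estimate using the numbers discussed above.
     public class WorkingSetEstimate {
         public static void main(String[] args) {
             long rows = 400_000_000L;
 
             double bloomMb = 714;      // ~0.00074 fp chance, no Java overhead
             double memtableMb = 1024;  // configured memtable space
 
             // Index sampling: ~32 bytes per sampled key, 1 sample per 128 keys (default).
             double samples = rows / 128.0;                    // 3,125,000
             double indexMb = samples * 32 / (1024 * 1024);    // ~95 MB raw
             double low  = bloomMb + memtableMb + indexMb * 5;   // x5 fudge  -> ~2,213 MB
             double high = bloomMb + memtableMb + indexMb * 10;  // x10 fudge -> ~2,688 MB
             System.out.printf("working set roughly %.0f to %.0f MB%n", low, high);
 
             // Heap side: 5120 MB heap - 800 MB new gen leaves ~4300 MB tenured;
             // CMS starts around 75% occupancy, i.e. ~3,240 MB.
             System.out.printf("CMS trigger around %.0f MB%n", (5120 - 800) * 0.75);
         }
     }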
 
 These values seem reasonable and in line with what I was seeing.  There are 
 other CF and apps sharing this cluster but this one was the largest.  
 
 
   
 
 You can confirm these guesses somewhat manually by enabling all the GC 
 logging in cassandra-env.sh. Restart the node and let it operate normally, 
 probably best to keep repair off.
 
 
 
 I was using jstat to monitor gc activity and some snippets from that are in 
 my original email in this thread.  The key behavior was that full gc was 
 running pretty often and never able to reclaim much (if any) space.
 
 
  
 
 There are a few things you could try:
 
 * increase the JVM heap by say 1Gb and see how it goes
 * increase the bloom filter false positive chance; try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
  
 * increase index_interval sampling in yaml.  
 * decreasing compaction_throughput and in_memory_compaction_limit can lessen 
 the additional memory pressure compaction adds. 
 * disable caches or ensure off heap caches are used.
 
 I've done several of these already in addition to changing the app to reduce 
 the number of rows retained.  How does compaction_throughput relate to memory 
 usage?  I assumed that was more for IO tuning.  I noticed that lowering 
 concurrent_compactors to 4 (from default of 8) lowered the memory used during 
 compactions.  in_memory_compaction_limit_in_mb seems to only be used for wide 
 rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. 
  My multithreaded_compaction is still false.
 
  
 
 Watching the gc logs and the cassandra log is a great way to get a feel for 
 what works in your situation. Also take note of any scheduled processing your 
 app does which may impact things, and look for poorly performing queries. 
 
 Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 
 
 For my understanding, what was the average row size for the 400 million keys? 
 
 
 
 The compacted row mean size for the CF is 8815 bytes (as reported by cfstats), but that comes out to be much larger than the real load per node I was seeing.  
 Each node had about 200GB of data for the CF with 4 nodes in the cluster and 
 RF=3.  At the time, the TTL for all columns was 3 days and gc_grace_seconds 
 was 5 days.  Since then I've reduced the TTL to 1 hour and set 
 gc_grace_seconds to 0 so the number of rows and data dropped to a level it 
 can handle.
 
 
 -Bryan
 



Re: What does ReadRepair exactly do?

2012-10-26 Thread aaron morton
 replicas but to ensure we read at least one newest value as long as write
 quorum succeeded beforehand and W+R > N.
 
This is correct.
It's not that a quorum of nodes agree, it's that a quorum of nodes participate. 
If a quorum participates in both the write and the read you are guaranteed that one 
node was involved in both. The Wikipedia definition helps here: "A quorum is the 
minimum number of members of a deliberative assembly necessary to conduct the 
business of that group" (http://en.wikipedia.org/wiki/Quorum).  

It's a two-step process: first, do we have enough people to make a decision? 
Second, following the rules, what was the decision?
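
(A minimal sketch of the overlap argument, with made-up node numbers; it just shows why, when W + R > N, at least one node must take part in both the write and the read:)

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: with N replicas, a write set of size W and a read set of
    // size R must overlap in at least one node whenever W + R > N (pigeonhole).
    public class QuorumOverlap {
        public static void main(String[] args) {
            int n = 3, w = 2, r = 2;  // QUORUM writes and reads with RF=3
            Set<Integer> writeSet = new HashSet<>(Arrays.asList(1, 2)); // nodes that acked the write
            Set<Integer> readSet  = new HashSet<>(Arrays.asList(2, 3)); // nodes that took part in the read

            Set<Integer> overlap = new HashSet<>(writeSet);
            overlap.retainAll(readSet);

            System.out.println("w + r > n: " + (w + r > n)); // true
            System.out.println("overlap: " + overlap);       // [2] - the node seen by both
        }
    }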

In C* the rule is to use the value with the highest timestamp, not the value 
with the highest number of votes. The red boxes on this slide are the 
winning values: 
http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67  
(thinking one of my slides in that deck may have been misleading in the past). 
In Riak the rule is to use Vector Clocks. 

So 
 I agree that returning val4 is the right thing to do if quorum (two) nodes
 among (node1,node2,node3) have the val4
is incorrect.
We return the value with the highest timestamp from the nodes 
involved in the read. Only one of them needs to have val4. 
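
(A minimal sketch of that resolution rule, with hypothetical value/timestamp pairs; the winner is simply the highest timestamp among the replies, no matter how many replicas hold it:)

    import java.util.Comparator;
    import java.util.List;

    // Illustrative last-write-wins resolution: the highest timestamp wins,
    // even if only one of the replicas read has that value.
    public class TimestampResolve {
        record Versioned(String value, long timestamp) {}

        static Versioned resolve(List<Versioned> replies) {
            return replies.stream()
                    .max(Comparator.comparingLong(Versioned::timestamp))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            // Two replicas answer a QUORUM read; only one of them has val4.
            List<Versioned> replies = List.of(
                    new Versioned("val3", 3),
                    new Versioned("val4", 4));
            System.out.println(resolve(replies).value()); // val4
        }
    }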

 The heart of the problem
 here is that the coordinator responds to a client request assuming that
 the consistency has been achieved the moment it issues a row repair with the
 super-set of the resolved value; without receiving acknowledgement on the
 success of a repair from the replicas for a given consistency constraint. 
and
 My intuition behind saying this is because we
 would respond to the client without the replicas having confirmed their
 meeting the consistency requirement.

It is not necessary for the coordinator to wait. 

Consider an example: the app has stopped writing to the cluster, and for a certain 
column nodes 1, 2 and 3 have value:timestamp of bar:2, bar:2 and foo:1 
respectively. The last write was a successful CL QUORUM write of bar with 
timestamp 2. However node 3 did not acknowledge this write for some reason. 

To make it interesting, say the commit log volume on node 3 is full. Mutations are 
blocking in the commit log queue, so any write on node 3 will time out and fail, 
but reads are still working. We could imagine this is why node 3 did not commit 
bar:2. 

Some read examples (RR is not active):

1) Client reads from node 4 (a non-replica) with CL QUORUM, request goes to 
nodes 1 and 2. Both agree on bar as the value. 
2) Client reads from node 3 with CL QUORUM, request is processed locally and on 
node 2.
* There is a digest mismatch
* A Row Repair read runs against nodes 2 and 3.
* The super set resolves to bar:2
* Node 3 (the coordinator) queues a delta write locally to write bar:2. 
No other delta writes are sent.
* Node 3 returns bar:2 to the client
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and 
bar:2 is returned. 
4) Client reads from node 2 at CL QUORUM, read goes to nodes 2 and 3. Roughly the 
same thing as (2) happens and bar:2 is returned. 
5) Client reads from node 1 at CL ONE. Read happens locally only and returns 
bar:2.
6) Client reads from node 3 at CL ONE. Read happens locally only and returns 
foo:1.

So:
* A read at CL QUORUM will always return bar:2 even if node 3 only has foo:1 on 
disk. 
* A read at CL ONE will return no value or any previous write.

The delta write from the Row Repair goes to a single node, so R + W > N cannot 
be applied to it. It can almost be thought of as an internal implementation detail. The delta 
writes from a Digest Mismatch, HH writes, full RR writes and nodetool repair are 
used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write. 

They are not used to ensure strong consistency when R + W > N. You could turn 
those things off and R + W > N would still work. 
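
(To tie the examples together, here is a minimal, self-contained sketch of the read path described above: replicas hold value/timestamp pairs, a read resolves to the highest timestamp among the replicas it contacts, and a coordinator that is behind applies the winning value locally. The node numbers and values mirror the example; everything else is a simplification, not Cassandra's actual code.)

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy model of the example above: bar:2 on nodes 1 and 2, foo:1 on node 3.
    public class ReadPathSketch {
        record Cell(String value, long timestamp) {}

        static Map<Integer, Cell> nodes = new HashMap<>(Map.of(
                1, new Cell("bar", 2),
                2, new Cell("bar", 2),
                3, new Cell("foo", 1)));

        static Cell read(int coordinator, List<Integer> contacted) {
            Cell winner = contacted.stream()
                    .map(nodes::get)
                    .max(Comparator.comparingLong(Cell::timestamp))
                    .orElseThrow();
            // Digest mismatch: if the coordinator is one of the contacted replicas
            // and is behind, it queues a delta write locally. Other stale replicas
            // are not repaired by this read.
            if (contacted.contains(coordinator)
                    && nodes.get(coordinator).timestamp() < winner.timestamp()) {
                nodes.put(coordinator, winner);
            }
            return winner;
        }

        public static void main(String[] args) {
            // (6) CL ONE read on node 3 alone returns the stale foo:1.
            System.out.println(read(3, List.of(3)).value());    // foo
            // (2) CL QUORUM read coordinated by node 3, contacting nodes 2 and 3:
            //     resolves to bar:2 and node 3 repairs itself via the delta write.
            System.out.println(read(3, List.of(2, 3)).value()); // bar
        }
    }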
 
Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn shankarp...@gmail.com wrote:

 manuzhang wrote
 read quorum doesn't mean we read newest values from a quorum number of
 replicas but to ensure we read at least one newest value as long as write
 quorum succeeded beforehand and W+R > N.
 
 I beg to differ here. Any read/write, by definition of quorum, should have
 at least n/2 + 1 replicas that agree on that read/write value. Responding to
 the user with a newer value, even if the write creating the new value hasn't
 completed, cannot guarantee any read consistency > 1. 
 
 
 Hiller, Dean wrote
 Kind of an interesting question
 
 I think you are saying if a client read resolved only the two nodes as
 said in Aaron's email back to the client and read -repair was kicked off
 because of the inconsistent values and the write did not complete yet and
 I guess you would have two nodes go down to lose the value right after
 the