Re: how to stop hinted handoff

2012-10-25 Thread Tamar Fraenkel
Thanks, that did the trick!

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Thu, Oct 11, 2012 at 3:42 AM, Roshan codeva...@gmail.com wrote:

 Hello

 You can delete the hints from JConsole by using the HintedHandOffManager MBean.

 Thanks.






 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/how-to-stop-hinted-handoff-tp7583060p7583086.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.


Re: What does ReadRepair exactly do?

2012-10-25 Thread aaron morton
It's important to point out the difference between Read Repair, in the context of 
the read_repair_chance setting, and Consistent Reads in the context of the CL 
setting. 

If RR is active on a request it means the request is sent to ALL UP nodes for 
the key and the RR process is ASYNC to the request. If all of the nodes 
involved in the request return to the coordinator before rpc_timeout, 
ReadCallback.maybeResolveForRepair() will put a repair task into the 
READ_REPAIR stage. This will compare the values and IF there is a 
DigestMismatch it will start a Row Repair read that reads the data from all 
nodes and MAY result in differences being detected and fixed. 

All of this is outside of the processing of your read request. It is separate 
from the stuff below.

Inside the user read request when ReadCallback.get() is called and CL nodes 
have responded the responses are compared. If a DigestMismatch happens then a 
Row Repair read is started, the result of this read is returned to the user. 
This Row Repair read MAY detect differences, if it does it resolves the super 
set, sends the delta to the replicas and returns the super set value to be 
returned to the client. 
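That resolve-and-repair step can be sketched as a toy model (hypothetical names and a single value-per-key; the real resolver works per column with timestamps and tombstones):

```python
def resolve_mismatch(responses):
    """Toy model: pick the most recent (value, timestamp) pair as the
    super set, and list the replicas that need the delta written back."""
    # responses: {replica: (value, timestamp)}
    superset = max(responses.values(), key=lambda vt: vt[1])
    stale = [node for node, vt in sorted(responses.items()) if vt[1] < superset[1]]
    return superset[0], stale

# A digest mismatch between two replicas: node1 holds the older value.
value, needs_repair = resolve_mismatch({"node1": ("val1", 10),
                                        "node2": ("val2", 20)})
```

The coordinator returns the super set value to the client and queues delta writes to the stale replicas.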

 I'm still missing, how read repairs behave. Just extending your example for
 the following case: 
The example does not use Read Repair, it is handled by Consistent Reads. 

The purpose of RR is to reduce the probability that a read in the future using 
any of the replicas will result in a Digest Mismatch. Any of the replicas 
means ones that were not necessary for this specific read request. 

 2. You do a write operation (W1) with quorum of val=2
 node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
If the write has not completed then it is not a successful write at the 
specified CL as it could fail now.

Therefore the R + W > N Strong Consistency guarantee does not apply at this exact 
point in time. A read to the cluster at this exact point in time using QUORUM 
may return val2 or val1. Again the operation W1 has not completed; if read R' 
starts and completes while W1 is processing it may or may not return the result 
of W1.
 
 In this case, for read R1, the value val2 does not have a quorum. Would read
 R1 return val2 or val4 ? 

If val4 is in the memtable on the node before the second read, the result will be 
val4.  
Writes that happen between the initial read and the second read after a Digest 
Mismatch are included in the read result.

The way I think about consistency is what value do reads see if writes stop:

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful 
reads are guaranteed to see the last write. 
* If you are using a low CL and/or had failed writes at QUORUM then R + W <= 
N. All successful reads will *eventually* see the last value written, and they 
are guaranteed to return the value of a previous write or no value. Eventually 
background Read Repair, Hinted Handoff or nodetool repair will repair the 
inconsistency. 
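The two bullets above reduce to a one-line check (a trivial sketch, not Cassandra code):

```python
def strongly_consistent(r, w, n):
    """True when every read quorum must overlap the last successful write."""
    return r + w > n

def quorum(n):
    """QUORUM for replication factor n: a majority of replicas."""
    return n // 2 + 1

# RF=3: QUORUM + QUORUM (2 + 2 > 3) guarantees overlap; ONE + ONE does not.
rf3_quorum = strongly_consistent(quorum(3), quorum(3), 3)  # True
rf3_one = strongly_consistent(1, 1, 3)                     # False
```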

Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Thanks Zhang. But, this again seems a little strange thing to do, since
 one
 (say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
 read failure while there are still enough number of replicas (R1 and R3)
 live to satisfy a read.
 
 
 He means in the case where all 3 nodes are live… if a node is down,
 naturally it redirects to the other node and still succeeds because it
 found 2 nodes even with one node down(feel free to test this live though
 !)
 
 
 Thanks for the example Dean. This definitely clears things up when you
 have
 an overlap between the read and the write, and one comes after the other.
 I'm still missing, how read repairs behave. Just extending your example
 for
 the following case:
 
 1. node1 = val1 node2 = val1 node3 = val1
 
 2. You do a write operation (W1) with quorum of val=2
 node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
 
 3. Now with a read (R1) from node1 and node2, a read repair will be
 initiated that needs to write val2 on node 1.
 node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not
 complete
 yet)
 
 4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
 now arrives at node 1 but sees a newer value val4.
 node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete,
 read
 repair val2 not complete)
 
 In this case, for read R1, the value val2 does not have a quorum. Would
 read
 R1 return val2 or val4 ?
 
 
 At this point as Manu suggests, you need to look at the code but most
 likely what happens is they lock that row, receive the write in memory(ie.
 Not losing it) and return to client, caching it so as soon as read-repair
 is over, it will write that next value.  Ie. Your client would receive
 val2 and val4 would be the value in the database, right?

Re: constant CMS GC using CPU time

2012-10-25 Thread aaron morton
 Regarding memory usage after a repair ... Are the merkle trees kept around?
 

They should not be.

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss bto...@gmail.com wrote:

 Regarding memory usage after a repair ... Are the merkle trees kept around?
 
 On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote:
 On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote:
 The GC was on-going even when the nodes were not compacting or running a 
 heavy application load -- even when the main app was paused constant the GC 
 continued.
 If you restart a node is the onset of GC activity correlated to some event?
 
 Yes and no.  When the nodes were generally under the .75 occupancy threshold 
 a weekly repair -pr job would cause them to go over the threshold and then 
 stay there even after the repair had completed and there were no ongoing 
 compactions.  It acts as though at least some substantial amount of memory 
 used during repair was never dereferenced once the repair was complete.
 
 Once one CF in particular grew larger the constant GC would start up pretty 
 soon (less than 90 minutes) after a node restart even without a repair.
 
 
  
  
 As a test we dropped the largest CF and the memory usage immediately dropped 
 to acceptable levels and the constant GC stopped.  So it's definitely 
 related to data load.  memtable size is 1 GB, row cache is disabled and key 
 cache is small (default).
 How many keys did the CF have per node? 
 I dismissed the memory used to  hold bloom filters and index sampling. That 
 memory is not considered part of the memtable size, and will end up in the 
 tenured heap. It is generally only a problem with very large key counts per 
 node. 
 
 
 I've changed the app to retain less data for that CF but I think that it was 
 about 400M rows per node.  Row keys are a TimeUUID.  All of the rows are 
 write-once, never updated, and rarely read.  There are no secondary indexes 
 for this particular CF.
 
 
  
  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like 
 the default bloom_filter_fp_chance defaults to 0.0 
 The default should be 0.000744.
 
 If the chance is zero or null this code should run when a new SSTable is 
 written 
   // paranoia -- we've had bugs in the thrift -> avro -> CfDef dance 
 before, let's not let that break things
 logger.error("Bloom filter FP chance of zero isn't supposed 
 to happen");
 
 Were the CF's migrated from an old version ?
 
 
 Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 
 1.1.5 with a upgradesstables run at each upgrade along the way.
 
 I could not find a way to view the current bloom_filter_fp_chance settings 
 when they are at a default value.  JMX reports the actual fp rate and if a 
 specific rate is set for a CF that shows up in describe table but I 
 couldn't find out how to tell what the default was.  I didn't inspect the 
 source.
 
  
 Is there any way to predict how much memory the bloom filters will consume 
 if the size of the row keys, number or rows is known, and fp chance is known?
 
 See o.a.c.utils.BloomFilter.getFilter() in the code 
 This http://hur.st/bloomfilter appears to give similar results. 
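 For a rough prediction without the calculator, the textbook optimal bloom filter sizing formula (which is what such calculators implement; it ignores Java object overhead) can be applied directly:

```python
import math

def bloom_filter_mb(num_keys, fp_chance):
    """Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8 / 1024 / 1024

# ~400M row keys at the 0.000744 default comes out near the ~714 MB
# figure mentioned in this thread.
size = bloom_filter_mb(400_000_000, 0.000744)
```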
 
 
 
 
 Ahh, very helpful.  This indicates that 714MB would be used for the bloom 
 filter for that one CF.
 
 JMX / cfstats reports Bloom Filter Space Used but the MBean method name 
 (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk 
 and in-memory space used is similar then summing up all the Bloom Filter 
 Space Used says they're currently consuming 1-2 GB of the heap which is 
 substantial.
 
 If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?  It 
 just means more trips to SSTable indexes for a read correct?  Trade RAM for 
 time (disk I/O).
 
 -Bryan
 



Re: constant CMS GC using CPU time

2012-10-25 Thread aaron morton
 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.
 
 Yes, I think that was at least part of the issue.

The rough numbers I've used to estimate working set are:

* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are 
just a big array) 714 MB
* memtable size 1024 MB 
* index sampling:
*  24 bytes + key (16 bytes for UUID) = 32 bytes 
* 400M / 128 default sampling = 3,125,000
*  3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475MB to 950MB
* ignoring row cache and key cache
 
So the high side number is 2,213 MB to 2,688 MB. High because the fudge is a delicious 
sticky guess and the memtable space would rarely be full. 
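The tally above can be reproduced mechanically (same assumptions: 400M keys, 32 bytes per index sample, the default index_interval of 128, and the x5/x10 fudge applied to the sampling figure):

```python
rows = 400_000_000
index_interval = 128        # default sampling interval
bytes_per_sample = 32       # per the estimate above

samples = rows // index_interval                         # 3,125,000
sampling_mb = samples * bytes_per_sample / 1024 / 1024   # ~95 MB

bloom_mb = 714      # bloom filter estimate for 400M keys at 0.00074 fp
memtable_mb = 1024  # configured memtable size

low = bloom_mb + memtable_mb + sampling_mb * 5    # ~2,213 MB
high = bloom_mb + memtable_mb + sampling_mb * 10  # ~2,690 MB
```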

On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some goes 
to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the 
working set and how quickly stuff gets tenured, which depends on the workload. 

You can confirm these guesses somewhat manually by enabling all the GC logging 
in cassandra-env.sh. Restart the node and let it operate normally, probably 
best to keep repair off.

This is a sample (partial) GC log before CMS kicks in; note the concurrent 
mark-sweep used size. Your values may differ, this has non-default settings:

Heap after GC invocations=9947 (full 182):
 par new generation   total 1024000K, used 101882K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K,   0% used [0x0006fae0, 0x0006fae0, 
0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 
0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 
0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2309885K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)
}

This is when it starts; see the (full X) count increase:

2012-10-25T03:32:44.664-0500: 76691.891: [GC [1 CMS-initial-mark: 
2309885K(2965504K)] 2411929K(3989504K), 0.0047910 secs] [Times: user=0.01 
sys=0.00, real=0.01 secs] 
Total time for which application threads were stopped: 0.0059850 seconds

(other CMS type logs)

{Heap before GC invocations=9947 (full 183):
 par new generation   total 1024000K, used 921082K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 
0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 
0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 
0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2206292K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)

A couple of log messages later concurrent mark-sweep used size is down:

{Heap before GC invocations=9948 (full 183):
 par new generation   total 1024000K, used 938695K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 
0x00072ce0)
  from space 204800K,  58% used [0x00072ce0, 0x0007342b1f00, 
0x00073960)
  to   space 204800K,   0% used [0x00073960, 0x00073960, 
0x000745e0)
 concurrent mark-sweep generation total 2965504K, used 1096146K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)
2012-10-25T03:32:50.479-0500: 76697.706: [GC Before GC:


There are a few things you could try:

* increase the JVM heap by say 1Gb and see how it goes
* increase the bloom filter false positive chance, try 0.1 first (see 
http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
 
* increase index_interval sampling in the yaml.  
* decreasing compaction_throughput and in_memory_compaction_limit can lessen 
the additional memory pressure compaction adds. 
* disable caches or ensure off-heap caches are used.

Watching the gc logs and the cassandra log is a great way to get a feel for 
what works in your situation. Also take note of any scheduled processing your 
app does which may impact things, and look for poorly performing queries. 
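To spot the pattern above without reading the logs by hand, the tenured ("concurrent mark-sweep") occupancy can be scraped with a few lines (a sketch against the log format shown above; the exact layout varies with JVM version and flags):

```python
import re

CMS_LINE = re.compile(
    r"concurrent mark-sweep generation\s+total\s+(\d+)K,\s+used\s+(\d+)K")

def tenured_occupancy(gc_log_text):
    """Yield tenured-generation occupancy fractions from GC log text."""
    for total_k, used_k in CMS_LINE.findall(gc_log_text):
        yield int(used_k) / int(total_k)

# The first log excerpt above: ~0.78, past the default 0.75 CMS trigger.
sample = " concurrent mark-sweep generation total 2965504K, used 2309885K "
occ = list(tenured_occupancy(sample))
```

Occupancy that stays pinned near the initiating threshold between collections is the signature of the constant-CMS behaviour described in this thread.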

Finally this book is a good reference on Java GC http://amzn.com/0137142528 

For my understanding, what was the average row size for the 400 million keys? 

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 1:22 PM, Bryan Talbot btal...@aeriagames.com wrote:

 On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:
 On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot 

Keeping the record straight for Cassandra Benchmarks...

2012-10-25 Thread Brian O'Neill
People probably saw...
http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html

To clarify things take a look at...
http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Keeping the record straight for Cassandra Benchmarks...

2012-10-25 Thread Edward Capriolo
Yes, another benchmark with 100,000,000 rows on EC2 machines probably
less powerful than my laptop. The benchmark might as well have run 4
vmware instances on the same desktop.


On Thu, Oct 25, 2012 at 7:40 AM, Brian O'Neill b...@alumni.brown.edu wrote:
 People probably saw...
 http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html

 To clarify things take a look at...
 http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42


Re: Java 7 support?

2012-10-25 Thread Edward Capriolo
I am using the Sun JDK. There are only two issues I have found
unrelated to Cassandra.

1) DateFormat is more liberal, e.g. yyyymmDD vs yyyymmdd. If you write an
application with Java 7 the format is forgiving with DD vs dd, yet if
you deploy that application to some JDK 1.6 JVMs it fails.

2) Ran into some issues with timsort()
http://stackoverflow.com/questions/6626437/why-does-my-compare-method-throw-exception-comparison-method-violates-its-gen

Again neither of these manifested in cassandra but did manifest with
other applications.


On Wed, Oct 24, 2012 at 9:14 PM, Andrey V. Panov panov.a...@gmail.com wrote:
 Are you using OpenJDK or Oracle JDK? I know Java should be based on OpenJDK
 since 7, but I'm still not sure.

 On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote:

 We have been using cassandra and java7 for months. No problems. A key
 concept of java is portable binaries. There are sometimes wrinkles with
 upgrades. If you hit one undo the upgrade and restart.




Re: What does ReadRepair exactly do?

2012-10-25 Thread Hiller, Dean
Kind of an interesting question

I think you are saying: a client read resolved only the two nodes, as
said in Aaron's email back to the client, and read-repair was kicked off
because of the inconsistent values while the write had not completed yet; then
two nodes would have to go down to lose the value right after the
read, and before the write was finished, such that the client read a value that
was never stored in the database.  The odds of two nodes going out are
pretty slim though.

Or, what if the node with part of the write went down, as long as the
client stays up, he would complete his write on the other two nodes.
Seems to me as long as two nodes don't fail, you are reading at quorum and
fit with the consistency model since you get a value that will be on two
nodes in the immediate future.

Thanks,
Dean

On 10/25/12 9:45 AM, shankarpnsn shankarp...@gmail.com wrote:

aaron morton wrote
 2. You do a write operation (W1) with quorum of val=2
 node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete
yet)
 If the write has not completed then it is not a successful write at the
 specified CL as it could fail now.
 
 Therefore the R + W > N Strong Consistency guarantee does not apply at
this
 exact point in time. A read to the cluster at this exact point in time
 using QUORUM may return val2 or val1. Again the operation W1 has not
 completed, if read R' starts and completes while W1 is processing it may
 or may not return the result of W1.

I agree completely that it is fair to have this indeterminism in case of
partial/failed/in-flight writes, based on what nodes respond to a
subsequent
read. 


aaron morton wrote
 It's important to point out the difference between Read Repair, in the
 context of the read_repair_chance setting, and Consistent Reads in the
 context of the CL setting. All of this is outside of the processing of
 your read request. It is separate from the stuff below.
 
 Inside the user read request when ReadCallback.get() is called and CL
 nodes have responded the responses are compared. If a DigestMismatch
 happens then a Row Repair read is started, the result of this read is
 returned to the user. This Row Repair read MAY detect differences, if it
 does it resolves the super set, sends the delta to the replicas and
 returns the super set value to be returned to the client.
 
 In this case, for read R1, the value val2 does not have a quorum. Would
 read
 R1 return val2 or val4 ?
 
 If val4 is in the memtable on node before the second read the result
will
 be val4.  
 Writes that happen between the initial read and the second read after a
 Digest Mismatch are included in the read result.

Thanks for clarifying this, Aaron. This is very much in line with what I
figured out from the code and brings me back to my initial question on the
point of when and what the user/client gets to see as the read result. Let
us, for now, consider only the repairs initiated as a part of /consistent
reads/. If the Row Repair (after resolving and sending the deltas to
replicas, but not waiting for a quorum success after the repair) returns
the
super set value immediately to the user, wouldn't it be a breach of the
consistent reads paradigm? My intuition behind saying this is because we
would respond to the client without the replicas having confirmed their
meeting the consistency requirement.

I agree that returning val4 is the right thing to do if quorum (two) nodes
among (node1,node2,node3) have the val4 at the second read after digest
mismatch. But wouldn't it be incorrect to respond to user with any value
when the second read (after mismatch) doesn't find a quorum. So after
sending the deltas to the replicas as a part of the repair (still a part
of
/consistent reads/), shouldn't the value be read again to check for the
presence of a quorum after the repair?

In the example we had, assume the mismatch is detected during a read R1
from
coordinator node C, that reaches node1, node2
State seen by C after first read R1:  node1 = val1, node2 = val 2, node3
=
val1

A second read is initiated as a part of repair for consistent read of R1.
This second read observes the values (val1, val2) from (node1, node2) and
sends the corresponding row repair delta to node1. I'm guessing C cannot
respond back to user with val2 until C knows that node1 has actually
written
the value val2 thereby meeting the quorum. Is this interpretation correct
?






--
View this message in context:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does
-ReadRepair-exactly-do-tp7583261p7583395.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at
Nabble.com.



Large results and network round trips

2012-10-25 Thread Edward Capriolo
Hello all,

Currently we implement wide rows for most of our entities. For example:

user {
 event1=x
 event2=y
 event3=z
 ...
}

Normally the entries are bounded to be less than 256 columns and most
columns are small in size, say 30 bytes. Because of the blind write nature
of Cassandra it is possible the column family can get much larger. We
have very low latency requirements, for example less than 5ms.

Considering network round trips and all other factors I am wondering what
is the largest column that is possible in a 5ms window on a gigabit
network.  First we have our thrift limit of 15MB; is it possible even in
the best case scenario to deliver a 15MB response in under 5ms on
gigabit ethernet, for example? Does anyone have any real world numbers
with reference to packet sizes and standard performance?

Thanks all,
Edward
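For a back-of-envelope answer to the question (a sketch that ignores serialization, TCP overhead and round-trip latency, all of which only make things worse), raw transmission time alone rules out 15 MB in 5 ms on gigabit Ethernet:

```python
def wire_time_ms(payload_bytes, link_bits_per_sec):
    """Minimum time to push a payload onto a link, ignoring all overhead."""
    return payload_bytes * 8 / link_bits_per_sec * 1000

gige = 1_000_000_000  # 1 Gb/s

full_response = wire_time_ms(15 * 1024 * 1024, gige)  # ~126 ms for 15 MB
budget_payload = 5e-3 * gige / 8                      # ~625 KB fits in 5 ms
```

So even the ideal case is roughly 25x over the 5 ms budget; responses would need to stay in the hundreds-of-kilobytes range to have any chance.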


Re: What does ReadRepair exactly do?

2012-10-25 Thread Manu Zhang
read quorum doesn't mean we read the newest values from a quorum number of
replicas but that we ensure we read at least one newest value, as long as a write
quorum succeeded beforehand and W+R > N.

On Fri, Oct 26, 2012 at 12:00 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Kind of an interesting question

 I think you are saying if a client read resolved only the two nodes as
 said in Aaron's email back to the client and read -repair was kicked off
 because of the inconsistent values and the write did not complete yet and
 I guess you would have two nodes go down to lose the value right after the
 read, and before write was finished such that the client read a value that
 was never stored in the database.  The odds of two nodes going out are
 pretty slim though.

 Or, what if the node with part of the write went down, as long as the
 client stays up, he would complete his write on the other two nodes.
 Seems to me as long as two nodes don't fail, you are reading at quorum and
 fit with the consistency model since you get a value that will be on two
 nodes in the immediate future.

 Thanks,
 Dean

 On 10/25/12 9:45 AM, shankarpnsn shankarp...@gmail.com wrote:

 aaron morton wrote
  2. You do a write operation (W1) with quorum of val=2
  node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete
 yet)
  If the write has not completed then it is not a successful write at the
  specified CL as it could fail now.
 
  Therefore the R + W > N Strong Consistency guarantee does not apply at
 this
  exact point in time. A read to the cluster at this exact point in time
  using QUOURM may return val2 or val1. Again the operation W1 has not
  completed, if read R' starts and completes while W1 is processing it may
  or may not return the result of W1.
 
 I agree completely that it is fair to have this indeterminism in case of
 partial/failed/in-flight writes, based on what nodes respond to a
 subsequent
 read.
 
 
 aaron morton wrote
  It's important to point out the difference between Read Repair, in the
  context of the read_repair_chance setting, and Consistent Reads in the
  context of the CL setting. All of this is outside of the processing of
  your read request. It is separate from the stuff below.
 
  Inside the user read request when ReadCallback.get() is called and CL
  nodes have responded the responses are compared. If a DigestMismatch
  happens then a Row Repair read is started, the result of this read is
  returned to the user. This Row Repair read MAY detect differences, if it
  does it resolves the super set, sends the delta to the replicas and
  returns the super set value to be returned to the client.
 
  In this case, for read R1, the value val2 does not have a quorum. Would
  read
  R1 return val2 or val4 ?
 
  If val4 is in the memtable on node before the second read the result
 will
  be val4.
  Writes that happen between the initial read and the second read after a
  Digest Mismatch are included in the read result.
 
 Thanks for clarifying this, Aaron. This is very much in line with what I
 figured out from the code and brings me back to my initial question on the
 point of when and what the user/client gets to see as the read result. Let
 us, for now, consider only the repairs initiated as a part of /consistent
 reads/. If the Row Repair (after resolving and sending the deltas to
 replicas, but not waiting for a quorum success after the repair) returns
 the
 super set value immediately to the user, wouldn't it be a breach of the
 consistent reads paradigm? My intuition behind saying this is because we
 would respond to the client without the replicas having confirmed their
 meeting the consistency requirement.
 
 I agree that returning val4 is the right thing to do if quorum (two) nodes
 among (node1,node2,node3) have the val4 at the second read after digest
 mismatch. But wouldn't it be incorrect to respond to user with any value
 when the second read (after mismatch) doesn't find a quorum. So after
 sending the deltas to the replicas as a part of the repair (still a part
 of
 /consistent reads/), shouldn't the value be read again to check for the
 presence of a quorum after the repair?
 
 In the example we had, assume the mismatch is detected during a read R1
 from
 coordinator node C, that reaches node1, node2
 State seen by C after first read R1:  node1 = val1, node2 = val 2, node3
 =
 val1
 
 A second read is initiated as a part of repair for consistent read of R1.
 This second read observes the values (val1, val2) from (node1, node2) and
 sends the corresponding row repair delta to node1. I'm guessing C cannot
 respond back to user with val2 until C knows that node1 has actually
 written
 the value val2 thereby meeting the quorum. Is this interpretation correct
 ?
 
 
 
 
 
 
 --
 View this message in context:
 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does
 -ReadRepair-exactly-do-tp7583261p7583395.html
 Sent from the cassandra-u...@incubator.apache.org mailing list 

High bandwidth usage between datacenters for cluster

2012-10-25 Thread Bryce Godfrey
We have a 5 node cluster, with a matching 5 nodes for DR in another data 
center.   With a replication factor of 3, does the node I send a write too 
attempt to send it to the 3 servers in the DR also?  Or does it send it to 1 
and let it replicate locally in the DR environment to save bandwidth across the 
WAN?
Normally this isn't an issue for us, but at times we are writing approximately 
1MB a sec of data, and seeing a corresponding 3MB of traffic across the WAN to 
all the Cassandra DR servers.

If my assumptions are right, is this configurable somehow for writing to one 
node and letting it do local replication?  We are on 1.1.5

Thanks


Re: What does ReadRepair exactly do?

2012-10-25 Thread shankarpnsn
manuzhang wrote
 read quorum doesn't mean we read the newest values from a quorum number of
 replicas but that we ensure we read at least one newest value, as long as a write
 quorum succeeded beforehand and W+R > N.

I beg to differ here. Any read/write, by definition of quorum, should have
at least n/2 + 1 replicas that agree on that read/write value. Responding to
the user with a newer value, even if the write creating the new value hasn't
completed, cannot guarantee any read consistency > 1. 


Hiller, Dean wrote
 Kind of an interesting question

 I think you are saying if a client read resolved only the two nodes as
 said in Aaron's email back to the client and read -repair was kicked off
 because of the inconsistent values and the write did not complete yet and
 I guess you would have two nodes go down to lose the value right after
 the
 read, and before write was finished such that the client read a value
 that
 was never stored in the database.  The odds of two nodes going out are
 pretty slim though.
 Thanks,
 Dean

Bingo! I do understand that the odds of a quorum nodes going down are low
and that any subsequent read would achieve a quorum. However, I'm wondering
what would be the right thing to do here, given that the client has
particularly asked for a certain consistency on the read and cassandra
returns a value that doesn't have the consistency. The heart of the problem
here is that the coordinator responds to a client request assuming that
the consistency has been achieved the moment it issues a row repair with the
super-set of the resolved value; without receiving acknowledgement on the
success of a repair from the replicas for a given consistency constraint. 

In order to adhere to the given consistency specification, the row repair
(due to consistent reads) should repeat the read after issuing a
consistency repair to ensure if the consistency is met. Like Manu
mentioned, this could of course lead to a number of repeat reads if the
writes arrive quickly - until the read gets timed out. However, note that we
would still be honoring the consistency constraint for that read. 



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: High bandwidth usage between datacenters for cluster

2012-10-25 Thread Hiller, Dean
Use the datacenter replication strategy and try it with that so you tell 
cassandra all your data centers, racks, etc.

Dean

From: Bryce Godfrey bryce.godf...@azaleos.com
Reply-To: user@cassandra.apache.org
Date: Thursday, October 25, 2012 11:44 AM
To: user@cassandra.apache.org
Subject: High bandwidth usage between datacenters for cluster

We have a 5 node cluster, with a matching 5 nodes for DR in another data 
center.   With a replication factor of 3, does the node I send a write too 
attempt to send it to the 3 servers in the DR also?  Or does it send it to 1 
and let it replicate locally in the DR environment to save bandwidth across the 
WAN?
Normally this isn’t an issue for us, but at times we are writing approximately 
1MB a sec of data, and seeing a corresponding 3MB of traffic across the WAN to 
all the Cassandra DR servers.

If my assumptions are right, is this configurable somehow for writing to one 
node and letting it do local replication?  We are on 1.1.5

Thanks


Re: High bandwidth usage between datacenters for cluster

2012-10-25 Thread sankalp kohli
Use placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy' and also fill in the
cassandra-topology.properties file. This will tell cassandra that you have two DCs.
You can verify that by looking at the output of the ring command.

If your DCs are set up properly, only one request will go over the WAN, though
the responses from all nodes in the other DC will go over the WAN.
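For reference, the topology file mentioned above is conf/cassandra-topology.properties, which is read by PropertyFileSnitch (endpoint_snitch in cassandra.yaml). A minimal sketch for a two-DC setup; all IP addresses, DC names, and rack names below are placeholders:

```properties
# conf/cassandra-topology.properties -- maps node IPs to DC:rack.
# Placeholder addresses; substitute your own nodes.

# Primary data center
10.0.1.1=DC1:RAC1
10.0.1.2=DC1:RAC1

# DR data center
10.0.2.1=DC2:RAC1
10.0.2.2=DC2:RAC1

# Fallback for any node not listed above
default=DC1:RAC1
```

Every node should have an identical copy of this file, and nodetool ring should then show each node in its expected DC and rack.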

On Thu, Oct 25, 2012 at 10:44 AM, Bryce Godfrey bryce.godf...@azaleos.com wrote:

  We have a 5 node cluster, with a matching 5 nodes for DR in another data
 center.   With a replication factor of 3, does the node I send a write to
 attempt to send it to the 3 servers in the DR also?  Or does it send it to
 1 and let it replicate locally in the DR environment to save bandwidth
 across the WAN?

 Normally this isn’t an issue for us, but at times we are writing
 approximately 1MB a sec of data, and seeing a corresponding 3MB of traffic
 across the WAN to all the Cassandra DR servers.


 If my assumptions are right, is this configurable somehow for writing to
 one node and letting it do local replication?  We are on 1.1.5


 Thanks



Re: Large results and network round trips

2012-10-25 Thread sankalp kohli
I dont have any sample data on this, but read latency will depend on these
1) Consistency level of the read
2) Disk speed.

Also you can look at the Netflix client (Astyanax), as it makes the coordinator
node the same as the node which holds that data. This will save one hop.

On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Hello all,

 Currently we implement wide rows for most of our entities. For example:

 user {
  event1=x
  event2=y
  event3=z
  ...
 }

 Normally the entries are bounded to fewer than 256 columns, and most
 columns are small in size, say 30 bytes. Because of the blind-write nature
 of Cassandra it is possible the column family can get much larger. We
 have very low latency requirements, for example less than 5 ms.

 Considering network round trips and all other factors, I am wondering what
 is the largest column that is possible in a 5 ms window on a gigabit
 network.  First we have our Thrift limit of 15 MB; is it possible even in
 the best-case scenario to deliver a 15 MB response in under 5 ms on
 gigabit Ethernet, for example? Does anyone have any real-world numbers
 with reference to packet sizes and standard performance?

 Thanks all,
 Edward
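A quick back-of-envelope check (my numbers, assuming an idealized 1 Gbit/s link with zero protocol, serialization, and queuing overhead, i.e. the best case) suggests the answer is no:

```python
# Idealized wire-time estimate: 1 Gbit/s usable line rate, no protocol or
# serialization overhead, so these are best-case lower bounds.
GIGABIT_BYTES_PER_SEC = 1_000_000_000 / 8  # 125 MB/s

def wire_time_ms(payload_bytes: int) -> float:
    """Minimum milliseconds just to move payload_bytes across the link."""
    return payload_bytes / GIGABIT_BYTES_PER_SEC * 1000.0

# A 15 MB Thrift response needs ~126 ms of pure wire time, ~25x the 5 ms budget.
print(wire_time_ms(15 * 1024 * 1024))  # ~125.8

# Conversely, only ~625 KB fits in a 5 ms window of pure wire time.
print(0.005 * GIGABIT_BYTES_PER_SEC)   # 625000.0
```

So even before counting Thrift serialization, TCP overhead, or Cassandra's own read path, a 15 MB response cannot arrive in 5 ms on gigabit Ethernet; payloads would have to stay well under ~625 KB.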



Re: constant CMS GC using CPU time

2012-10-25 Thread Bryan Talbot
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote:

  This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


 Yes, I think that was at least part of the issue.


 The rough numbers I've used to estimate working set are:

 * bloom filter size for 400M rows at 0.00074 fp without java fudge (they
 are just a big array) 714 MB
 * memtable size 1024 MB
 * index sampling:
 *  24 bytes + key (16 bytes for UUID) = 32 bytes
  * 400M / 128 default sampling = 3,125,000
 *  3,125,000 * 32 = 95 MB
  * java fudge X5 or X10 = 475MB to 950MB
 * ignoring row cache and key cache

 So the high-side number is 2,213 to 2,688 MB. High because the fudge is a
 delicious sticky guess and the memtable space would rarely be full.

 On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some
 goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on
 the working set and how quickly stuff gets tenured, which depends on the
 workload.
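The estimate above can be reproduced with quick arithmetic (same assumptions as the quoted numbers: 400M rows, bloom_filter_fp_chance of 0.00074, 32 bytes per index sample, default index_interval of 128, 1024 MB memtable space; all figures approximate):

```python
import math

rows = 400_000_000

# Bloom filter: optimal size is -ln(p) / (ln 2)^2 bits per element.
fp_chance = 0.00074
bloom_mb = rows * (-math.log(fp_chance) / math.log(2) ** 2) / 8 / 1024 ** 2
print(round(bloom_mb))        # ~715 MB, matching the ~714 MB figure above

# Index samples: 32 bytes each (per the estimate above), one per 128 rows.
sample_mb = rows / 128 * 32 / 1024 ** 2
print(round(sample_mb))       # ~95 MB; with a 5x-10x Java fudge: ~477-954 MB

memtable_mb = 1024
low = bloom_mb + memtable_mb + 5 * sample_mb    # ~2,216 MB
high = bloom_mb + memtable_mb + 10 * sample_mb  # ~2,693 MB

# Tenured space on a 5120 MB heap with an 800 MB new generation:
tenured_mb = 5120 - 800                  # ~4320 MB, minus a bit for perm
print(round(tenured_mb * 0.75))          # ~3240 MB at the 75% CMS threshold
```

The low/high totals land within a few MB of the quoted 2,213 to 2,688 range; the small differences come only from rounding in the intermediate figures.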


These values seem reasonable and in line with what I was seeing.  There are
other CF and apps sharing this cluster but this one was the largest.





 You can confirm these guesses somewhat manually by enabling all the GC
 logging in cassandra-env.sh. Restart the node and let it operate normally,
 probably best to keep repair off.



I was using jstat to monitor gc activity and some snippets from that are in
my original email in this thread.  The key behavior was that full gc was
running pretty often and never able to reclaim much (if any) space.





 There are a few things you could try:

 * increase the JVM heap by say 1Gb and see how it goes
 * increase bloom filter false positive,  try 0.1 first (see
 http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance
 )
 * increase index_interval sampling in yaml.
 * decreasing compaction_throughput and in_memory_compaction_limit can
 lessen the additional memory pressure compaction adds.
 * disable caches or ensure off heap caches are used.
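For the bloom filter suggestion, the per-CF setting can be changed from cassandra-cli (the column family name here is a placeholder). Note that bloom filters are per-SSTable, so existing SSTables keep their old filters until they are rewritten, e.g. by compaction or nodetool scrub/upgradesstables:

```
update column family MyCF with bloom_filter_fp_chance = 0.1;
```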


I've done several of these already in addition to changing the app to
reduce the number of rows retained.  How does compaction_throughput relate
to memory usage?  I assumed that was more for IO tuning.  I noticed that
lowering concurrent_compactors to 4 (from default of 8) lowered the memory
used during compactions.  in_memory_compaction_limit_in_mb seems to only be
used for wide rows, and this CF didn't have any rows wider
than in_memory_compaction_limit_in_mb.  My multithreaded_compaction is
still false.




 Watching the gc logs and the cassandra log is a great way to get a feel
 for what works in your situation. Also take note of any scheduled
 processing your app does which may impact things, and look for poorly
 performing queries.

 Finally this book is a good reference on Java GC
 http://amzn.com/0137142528

 For my understanding what was the average row size for the 400 million
 keys ?



The compacted row mean size for the CF is 8815 bytes (as reported by cfstats)
but that comes out to be much larger than the real load per node I was seeing.
 Each node had about 200GB of data for the CF with 4 nodes in the cluster
and RF=3.  At the time, the TTL for all columns was 3 days and
gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and
set gc_grace_seconds to 0 so the number of rows and data dropped to a level
it can handle.
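That mismatch checks out with quick arithmetic (all figures approximate, taken from this thread):

```python
# 4 nodes x ~200 GB each, RF=3, ~400M rows.
nodes, per_node_gb, rf, rows = 4, 200, 3, 400_000_000

unique_gb = nodes * per_node_gb / rf               # ~267 GB of unique data
avg_row_bytes = unique_gb * 1024 ** 3 / rows       # ~716 bytes per row
print(round(avg_row_bytes))
```

That is roughly 716 bytes of unique on-disk data per row, well below the 8,815-byte compacted mean, consistent with the observation that the cfstats mean overstates the real per-row load.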


-Bryan


Re: Large results and network round trips

2012-10-25 Thread Edward Capriolo
For this scenario, remove disk speed from the equation. Assume the row
is completely in the row cache, and also let's assume Read.ONE. With this
information I would be looking to determine response size / maximum
requests per second / max latency.

I would use this to say: You want to do 5,000 reads/sec, on gigabit
ethernet, and each row is 10K, in under 5 ms latency? Sorry, that is
impossible.




On Thu, Oct 25, 2012 at 2:58 PM, sankalp kohli kohlisank...@gmail.com wrote:
 I dont have any sample data on this, but read latency will depend on these
 1) Consistency level of the read
 2) Disk speed.

 Also you can look at the Netflix client as it makes the co-ordinator node
 same as the node which holds that data. This will reduce one hop.

 On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Hello all,

 Currently we implement wide rows for most of our entities. For example:

 user {
  event1=x
  event2=y
  event3=z
  ...
 }

  Normally the entries are bounded to fewer than 256 columns, and most
  columns are small in size, say 30 bytes. Because of the blind-write nature
  of Cassandra it is possible the column family can get much larger. We
  have very low latency requirements, for example less than 5 ms.

  Considering network round trips and all other factors, I am wondering what
  is the largest column that is possible in a 5 ms window on a gigabit
  network.  First we have our Thrift limit of 15 MB; is it possible even in
  the best-case scenario to deliver a 15 MB response in under 5 ms on
  gigabit Ethernet, for example? Does anyone have any real-world numbers
  with reference to packet sizes and standard performance?

 Thanks all,
 Edward