Re: how to stop hinted handoff
Thanks, that did the trick!

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com

On Thu, Oct 11, 2012 at 3:42 AM, Roshan codeva...@gmail.com wrote:

> Hello. You can delete the hints from JConsole by using the HintedHandOffManager MBean. Thanks.
Re: What does ReadRepair exactly do?
It's important to point out the difference between Read Repair, in the context of the read_repair_chance setting, and Consistent Reads, in the context of the CL setting.

If RR is active on a request, the request is sent to ALL UP nodes for the key and the RR process is ASYNC to the request. If all of the nodes involved in the request return to the coordinator before rpc_timeout, ReadCallback.maybeResolveForRepair() will put a repair task into the READ_REPAIR stage. This compares the values and, IF there is a DigestMismatch, starts a Row Repair read that reads the data from all nodes and MAY result in differences being detected and fixed. All of this is outside the processing of your read request. It is separate from the stuff below.

Inside the user read request, when ReadCallback.get() is called and CL nodes have responded, the responses are compared. If a DigestMismatch happens then a Row Repair read is started, and the result of this read is returned to the user. This Row Repair read MAY detect differences; if it does, it resolves the super set, sends the delta to the replicas, and returns the super set value to the client.

> I'm still missing how read repairs behave. Just extending your example for the following case:

The example does not use Read Repair; it is handled by Consistent Reads. The purpose of RR is to reduce the probability that a read in the future using any of the replicas will result in a DigestMismatch. "Any of the replicas" means ones that were not necessary for this specific read request.

> 2. You do a write operation (W1) with quorum of val2
> node1 = val1; node2 = val2; node3 = val1 (write val2 is not complete yet)

If the write has not completed then it is not a successful write at the specified CL, as it could still fail. Therefore the R + W > N strong-consistency guarantee does not apply at this exact point in time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1.
Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing, it may or may not return the result of W1.

> In this case, for read R1, the value val2 does not have a quorum. Would read R1 return val2 or val4?

If val4 is in the memtable on the node before the second read, the result will be val4. Writes that happen between the initial read and the second read after a DigestMismatch are included in the read result.

The way I think about consistency is what value reads see if writes stop:

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful reads are guaranteed to see the last write.
* If you are using a low CL and/or had failed writes at QUORUM, then R + W <= N. All successful reads will *eventually* see the last value written, and they are guaranteed to return the value of a previous write or no value. Eventually background Read Repair, Hinted Handoff or nodetool repair will repair the inconsistency.

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

> Thanks Zhang. But this again seems a little strange thing to do, since one (say R2) of the 2 close replicas (say R1, R2) might be down, resulting in a read failure while there are still enough replicas (R1 and R3) live to satisfy a read.

He means in the case where all 3 nodes are live… if a node is down, it naturally redirects to the other node and still succeeds because it found 2 nodes even with one node down (feel free to test this live though!).

> Thanks for the example Dean. This definitely clears things up when you have an overlap between the read and the write, and one comes after the other. I'm still missing how read repairs behave. Just extending your example for the following case:
> 1. node1 = val1; node2 = val1; node3 = val1
> 2.
> You do a write operation (W1) with quorum of val2:
> node1 = val1; node2 = val2; node3 = val1 (write val2 is not complete yet)
> 3. Now with a read (R1) from node1 and node2, a read repair will be initiated that needs to write val2 on node1.
> node1 = val1; node2 = val2; node3 = val1 (read repair val2 is not complete yet)
> 4. Say, in the meanwhile node1 receives a write val4; the read repair for R1 now arrives at node1 but sees a newer value val4.
> node1 = val4; node2 = val2; node3 = val1 (write val4 is not complete, read repair val2 not complete)
> In this case, for read R1, the value val2 does not have a quorum. Would read R1 return val2 or val4?

At this point, as Manu suggests, you need to look at the code, but most likely what happens is they lock that row, receive the write in memory (i.e. not losing it), return to the client, and cache it so that as soon as the read repair is over, it will write that next value. I.e. your client would receive val2, and val4 would be the value in the database right after.
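The R + W > N overlap argument that runs through this thread can be checked exhaustively for small clusters. A toy sketch, not Cassandra code (all names here are made up for illustration): with N replicas, any write set of size W and read set of size R with R + W > N must share at least one replica, so a quorum read always touches at least one replica holding the latest successful quorum write.

```python
# Toy model of quorum overlap -- not Cassandra code.
from itertools import combinations

def overlap_guaranteed(n, w, r):
    """True if every W-sized write set intersects every R-sized read set."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# N=3 with QUORUM writes (W=2) and QUORUM reads (R=2): 2 + 2 > 3, always overlap.
print(overlap_guaranteed(3, 2, 2))   # True
# N=3 with W=1 (e.g. a CL.ONE write) and R=2: 1 + 2 = 3, overlap not guaranteed.
print(overlap_guaranteed(3, 1, 2))   # False
```

This is why an incomplete QUORUM write (only one replica updated so far) voids the guarantee: at that instant the effective W is 1.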
Re: constant CMS GC using CPU time
> Regarding memory usage after a repair ... are the merkle trees kept around?

They should not be.

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss bto...@gmail.com wrote:

> Regarding memory usage after a repair ... are the merkle trees kept around?

On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote:

On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote:

>> The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued.
>
> If you restart a node, is the onset of GC activity correlated to some event?

Yes and no. When the nodes were generally under the .75 occupancy threshold, a weekly repair -pr job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions. It acts as though at least some substantial amount of memory used during the repair was never dereferenced once the repair was complete.

Once one CF in particular grew larger, the constant GC would start up pretty soon (less than 90 minutes) after a node restart, even without a repair. As a test we dropped the largest CF, and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load.

>> memtable size is 1 GB, row cache is disabled and key cache is small (default).
>
> How many keys did the CF have per node? I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node.

I've changed the app to retain less data for that CF, but I think that it was about 400M rows per node. Row keys are a TimeUUID. All of the rows are write-once, never updated, and rarely read. There are no secondary indexes for this particular CF.
> They were 2+ GB (as reported by nodetool cfstats anyway). It looks like bloom_filter_fp_chance defaults to 0.0.

The default should be 0.000744. If the chance is zero or null, this code should run when a new SSTable is written:

// paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
logger.error("Bloom filter FP chance of zero isn't supposed to happen");

Were the CFs migrated from an old version?

Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5, with an upgradesstables run at each upgrade along the way. I could not find a way to view the current bloom_filter_fp_chance setting when it is at a default value. JMX reports the actual fp rate, and if a specific rate is set for a CF it shows up in describe table, but I couldn't find out how to tell what the default was. I didn't inspect the source.

> Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, the number of rows, and the fp chance are known?

See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results.

Ahh, very helpful. This indicates that 714 MB would be used for the bloom filter for that one CF. JMX / cfstats reports "Bloom Filter Space Used", but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used are similar, then summing up all the "Bloom Filter Space Used" says they're currently consuming 1-2 GB of the heap, which is substantial.

If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0? It just means more trips to SSTable indexes for a read, correct? Trade RAM for time (disk I/O).

-Bryan
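As a rough cross-check of the 714 MB figure, the standard Bloom filter sizing formula (the same one the hur.st calculator uses) can be computed directly. This is a sketch of the textbook approximation, not Cassandra's exact allocation:

```python
import math

def bloom_filter_bytes(n_keys, fp_chance):
    """Textbook Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

# 400M row keys at the default fp chance of 0.000744:
mb = bloom_filter_bytes(400_000_000, 0.000744) / (1024 ** 2)
print(round(mb))  # prints 715 -- in line with the ~714 MB figure above
```

The formula also makes the RAM/disk trade-off explicit: raising fp_chance from 0.000744 to 0.1 cuts the bit count by roughly a factor of three per the ln(p) term, at the cost of more index trips on reads.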
Re: constant CMS GC using CPU time
> This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash.

Yes, I think that was at least part of the issue. The rough numbers I've used to estimate working set are:

* bloom filter size for 400M rows at 0.00074 fp, without java fudge (they are just a big array): 714 MB
* memtable size: 1024 MB
* index sampling:
  * 24 bytes + key (16 bytes for UUID) = 32 bytes
  * 400M / 128 default sampling = 3,125,000 entries
  * 3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475 MB to 950 MB
* ignoring row cache and key cache

So the high-side number is 2,213 to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full.

On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some goes to perm), and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.

You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; it's probably best to keep repair off. This is a sample (partial) GC log from before CMS kicks in; note the concurrent mark-sweep used size.
Your values may differ; this has non-default settings:

Heap after GC invocations=9947 (full 182):
 par new generation total 1024000K, used 101882K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K, 0% used [0x0006fae0, 0x0006fae0, 0x00072ce0)
  from space 204800K, 49% used [0x00073960, 0x00073f97eaf8, 0x000745e0)
  to space 204800K, 0% used [0x00072ce0, 0x00072ce0, 0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2309885K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 0x0007fd329000, 0x0008)
}

This is when it starts; see the (full X) count increase:

2012-10-25T03:32:44.664-0500: 76691.891: [GC [1 CMS-initial-mark: 2309885K(2965504K)] 2411929K(3989504K), 0.0047910 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
Total time for which application threads were stopped: 0.0059850 seconds

(other CMS type logs)

{Heap before GC invocations=9947 (full 183):
 par new generation total 1024000K, used 921082K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 0x00072ce0)
  from space 204800K, 49% used [0x00073960, 0x00073f97eaf8, 0x000745e0)
  to space 204800K, 0% used [0x00072ce0, 0x00072ce0, 0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2206292K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 0x0007fd329000, 0x0008)

A couple of log messages later, the concurrent mark-sweep used size is down:

{Heap before GC invocations=9948 (full 183):
 par new generation total 1024000K, used 938695K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 0x00072ce0)
  from space 204800K, 58% used [0x00072ce0, 0x0007342b1f00, 0x00073960)
  to space 204800K, 0% used [0x00073960, 0x00073960, 0x000745e0)
 concurrent mark-sweep generation total 2965504K, used 1096146K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0,
0x0007fd329000, 0x0008)
2012-10-25T03:32:50.479-0500: 76697.706: [GC Before GC:

There are a few things you could try:

* increase the JVM heap by, say, 1 GB and see how it goes
* increase the bloom filter false positive chance; try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
* increase index_interval sampling in the yaml
* decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds
* disable caches or ensure off-heap caches are used

Watching the GC logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. Finally, this book is a good reference on Java GC: http://amzn.com/0137142528

For my understanding, what was the average row size for the 400 million keys?

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 1:22 PM, Bryan Talbot btal...@aeriagames.com wrote: On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote: On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot
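The working-set arithmetic in the estimate above can be written out in a few lines. A sketch only: the 32 bytes per index sample entry and the 5x/10x "java fudge" multiplier are, as stated, rough guesses:

```python
# Back-of-envelope heap working-set estimate mirroring the numbers above.
BF_MB       = 714    # bloom filters: 400M rows at 0.00074 fp
MEMTABLE_MB = 1024   # total memtable space

rows, sample_interval, entry_bytes = 400_000_000, 128, 32  # ~32 B/sampled key
index_mb = rows // sample_interval * entry_bytes / (1024 ** 2)  # ~95 MB

for fudge in (5, 10):
    total = BF_MB + MEMTABLE_MB + index_mb * fudge
    print(round(total))  # prints 2215 then 2692, bracketing the 2,213-2,688 range
```

Against the ~3,225 MB of usable tenured space on a 5120 MB heap, the high-side estimate leaves little headroom once repair or compaction adds pressure, which matches the observed CMS thrash.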
Keeping the record straight for Cassandra Benchmarks...
People probably saw...
http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html

To clarify things, take a look at...
http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html

-brian

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
Re: Keeping the record straight for Cassandra Benchmarks...
Yes, another benchmark with 100,000,000 rows on EC2 machines probably less powerful than my laptop. The benchmark might as well have run 4 VMware instances on the same desktop.

On Thu, Oct 25, 2012 at 7:40 AM, Brian O'Neill b...@alumni.brown.edu wrote:

> People probably saw...
> http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html
> To clarify things, take a look at...
> http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html
Re: Java 7 support?
I am using the Sun JDK. There are only two issues I have found, both unrelated to Cassandra.

1) DateFormat is more liberal: Java 7's format parsing is forgiving with DD vs dd. If you write an application against Java 7, then deploy it to some JDK 1.6 JVMs, it fails.

2) Ran into some issues with TimSort: http://stackoverflow.com/questions/6626437/why-does-my-compare-method-throw-exception-comparison-method-violates-its-gen

Again, neither of these manifested in Cassandra, but they did manifest with other applications.

On Wed, Oct 24, 2012 at 9:14 PM, Andrey V. Panov panov.a...@gmail.com wrote:

> Are you using OpenJDK or Oracle JDK? I know java7 should be based on openJDK since 7, but still not sure.

On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote:

> We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one, undo the upgrade and restart.
Re: What does ReadRepair exactly do?
Kind of an interesting question. I think you are saying: a client read resolved only the two nodes (as said in Aaron's email) back to the client, read-repair was kicked off because of the inconsistent values, and the write had not completed yet; then I guess you would need two nodes to go down, right after the read and before the write finished, to lose the value, such that the client read a value that was never stored in the database. The odds of two nodes going out are pretty slim though. Or, what if the node with part of the write went down: as long as the client stays up, he would complete his write on the other two nodes.

Seems to me as long as two nodes don't fail, you are reading at quorum and it fits the consistency model, since you get a value that will be on two nodes in the immediate future.

Thanks,
Dean

On 10/25/12 9:45 AM, shankarpnsn shankarp...@gmail.com wrote:

aaron morton wrote:
> 2. You do a write operation (W1) with quorum of val2
> node1 = val1; node2 = val2; node3 = val1 (write val2 is not complete yet)
>
> If the write has not completed then it is not a successful write at the specified CL, as it could still fail. Therefore the R + W > N strong-consistency guarantee does not apply at this exact point in time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1.
>
> Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing, it may or may not return the result of W1.

I agree completely that it is fair to have this indeterminism in the case of partial/failed/in-flight writes, based on which nodes respond to a subsequent read.

aaron morton wrote:
> It's important to point out the difference between Read Repair, in the context of the read_repair_chance setting, and Consistent Reads in the context of the CL setting. All of this is outside of the processing of your read request. It is separate from the stuff below.
> Inside the user read request, when ReadCallback.get() is called and CL nodes have responded, the responses are compared. If a DigestMismatch happens then a Row Repair read is started, and the result of this read is returned to the user. This Row Repair read MAY detect differences; if it does, it resolves the super set, sends the delta to the replicas, and returns the super set value to the client.
>
> In this case, for read R1, the value val2 does not have a quorum. Would read R1 return val2 or val4?
>
> If val4 is in the memtable on the node before the second read, the result will be val4. Writes that happen between the initial read and the second read after a DigestMismatch are included in the read result.

Thanks for clarifying this, Aaron. This is very much in line with what I figured out from the code, and it brings me back to my initial question on when and what the user/client gets to see as the read result.

Let us, for now, consider only the repairs initiated as a part of consistent reads. If the Row Repair (after resolving and sending the deltas to replicas, but not waiting for quorum success after the repair) returns the super set value immediately to the user, wouldn't it be a breach of the consistent-reads paradigm? My intuition behind saying this is that we would respond to the client without the replicas having confirmed that they meet the consistency requirement.

I agree that returning val4 is the right thing to do if quorum (two) nodes among (node1, node2, node3) have val4 at the second read after the digest mismatch. But wouldn't it be incorrect to respond to the user with any value when the second read (after the mismatch) doesn't find a quorum? So after sending the deltas to the replicas as a part of the repair (still a part of consistent reads), shouldn't the value be read again to check for the presence of a quorum after the repair?
In the example we had, assume the mismatch is detected during a read R1 from coordinator node C that reaches node1 and node2.

State seen by C after first read R1: node1 = val1, node2 = val2, node3 = val1

A second read is initiated as a part of the repair for the consistent read of R1. This second read observes the values (val1, val2) from (node1, node2) and sends the corresponding row repair delta to node1. I'm guessing C cannot respond back to the user with val2 until C knows that node1 has actually written the value val2, thereby meeting the quorum. Is this interpretation correct?
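For readers trying to follow the flow being debated, here is a deliberately simplified toy model of the coordinator behaviour Aaron describes (not the actual StorageProxy/ReadCallback code; node names, tuples, and the timestamp-wins rule are illustrative): on a digest mismatch, the coordinator resolves the most recent value among the CL responses, fires the repair deltas, and returns the resolved value without waiting for repair acknowledgements.

```python
# Toy sketch of the consistent-read path -- not Cassandra code.
# Each replica holds a (timestamp, value) pair; higher timestamp wins.
def coordinator_read(replicas, cl_nodes):
    responses = {n: replicas[n] for n in cl_nodes}
    if len(set(responses.values())) > 1:      # DigestMismatch among CL nodes
        winner = max(responses.values())      # resolve the most recent value
        for n in cl_nodes:
            if replicas[n] != winner:
                replicas[n] = winner          # send repair delta (no ack awaited)
        return winner[1]                      # returned to client immediately
    return next(iter(responses.values()))[1]

replicas = {"node1": (1, "val1"), "node2": (2, "val2"), "node3": (1, "val1")}
print(coordinator_read(replicas, ["node1", "node2"]))  # prints val2
print(replicas["node1"])  # (2, 'val2') -- the delta was applied
```

The question in this thread is whether the `return winner[1]` should instead wait for the repaired replicas to acknowledge before responding.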
Large results and network round trips
Hello all,

Currently we implement wide rows for most of our entities. For example:

user { event1=x event2=y event3=z ... }

Normally the entries are bounded to be less than 256 columns, and most columns are small in size, say 30 bytes. Because of the blind-write nature of Cassandra, it is possible for the column family to get much larger. We have very low latency requirements, say less than 5 ms. Considering network round trips and all other factors, I am wondering what is the largest column that can be delivered in a 5 ms window on a gigabit network.

First, we have our Thrift limit of 15 MB. Is it possible, even in the best-case scenario, to deliver a 15 MB response in under 5 ms on gigabit ethernet, for example? Does anyone have any real-world numbers with reference to payload sizes and standard performance?

Thanks all,
Edward
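A quick back-of-envelope on the wire time alone suggests the answer is no. This sketch counts only serialization delay at the line rate and ignores Thrift encoding, disk reads, and coordinator hops, all of which only make the real number worse:

```python
# Best-case transmission time for a 15 MB response on gigabit ethernet.
payload_bits = 15 * 1024 * 1024 * 8   # 15 MB Thrift-limit response
line_rate    = 1_000_000_000          # 1 Gbit/s
ms = payload_bits / line_rate * 1000
print(round(ms))  # prints 126 -- ~126 ms of wire time alone, far beyond 5 ms
```

Working backwards, a 5 ms budget allows at most about 625 KB of payload at full line rate, before any other overhead, so the 256-column / ~30-byte rows are comfortably inside the budget while a 15 MB row cannot be.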
Re: What does ReadRepair exactly do?
Read quorum doesn't mean we read the newest values from a quorum number of replicas; it ensures we read at least one newest value, as long as a write quorum succeeded beforehand and W + R > N.

On Fri, Oct 26, 2012 at 12:00 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

> Kind of an interesting question. ... Seems to me as long as two nodes don't fail, you are reading at quorum and it fits the consistency model, since you get a value that will be on two nodes in the immediate future.
High bandwidth usage between datacenters for cluster
We have a 5-node cluster, with a matching 5 nodes for DR in another data center. With a replication factor of 3, does the node I send a write to attempt to send it to the 3 servers in the DR also? Or does it send it to 1 and let it replicate locally in the DR environment to save bandwidth across the WAN?

Normally this isn't an issue for us, but at times we are writing approximately 1 MB a second of data, and seeing a corresponding 3 MB of traffic across the WAN to all the Cassandra DR servers. If my assumptions are right, is this configurable somehow for writing to one node and letting it do local replication? We are on 1.1.5.

Thanks
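For what it's worth, the observed numbers line up with the coordinator sending one copy per remote replica rather than one copy for local re-replication. A trivial sanity check, using only the rates quoted above:

```python
# If each write is forwarded once per remote replica (RF=3 in the DR DC),
# WAN traffic is write_rate * RF; if it were forwarded once and
# re-replicated locally, it would be write_rate * 1.
write_rate_mb = 1.0   # MB/s written by the app, per the post above
rf_remote     = 3
print(write_rate_mb * rf_remote)  # prints 3.0 -- MB/s over the WAN, as observed
```

So the 3 MB/s measurement itself is evidence of per-replica forwarding (or a topology that isn't DC-aware), which is what the replies below address.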
Re: What does ReadRepair exactly do?
manuzhang wrote:
> read quorum doesn't mean we read newest values from a quorum number of replicas but to ensure we read at least one newest value as long as write quorum succeeded beforehand and W + R > N.

I beg to differ here. Any read/write, by definition of quorum, should have at least n/2 + 1 replicas that agree on that read/write value. Responding to the user with a newer value, even if the write creating the new value hasn't completed, cannot guarantee any read consistency > 1.

Hiller, Dean wrote:
> Kind of an interesting question ... The odds of two nodes going out are pretty slim though.

Bingo! I do understand that the odds of quorum nodes going down are low, and that any subsequent read would achieve a quorum. However, I'm wondering what would be the right thing to do here, given that the client has specifically asked for a certain consistency on the read and Cassandra returns a value that doesn't have that consistency.

The heart of the problem is that the coordinator responds to a client request assuming that consistency has been achieved the moment it issues a row repair with the super-set of the resolved value, without receiving acknowledgement from the replicas that the repair succeeded for the given consistency constraint. In order to adhere to the given consistency specification, the row repair (due to consistent reads) should repeat the read after issuing the repair, to check whether the consistency is met. Like Manu mentioned, this could of course lead to a number of repeat reads if writes arrive quickly, until the read times out.
However, note that we would still be honoring the consistency constraint for that read.
Re: High bandwidth usage between datacenters for cluster
Use the data-center-aware replication strategy and try it with that, so you tell Cassandra about all your data centers, racks, etc.

Dean

From: Bryce Godfrey bryce.godf...@azaleos.com
Reply-To: user@cassandra.apache.org
Date: Thursday, October 25, 2012 11:44 AM
To: user@cassandra.apache.org
Subject: High bandwidth usage between datacenters for cluster

> We have a 5 node cluster, with a matching 5 nodes for DR in another data center. With a replication factor of 3, does the node I send a write to attempt to send it to the 3 servers in the DR also? ...
Re: High bandwidth usage between datacenters for cluster
Use placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and also fill in the topology properties file. This will tell Cassandra that you have two DCs. You can verify this by looking at the output of the ring command. If your DCs are set up properly, only one request will go over the WAN, though the responses from all nodes in the other DC will go over the WAN.

On Thu, Oct 25, 2012 at 10:44 AM, Bryce Godfrey bryce.godf...@azaleos.com wrote:

> We have a 5 node cluster, with a matching 5 nodes for DR in another data center. With a replication factor of 3, does the node I send a write to attempt to send it to the 3 servers in the DR also? Or does it send it to 1 and let it replicate locally in the DR environment to save bandwidth across the WAN? ... If my assumptions are right, is this configurable somehow for writing to one node and letting it do local replication? We are on 1.1.5.
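For reference, a keyspace using NetworkTopologyStrategy can be defined from cassandra-cli roughly like this (a 1.1-era sketch; the keyspace name and the DC names are placeholders, and the DC names must match those in your topology properties file):

```
create keyspace MyKeyspace
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC1:3, DC2:3};
```

With DC-aware placement in effect, writes at a DC-local consistency level such as LOCAL_QUORUM are acknowledged within the local DC while the single forwarded copy replicates in the remote DC, which is the bandwidth behaviour described above.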
Re: Large results and network round trips
I don't have any sample data on this, but read latency will depend on: 1) the consistency level of the read, and 2) disk speed.

Also, you can look at the Netflix client, as it makes the coordinator node the same as the node which holds the data. This will reduce one hop.

On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

> Currently we implement wide rows for most of our entities. ... First we have our thrift limits 15MB, is it possible even in the best case scenario to deliver a 15MB response in under 5ms on a GigaBit ethernet for example?
Re: constant CMS GC using CPU time
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote: This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Yes, I think that was at least part of the issue. The rough numbers I've used to estimate the working set are: * bloom filter size for 400M rows at 0.00074 fp chance, without the java fudge (they are just a big array): 714 MB * memtable size: 1024 MB * index sampling: * 24 bytes + key (16 bytes for UUID) = 32 bytes * 400M / 128 default sampling = 3,125,000 entries * 3,125,000 * 32 bytes = 95 MB * java fudge X5 or X10 = 475 MB to 950 MB * ignoring row cache and key cache So the high side number is 2,213 MB to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full. On a 5120 MB heap, with 800 MB new gen, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload. These values seem reasonable and in line with what I was seeing. There are other CFs and apps sharing this cluster, but this one was the largest. You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; it's probably best to keep repair off. I was using jstat to monitor gc activity and some snippets from that are in my original email in this thread. The key behavior was that full gc was running pretty often and was never able to reclaim much (if any) space. There are a few things you could try: * increase the JVM heap by say 1GB and see how it goes * increase the bloom filter false positive chance, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance ) * increase index_interval sampling in the yaml * decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds.
* disable caches or ensure off-heap caches are used. I've done several of these already, in addition to changing the app to reduce the number of rows retained. How does compaction_throughput relate to memory usage? I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from the default of 8) lowered the memory used during compactions. in_memory_compaction_limit_in_mb seems to only be used for wide rows, and this CF didn't have any rows wider than in_memory_compaction_limit_in_mb. My multithreaded_compaction is still false. Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 For my understanding, what was the average row size for the 400 million keys? The compacted row mean size for the CF is 8815 bytes (as reported by cfstats), but that comes out to be much larger than the real load per node I was seeing. Each node had about 200GB of data for the CF, with 4 nodes in the cluster and RF=3. At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days. Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0, so the number of rows and data dropped to a level it can handle. -Bryan
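Aaron's working-set arithmetic above can be reproduced in a few lines (a sketch using the thread's own figures; the 32-byte index entry size and the 5-10x "java fudge" are the estimates quoted in the thread, not measured values):

```python
rows = 400_000_000

bloom_filter_mb = 714            # 400M rows at 0.00074 fp chance, raw bit array
memtable_mb = 1024               # memtable space from cassandra.yaml

# Index sampling: one ~32-byte entry per 128 rows (default index_interval).
sampled_entries = rows // 128                      # 3,125,000 entries
index_mb = sampled_entries * 32 // (1024 * 1024)   # ~95 MB raw

# Apply the 5x-10x JVM object-overhead guess to the sampled index only.
low_mb = bloom_filter_mb + memtable_mb + index_mb * 5
high_mb = bloom_filter_mb + memtable_mb + index_mb * 10
print(index_mb, low_mb, high_mb)   # 95 2213 2688
```

That 2.2-2.7 GB estimate against roughly 3.2 GB of usable tenured space on a 5120 MB heap is why the margin was thin enough for full GCs to reclaim almost nothing.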
Re: Large results and network round trips
For this scenario, remove disk speed from the equation. Assume the row is completely in the row cache. Also let's assume reads at CL ONE. With this information I would be looking to determine response size / maximum requests per second / max latency. I would use this to say: "You want to do 5,000 reads/sec, on gigabit ethernet, and each row is 10K, in under 5ms latency. Sorry, that is impossible." On Thu, Oct 25, 2012 at 2:58 PM, sankalp kohli kohlisank...@gmail.com wrote: I don't have any sample data on this, but read latency will depend on these: 1) consistency level of the read 2) disk speed. Also you can look at the Netflix client, as it makes the coordinator node the same as the node which holds the data. This will save one hop. On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, Currently we implement wide rows for most of our entities. For example: user { event1=x event2=y event3=z ... } Normally the entries are bounded to be less than 256 columns, and most columns are small in size, say 30 bytes. Because of the blind-write nature of Cassandra it is possible for the column family to get much larger. We have very low latency requirements, for example less than 5ms. Considering network roundtrips and all other factors, I am wondering what is the largest column that is possible in a 5ms window on a gigabit network. First we have our thrift limit of 15MB; is it possible even in the best case scenario to deliver a 15MB response in under 5ms on gigabit ethernet, for example? Does anyone have any real world numbers with reference to payload sizes and standard performance? Thanks all, Edward
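A quick feasibility check of the example workload above (the numbers are the example's, not measurements; this only looks at aggregate wire bandwidth, not per-request latency):

```python
# Aggregate bandwidth needed for 5,000 reads/sec of 10 KB responses.
reads_per_sec = 5_000
row_bytes = 10 * 1024

throughput_mbit = reads_per_sec * row_bytes * 8 / 1e6
print(round(throughput_mbit))     # ~410 Mbit/s
```

The aggregate stream fits in well under half of a gigabit link, so if such a workload turns out to be infeasible, the limit would be per-request latency (round trips, serialization, server-side work), not raw bandwidth.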