Re: one big cluster vs multiple smaller clusters
Hi guys,

Thanks for your replies. They're very helpful. I agree with Plotnik on the scaling part. On the business-logic side, it sounds obvious that it makes sense to split, e.g., metadata and really BIG data into different clusters, as you mentioned. But after thinking about it a bit more, what is the real reason for that if the cluster can be scaled horizontally? Each node still holds the same amount of data, so what is the benefit of having a separate cluster?

An analytics application has to be on a separate cluster, as Paulo pointed out. But if all of our use cases are web applications (as of now), what are the drawbacks of putting everything into one big cluster?

Thanks.
-Wei

On Monday, October 14, 2013 4:15 AM, Paulo Motta pauloricard...@gmail.com wrote:

> By clusters do you mean data centers? If so, I think it depends on your use case and application requirements. For instance, if you have a web application and an analytics application (Hadoop), you would want to separate your cluster into 2 different data centers (even if they're located in the same physical zone). If it's just one application, then you can start with one data center and add more data centers later if needed.
>
> 2013/10/14 Plotnik, Alexey aplot...@rhonda.ru
>
>> If you are talking about scaling: Cassandra scaling is absolutely horizontal, without namenodes or other Mongo-bullshit-like intermediate daemons. That's why one big cluster has the same throughput as many smaller clusters. What will you do when your small clusters exceed their capacity? Cassandra is designed for very large data, so feel free to utilize its capabilities.
>>
>> If you are talking in terms of business logic: it makes sense to divide, e.g., metadata and really BIG data into different clusters, of course.
>>
>> From: Wz1975 [mailto:wz1...@yahoo.com]
>> Sent: October 14, 2013, 7:20
>> To: user@cassandra.apache.org
>> Subject: Re: one big cluster vs multiple smaller clusters
>> Importance: Low
>>
>>> We have the choice of making one big cluster vs. a few small clusters. I am trying to get the pros and cons of both options in general.
>>>
>>> Thanks.
>>> -Wei
>>>
>>> Original message
>>> Subject: Re: one big cluster vs multiple smaller clusters
>>> From: Jon Haddad j...@jonhaddad.com
>>> To: user@cassandra.apache.org
>>>
>>>> This is a pretty vague question. What are you trying to achieve?
>>>>
>>>> On Oct 12, 2013, at 9:05 PM, Wei Zhu wz1...@yahoo.com wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As we bring more use cases to Cassandra, we have been thinking about the best way to host it. Let's say we have 15 physical machines available: we can use all of them to form one big cluster, or divide them into 3 clusters of 5 nodes each. As we will deploy on 1.2, it becomes easier to expand the cluster with vnodes. I really don't see any good reason to make 3 smaller clusters. Did I miss anything obvious?
>>>>>
>>>>> Thanks.
>>>>> -Wei
>
> -- Paulo Ricardo
> -- European Master in Distributed Computing
> Royal Institute of Technology - KTH
> Instituto Superior Técnico - IST
> http://paulormg.com
one big cluster vs multiple smaller clusters
Hi,

As we bring more use cases to Cassandra, we have been thinking about the best way to host it. Let's say we have 15 physical machines available: we can use all of them to form one big cluster, or divide them into 3 clusters of 5 nodes each. As we will deploy on 1.2, it becomes easier to expand the cluster with vnodes. I really don't see any good reason to make 3 smaller clusters. Did I miss anything obvious?

Thanks.
-Wei
Re: sstable size change
What is the output of show keyspaces from cassandra-cli? Did you see the new value?

Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
Compaction Strategy Options:
  sstable_size_in_mb: XXX

From: Keith Wright kwri...@nanigans.com
To: user@cassandra.apache.org
Sent: Wednesday, July 24, 2013 3:44 PM
Subject: Re: sstable size change

> Hi all,
>
> This morning I increased the SSTable size for one of my LCS column families via an alter command and saw at least one compaction run (I did not trigger a compaction via nodetool, nor run upgradesstables, nor remove the .json file). But so far my data file sizes appear unchanged at the default 5 MB (see below for the output of ls -Sal as well as the relevant portion of cfstats). Is this expected? I was hoping to see at least one file at the new 256 MB size I set.
>
> Thanks
>
> SSTable count: 4965
> SSTables in each level: [0, 10, 112/100, 1027/1000, 3816, 0, 0, 0]
> Space used (live): 29062393142
> Space used (total): 29140547702
> Number of Keys (estimate): 195103104
> Memtable Columns Count: 441483
> Memtable Data Size: 205486218
> Memtable Switch Count: 243
> Read Count: 154226729
>
> -rw-rw-r-- 1 cassandra cassandra 5247564 Jul 18 01:33 users-shard_user_lookup-ib-97153-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247454 Jul 23 02:59 users-shard_user_lookup-ib-109063-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247421 Jul 20 14:58 users-shard_user_lookup-ib-103127-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247415 Jul 17 13:56 users-shard_user_lookup-ib-95761-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247379 Jul 21 02:44 users-shard_user_lookup-ib-104718-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247346 Jul 21 21:54 users-shard_user_lookup-ib-106280-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247242 Jul  3 19:41 users-shard_user_lookup-ib-66049-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247235 Jul 21 02:44 users-shard_user_lookup-ib-104737-Data.db
> -rw-rw-r-- 1 cassandra cassandra 5247233 Jul 20 14:58 users-shard_user_lookup-ib-103169-Data.db
>
> From: sankalp kohli kohlisank...@gmail.com
> Reply-To: user@cassandra.apache.org
> Date: Tuesday, July 23, 2013 3:04 PM
> To: user@cassandra.apache.org
> Subject: Re: sstable size change
>
>>> Will Cassandra force any newly compacted files to my new setting as compactions are naturally triggered
>>
>> Yes. Let it compact and increase in size.
>>
>> On Tue, Jul 23, 2013 at 9:38 AM, Robert Coli rc...@eventbrite.com wrote:
>>
>>> On Tue, Jul 23, 2013 at 6:48 AM, Keith Wright kwri...@nanigans.com wrote:
>>>
>>>> Can you elaborate on what you mean by let it take its own course organically? Will Cassandra force any newly compacted files to my new setting as compactions are naturally triggered?
>>>
>>> You see, when two (or more!) SSTables love each other very much, they sometimes decide they want to compact together..
>>>
>>> But seriously, yes. If you force all existing SSTables to level 0, it is as if you just flushed them all. Level compaction then does a whole lot of compaction, using the active table size.
>>>
>>> =Rob
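
To put the numbers in this thread in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes every data file is exactly the configured size and treats the "Space used (live)" figure from the cfstats output above as pure SSTable data, so it is only an estimate; the function name is made up for illustration and is not part of any Cassandra tool.

```python
import math

MB = 1024 * 1024


def estimated_sstable_count(live_bytes: int, sstable_size_mb: int) -> int:
    """Rough number of fixed-size sstables needed to hold live_bytes.

    Assumption: every sstable is exactly sstable_size_mb large, which
    real LCS output only approximates.
    """
    return math.ceil(live_bytes / (sstable_size_mb * MB))


# "Space used (live)" as reported by cfstats in the message above.
live = 29_062_393_142

for size_mb in (5, 256):
    count = estimated_sstable_count(live, size_mb)
    print(f"sstable_size_in_mb={size_mb}: about {count} sstables")
```

The estimate for 5 MB lands in the same range as the reported SSTable count of 4965, while 256 MB would need only on the order of a hundred files once everything has been recompacted.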
Re: AssertionError: Unknown keyspace?
I got bitten by this once. At the very least there should be a message saying that no data is being streamed because the node is a seed node. I searched the source code: the message used to be there, and it was removed in some later version.

-Wei

From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org
Sent: Monday, June 24, 2013 10:34 AM
Subject: Re: AssertionError: Unknown keyspace?

> On Mon, Jun 24, 2013 at 6:04 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
>
>> Oh shoot, this is a seed node. Is there documentation on how to bootstrap a seed node? If I have seeds of A, B, C for every machine on the ring and I am bootstrapping node B, do I just modify cassandra.yaml and remove node B from the yaml file temporarily and boot it up
>
> Yes. The only thing that makes a node fail that check is being in its own seed list. But if the node is in other nodes' seed lists, those nodes will contact it anyway. This strongly implies that the contains() check there is the wrong test, but I've never nailed that down and/or filed a ticket on it. Conversation at the summit suggests I should, making a note to do so...
>
> =Rob
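
The behaviour Rob describes above can be modelled in a few lines of Python. This is only a simplified illustration of the "is this node in its own seed list" check as discussed in the thread, not the actual Cassandra code; the function and parameter names are hypothetical.

```python
from typing import Iterable


def will_bootstrap(own_address: str, seeds: Iterable[str],
                   auto_bootstrap: bool = True) -> bool:
    """Hypothetical model of the check discussed in this thread.

    A node streams data on first start only if bootstrapping is enabled
    and the node does not appear in its own seed list.
    """
    return auto_bootstrap and own_address not in set(seeds)


# Node B listed as its own seed: it silently skips bootstrap.
print(will_bootstrap("B", ["A", "B", "C"]))

# Node B temporarily removed from its own seed list: it bootstraps.
print(will_bootstrap("B", ["A", "C"]))
```

This matches Dean's plan: remove B from B's own cassandra.yaml seed list for the first start, while the other nodes can keep listing B.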
Re: AssertionError: Unknown keyspace?
Here is the line in the source code for 1.1.0:
https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/service/StorageService.java#L549

It was refactored later to this, and the message was removed:
https://github.com/apache/cassandra/blob/cassandra-1.2.0/src/java/org/apache/cassandra/service/StorageService.java#L549

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Monday, June 24, 2013 12:04:10 PM
Subject: Re: AssertionError: Unknown keyspace?

> Yes, it would be nice at startup just to say don't list your seed node as this node and then fail out and we would have known this a long long time ago ;).
>
> Dean
>
> From: Wei Zhu wz1...@yahoo.com
> Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
> Date: Monday, June 24, 2013 12:36 PM
> To: user@cassandra.apache.org
> Subject: Re: AssertionError: Unknown keyspace?
>
>> I have got bitten by it once. At least there should be a message saying, there is no streaming data since it's a seed node. I searched the source code, the message was there and it got removed at certain version.
>>
>> -Wei
>>
>> From: Robert Coli rc...@eventbrite.com
>> To: user@cassandra.apache.org
>> Sent: Monday, June 24, 2013 10:34 AM
>> Subject: Re: AssertionError: Unknown keyspace?
>>
>>> On Mon, Jun 24, 2013 at 6:04 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
>>>
>>>> Oh shoot, this is a seed node. Is there documentation on how to bootstrap a seed node? If I have seeds of A, B, C for every machine on the ring and I am bootstrapping node B, do I just modify cassandra.yaml and remove node B from the yaml file temporarily and boot it up
>>>
>>> Yes. The only thing that makes a node fail that check is being in its own seed list. But if the node is in other nodes' seed lists, those nodes will contact it anyway. This strongly implies that the contains() check there is the wrong test, but I've never nailed that down and/or filed a ticket on it. Conversation at the summit suggests I should, making a note to do so...
>>>
>>> =Rob
Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change
I think new SSTables will be written at the new size, but for that to happen you need to trigger a compaction so that new SSTables are generated. LCS has no major compaction, though. You can run nodetool repair; hopefully that will bring in some new SSTables and compactions will kick in. Or you can change the $CFName.json file under your data directory and move every SSTable to level 0. You need to stop your node, write a simple script to alter that file, and start the node again.

I think it would be helpful to have a nodetool command to change the SSTable size and trigger a rebuild of the SSTables.

Thanks.
-Wei

- Original Message -
From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org
Sent: Friday, June 21, 2013 4:51:29 PM
Subject: Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

> On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki andrew.biale...@gmail.com wrote:
>
>> However, when we alter the column family and then run nodetool upgradesstables -a keyspace columnfamily, the files in the data directory are rewritten, but the file sizes are the same. Is this the expected behavior? If not, what's the right way to upgrade them? If this is expected, how can we benchmark the read/write performance with varying sstable sizes?
>
> It is expected. upgradesstables/scrub/clean compactions work on a single sstable at a time; they are not capable of combining or splitting them.
>
> In theory you could probably:
>
> 1) start out with the largest size you want to test
> 2) stop your node
> 3) use sstable_split [1] to split sstables
> 4) start node, test
> 5) repeat 2-4
>
> I am not sure if there is anything about level compaction which makes this infeasible.
>
> =Rob
>
> [1] https://github.com/pcmanus/cassandra/tree/sstable_split
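
The "simple script" Wei mentions for moving every SSTable to level 0 might look roughly like the Python sketch below. It rests on an assumption about the manifest layout, namely a top-level "generations" list whose entries have "generation" and "members" keys; check that against a real $CFName.json before relying on it. The function name is made up for illustration. Only run something like this with the node stopped and a backup of the original file.

```python
import json
import sys


def move_all_to_level0(manifest: dict) -> dict:
    """Return a copy of the manifest with every sstable moved to level 0.

    ASSUMPTION: the manifest looks like
    {"generations": [{"generation": 0, "members": [...]}, ...]}.
    This layout is not confirmed by the thread above.
    """
    levels = manifest.get("generations", [])
    all_members = sorted(m for level in levels for m in level.get("members", []))
    rewritten = [{"generation": 0, "members": all_members}]
    # Keep the higher levels in the file, but leave them empty.
    rewritten += [
        {"generation": level["generation"], "members": []}
        for level in levels
        if level.get("generation", 0) != 0
    ]
    return dict(manifest, generations=rewritten)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python level0.py /path/to/CFName.json > CFName.json.new
    with open(sys.argv[1]) as handle:
        print(json.dumps(move_all_to_level0(json.load(handle))))
```

Writing the result to standard output rather than editing in place keeps the original manifest untouched until you have inspected the new one.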
Re: Data not fully replicated with 2 nodes and replication factor 2
I don't think you can fully trust hinted handoff; it is a best-effort delivery with no guarantee. Even if the hints were guaranteed to be delivered, there would be a delay, which is part of the eventual consistency paradigm. If you want to enforce real consistency, change your consistency level. Or do a repair.

Thanks.
-Wei

- Original Message -
From: James Lee james@metaswitch.com
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com, rc...@eventbrite.com
Sent: Thursday, June 20, 2013 3:21:30 AM
Subject: RE: Data not fully replicated with 2 nodes and replication factor 2

> Rob, Wei, thank you both for your responses - from what Rob says below my test is a valid one.
>
> I've run some additional tests and observed the following:
>
> -- I mentioned before that some of the writes initially failed and then succeeded when the test tool retried them. I've checked that there's no correlation between the keys for writes which required a retry and the keys for the failed reads (i.e. the reads are failing for keys that were written fine at the first attempt).
>
> -- I've retried this test but limiting the rate of initial writes to be much lower (from 8000/s down to 2000/s). This makes the problem go away completely: no more read failures.
>
> So it seems like I have exposed a genuine bug in Cassandra replication which manifests under high write load. What's the best next step - should I be filing a bug report, and if so what diagnostics are likely to be useful?
>
> Thanks,
> James Lee
>
> -Original Message-
> From: Robert Coli [mailto:rc...@eventbrite.com]
> Sent: 19 June 2013 20:59
> To: user@cassandra.apache.org; Wei Zhu
> Subject: Re: Data not fully replicated with 2 nodes and replication factor 2
>
>> On Wed, Jun 19, 2013 at 11:43 AM, Wei Zhu wz1...@yahoo.com wrote:
>>
>>> I think hints are only stored when the other node is down, not on the dropped mutations. (Correct me if I am wrong, actually it's not a bad idea to store hints for dropped mutations and replay them later?)
>>
>> This used to be the way it worked pre-1.0...
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-2034
>>
>> In modern cassandra, anything but a successful ack from a coordinated write results in a hint on the coordinator.
>>
>>> To solve your issue, as I mentioned, either do nodetool repair, or increase your consistency level. By the way, you probably write faster than your cluster can handle if you see that many dropped mutations.
>>
>> If his hints are ultimately delivered, OP should not need repair to be consistent.
>>
>> =Rob
Re: Heap is not released and streaming hangs at 0%
If you want, you can try to force a GC through JConsole (Memory tab, Perform GC). In theory it triggers a full GC, but when it actually happens depends on the JVM.

-Wei

- Original Message -
From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org
Sent: Tuesday, June 18, 2013 10:43:13 AM
Subject: Re: Heap is not released and streaming hangs at 0%

> On Tue, Jun 18, 2013 at 10:33 AM, srmore comom...@gmail.com wrote:
>
>> But then shouldn't the JVM GC it eventually? I can still see Cassandra alive and kicking but looks like the heap is locked up even after the traffic is long stopped.
>
> No, when the GC system fails this hard it is often a permanent failure which requires a restart of the JVM.
>
>> nodetool -h localhost flush didn't do much good.
>
> This adds support to the idea that your heap is too full, and not full of memtables.
>
> You could try nodetool -h localhost invalidatekeycache, but that probably will not free enough memory to help you.
>
> =Rob
Re: Data not fully replicated with 2 nodes and replication factor 2
You have a lot of Dropped Mutations, which means those writes might not have gone through. Since you use CL.ONE as the write consistency, your client doesn't see an exception if the write fails on only one node. I think hints are only stored when the other node is down, not on the dropped mutations. (Correct me if I am wrong, actually it's not a bad idea to store hints for dropped mutations and replay them later?)

To solve your issue, as I mentioned, either do nodetool repair, or increase your consistency level. By the way, you probably write faster than your cluster can handle if you see that many dropped mutations.

-Wei

- Original Message -
From: James Lee james@metaswitch.com
To: user@cassandra.apache.org
Sent: Wednesday, June 19, 2013 2:22:39 AM
Subject: RE: Data not fully replicated with 2 nodes and replication factor 2

> The test tool I am using catches any exceptions on the original writes and resubmits the write request until it's successful (bailing out after 5 failures). So for each key Cassandra has reported a successful write.
>
> Nodetool says the following - I'm guessing the pending hinted handoff is the interesting bit?
>
> comet-mvs01:/dsc-cassandra-1.2.2# ./bin/nodetool tpstats
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0          35445         0                 0
> RequestResponseStage              0         0        1535171         0                 0
> MutationStage                     0         0        3038941         0                 0
> ReadRepairStage                   0         0           2695         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0           2898         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MigrationStage                    0         0            245         0                 0
> MemtablePostFlusher               0         0           1260         0                 0
> FlushWriter                       0         0            633         0               212
> MiscStage                         0         0              0         0                 0
> commitlog_archiver                0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> HintedHandoff                     1         1              0         0                 0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> BINARY                       0
> READ                         0
> MUTATION                 60427
> _TRACE                       0
> REQUEST_RESPONSE             0
>
> Looking at the hints column family in the system keyspace, I see one row with a large number of columns. Presumably that along with the nodetool output above suggests there are hinted handoffs pending? How long should I expect these to remain for?
>
> Ah, actually now that I re-run the command it seems that nodetool now reports that hint as completed and there are no hints left in the system keyspace on either node. I'm still seeing failures to read the data I'm expecting though, as before. Note that I've run this with a smaller data set (2M rows, 1GB data total) for this latest test.
>
> Thanks,
> James
>
> -Original Message-
> From: Robert Coli [mailto:rc...@eventbrite.com]
> Sent: 18 June 2013 19:45
> To: user@cassandra.apache.org
> Subject: Re: Data not fully replicated with 2 nodes and replication factor 2
>
>> On Tue, Jun 18, 2013 at 11:36 AM, Wei Zhu wz1...@yahoo.com wrote:
>>
>>> Cassandra doesn't do async replication like HBase does. You can run nodetool repair to ensure the consistency.
>>
>> While this answer is true, it is somewhat non-responsive to the OP.
>>
>> If the OP didn't see a timeout exception, the theoretical worst case is that he should have hints stored for writes that initially failed to replicate. His nodes should not be failing GC with a total data size of 5gb on an 8gb heap, so those hints should deliver quite quickly. After 30 minutes those hints should certainly be delivered.
>>
>> @OP : do you see hints being stored? does nodetool tpstats indicate dropped messages?
>>
>> =Rob
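
Since the dropped-message counts in nodetool tpstats were the key clue in this thread, here is a small Python sketch for pulling them out of captured output. It assumes the layout shown in James's message (a "Message type Dropped" header followed by name/count pairs); the function name and the SAMPLE constant are made up for illustration, and other Cassandra versions may format the output differently.

```python
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        3038941         0                 0
HintedHandoff                     1         1              0         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                 60427
_TRACE                       0
REQUEST_RESPONSE             0
"""


def dropped_messages(tpstats_text: str) -> dict:
    """Map message type to dropped count from `nodetool tpstats` text."""
    counts = {}
    in_dropped_section = False
    for line in tpstats_text.splitlines():
        fields = line.split()
        if fields[:3] == ["Message", "type", "Dropped"]:
            in_dropped_section = True
            continue
        # Inside the section every data row is "<TYPE> <count>".
        if in_dropped_section and len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


dropped = dropped_messages(SAMPLE)
nonzero = {name: n for name, n in dropped.items() if n > 0}
print(nonzero)
```

A non-zero MUTATION count, as in this thread, is the signal Wei and Rob point at: writes are arriving faster than the node can apply them.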
Re: Data not fully replicated with 2 nodes and replication factor 2
Rob,

Thanks. I was not aware of that. So we can avoid repair if there is no hardware failure... I found a blog post: http://www.datastax.com/dev/blog/modern-hinted-handoff

-Wei

- Original Message -
From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Wednesday, June 19, 2013 12:58:45 PM
Subject: Re: Data not fully replicated with 2 nodes and replication factor 2

> On Wed, Jun 19, 2013 at 11:43 AM, Wei Zhu wz1...@yahoo.com wrote:
>
>> I think hints are only stored when the other node is down, not on the dropped mutations. (Correct me if I am wrong, actually it's not a bad idea to store hints for dropped mutations and replay them later?)
>
> This used to be the way it worked pre-1.0...
>
> https://issues.apache.org/jira/browse/CASSANDRA-2034
>
> In modern cassandra, anything but a successful ack from a coordinated write results in a hint on the coordinator.
>
>> To solve your issue, as I mentioned, either do nodetool repair, or increase your consistency level. By the way, you probably write faster than your cluster can handle if you see that many dropped mutations.
>
> If his hints are ultimately delivered, OP should not need repair to be consistent.
>
> =Rob
Re: Data not fully replicated with 2 nodes and replication factor 2
Cassandra doesn't do async replication like HBase does. You can run nodetool repair to ensure consistency. Or you can increase your read or write consistency level. As long as R + W > RF, you have strong consistency. In your case, you can use CL.TWO for either reads or writes.

-Wei

- Original Message -
From: James Lee james@metaswitch.com
To: user@cassandra.apache.org
Sent: Tuesday, June 18, 2013 5:02:53 AM
Subject: Data not fully replicated with 2 nodes and replication factor 2

> Hello,
>
> I’m seeing a strange problem with a 2-node Cassandra test deployment, where it seems that data isn’t being replicated among the nodes as I would expect. I suspect this may be a configuration issue of some kind, but have been unable to figure what I should change.
>
> The setup is as follows:
>
> - Two Cassandra nodes in the cluster (they each have themselves and the other node as seeds in cassandra.yaml).
> - Create 40 keyspaces, each with simple replication strategy and replication factor 2.
> - Populate 125,000 rows into each keyspace, using a pycassa client with a connection pool pointed at both nodes (I’ve verified that pycassa does indeed send roughly half the writes to each node). These are populated with writes using consistency level of 1.
> - Wait 30 minutes (to give replications a chance to complete).
> - Do random reads of the rows in the keyspaces, again using a pycassa client with a connection pool pointed at both nodes. These are read using consistency level 1.
>
> I’m finding that the vast majority of reads are successful, but a small proportion (~0.1%) are returned as Not Found. If I manually try to look up those keys using cassandra-cli, I see that they are returned when querying one of the nodes, but not when querying the other. So it seems like some of the rows have simply not been replicated.
>
> I’m not sure how I can monitor the status of ongoing replications, but the system has been idle for many tens of minutes and the total database size is only about 5GB, so I don’t think there are any further ongoing operations.
>
> Any suggestions? In case it’s relevant, my setup is:
>
> - Cassandra 1.2.2, running on Linux
> - Sun Java 1.7.0_10-b18 64-bit
> - Java heap settings: -Xms8192M -Xmx8192M -Xmn2048M
>
> Thank you,
> James Lee
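
The overlap rule from Wei's reply above (read replicas plus write replicas must exceed the replication factor) can be checked with a short Python sketch. The function names are made up for illustration; the consistency level names are the usual Cassandra ones, with QUORUM taken as a majority of RF.

```python
FIXED_LEVELS = {"ONE": 1, "TWO": 2, "THREE": 3}


def replicas_required(level: str, rf: int) -> int:
    """Number of replicas that must respond for a consistency level."""
    if level == "QUORUM":
        return rf // 2 + 1
    if level == "ALL":
        return rf
    return FIXED_LEVELS[level]


def reads_see_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when R + W > RF, so every read set overlaps every write set."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


# The setup described in this thread: RF=2, reads and writes at ONE.
print(reads_see_latest_write("ONE", "ONE", rf=2))   # 1 + 1 is not > 2

# Raising either side to TWO restores the overlap.
print(reads_see_latest_write("TWO", "ONE", rf=2))   # 2 + 1 > 2
print(reads_see_latest_write("ONE", "TWO", rf=2))   # 1 + 2 > 2
```

With RF=2, the price of the overlap is that the side running at TWO fails whenever one of the two nodes is unavailable.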
Re: Large number of files for Leveled Compaction
The default value of 5MB is way too small in practice. Too many files in one directory is not a good thing. It's not clear what a good number would be. I have heard of people using 50MB, 75MB, even 100MB. Do your own test to find the right number.

-Wei

- Original Message -
From: Franc Carter franc.car...@sirca.org.au
To: user@cassandra.apache.org
Sent: Sunday, June 16, 2013 10:15:22 PM
Subject: Re: Large number of files for Leveled Compaction

> On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali mainalima...@gmail.com wrote:
>
>> Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller sstables into large ones. With the LeveledCompaction, the sstables are always of fixed size but they are grouped into different levels.
>>
>> You can refer to this page http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on details of how LeveledCompaction works.
>
> Yes, but it seems I've misinterpreted that page ;-(
>
> I took this paragraph
>
>     In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.
>
> to mean that once a level fills up it gets compacted into a higher level
>
> cheers
>
>> Cheers
>> Manoj
>>
>> On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter franc.car...@sirca.org.au wrote:
>>
>>> On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali mainalima...@gmail.com wrote:
>>>
>>>> With LeveledCompaction, each sstable size is fixed and is defined by sstable_size_in_mb in the compaction configuration of the CF definition, and the default value is 5MB. In your case, you may not have defined your own value, which is why each of your sstables is 5MB. And if your dataset is huge, you will see a lot of sstables.
>>>
>>> Ok, seems like I do have (at least) an incomplete understanding. I realise that the minimum size is 5MB, but I thought compaction would merge these into a smaller number of larger sstables?
>>>
>>> thanks
>>>
>>>> Cheers
>>>> Manoj
>>>>
>>>> On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter franc.car...@sirca.org.au wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it may be a win for us.
>>>>>
>>>>> The first step of testing was to push a fairly large slab of data into the Column Family - we did this much faster (x100) than we would in a production environment. This has left the Column Family with about 140,000 files in the Column Family directory which seems way too high. On two of the nodes the CompactionStats show 2 outstanding tasks and on a third node there are over 13,000 outstanding tasks. However from looking at the log activity it looks like compaction has finished on all nodes.
>>>>>
>>>>> Is this number of files expected/normal?
>>>>>
>>>>> cheers
>>>>>
>>>>> --
>>>>> Franc Carter | Systems architect | Sirca Ltd
>>>>> franc.car...@sirca.org.au | www.sirca.org.au
>>>>> Tel: +61 2 8355 2514
>>>>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>>>> PO Box H58, Australia Square, Sydney NSW 1215
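
To illustrate Manoj's point that leveled compaction keeps sstables at a fixed size and groups them into levels, here is a small Python sketch of an idealized steady state. It assumes L0 is empty and that each level holds ten times as many sstables as the one below it (10 in L1, 100 in L2, and so on), which is what the linked blog post describes; the function name is made up, and a real node will deviate from this picture while compactions are pending.

```python
def ideal_level_counts(total_sstables: int, fanout: int = 10) -> list:
    """Distribute fixed-size sstables over levels, filling L1 upward.

    Assumptions: L0 is empty (everything already compacted) and level N
    holds at most fanout**N sstables.
    """
    counts = [0]  # index 0 is L0
    remaining = total_sstables
    capacity = fanout
    while remaining > 0:
        used = min(remaining, capacity)
        counts.append(used)
        remaining -= used
        capacity *= fanout
    return counts


# Roughly the file count reported in this thread (Data.db files only
# would be fewer, since each sstable consists of several files).
print(ideal_level_counts(140_000))
```

So a large file count is expected with 5MB sstables: the data is not merged into bigger files, it is spread over more levels.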
Re: Large number of files for Leveled Compaction
Correction, the largest I heard is 256MB SSTable size.

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Sunday, June 16, 2013 10:28:25 PM
Subject: Re: Large number of files for Leveled Compaction

> default value of 5MB is way too small in practice. Too many files in one directory is not a good thing. It's not clear what should be a good number. I have heard people are using 50MB, 75MB, even 100MB. Do your own test to find a right number.
>
> -Wei
Re: High performance disk io
For us, the biggest killer is repair and the compaction following repair. If you are running VNodes, you need to test the performance while running repair.

- Original Message -
From: Igor i...@4friends.od.ua
To: user@cassandra.apache.org
Sent: Wednesday, May 22, 2013 7:48:34 AM
Subject: Re: High performance disk io

> On 05/22/2013 05:41 PM, Christopher Wirt wrote:
>
>> Hi Igor,
>>
>> Yea same here, 15ms for 99th percentile is our max. Currently getting one or two ms for most CF. It goes up at peak times which is what we want to avoid.
>
> Our 99 percentile also goes up at peak times but stays at an acceptable level.
>
>> We’re using Cass 1.2.4 w/vnodes and our own barebones driver on top of thrift. Needed to be .NET so Hector and Astyanax were not options.
>
> Astyanax is token-aware, so we avoid extra data hops between cassandra nodes.
>
>> Do you use SSDs or multiple SSDs in any kind of configuration or RAID?
>
> No, single SSD per host
>
>> Thanks
>> Chris
>>
>> From: Igor [mailto:i...@4friends.od.ua]
>> Sent: 22 May 2013 15:07
>> To: user@cassandra.apache.org
>> Subject: Re: High performance disk io
>>
>>> Hello
>>>
>>> What level of read performance do you expect? We have a limit of 15 ms for the 99 percentile, with average read latency near 0.9ms. For some CF the 99 percentile actually equals 2ms, for others 10ms; this depends on the data volume you read in each query.
>>>
>>> Tuning read performance involved cleaning up the data model, tuning cassandra.yaml, switching from Hector to astyanax, and tuning OS parameters.
>>>
>>> On 05/22/2013 04:40 PM, Christopher Wirt wrote:
>>>
>>>> Hello,
>>>>
>>>> We’re looking at deploying a new ring where we want the best possible read performance.
>>>>
>>>> We’ve setup a cluster with 6 nodes, replication level 3, 32Gb of memory, 8Gb Heap, 800Mb keycache, each holding 40/50Gb of data on a 200Gb SSD and 500Gb SATA for OS and commitlog
>>>>
>>>> Three column families
>>>> ColFamily1 50% of the load and data
>>>> ColFamily2 35% of the load and data
>>>> ColFamily3 15% of the load and data
>>>>
>>>> At the moment we are still seeing around 20% disk utilisation and occasionally as high as 40/50% on some nodes at peak time.. we are conducting some semi live testing. CPU looks fine, memory is fine, keycache hit rate is about 80% (could be better, so maybe we should be increasing the keycache size?)
>>>>
>>>> Anyway, we’re looking into what we can do to improve this.
>>>>
>>>> One conversation we are having at the moment is around the SSD disk setup.. We are considering moving to have 3 smaller SSD drives and spreading the data across those.
>>>>
>>>> The possibilities are:
>>>>
>>>> - We have a RAID0 of the smaller SSDs and hope that improves performance. Will this actually yield better throughput?
>>>> - We mount the SSDs to different directories and define multiple data directories in cassandra.yaml. Will not having a layer of RAID controller improve the throughput?
>>>> - We mount the SSDs to different column family directories and have a single data directory declared in cassandra.yaml. Think this is quite an attractive idea. What are the drawbacks? System column families will be on the main SATA?
>>>> - We don’t change anything and just keep upping our keycache.
>>>> - Anything you guys can think of.
>>>>
>>>> Ideas and thoughts welcome. Thanks for your time and expertise.
>>>>
>>>> Chris
Re: High performance disk io
Without VNodes, repair -pr streams data for all the replicas and repairs all of them, so it impacts RF nodes. With VNodes, the streaming and compaction should happen on all the physical nodes. I have heard that repair is even worse with VNodes. Test it and see how it goes.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Wednesday, May 22, 2013 12:19:44 PM
Subject: Re: High performance disk io

If you are only running repair on one node, should it not skip that node? So there should be no performance hit, except when doing CL_ALL of course. We had to make a change to cassandra because slow nodes did impact us previously.

Dean

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Wednesday, May 22, 2013 1:16 PM
To: user@cassandra.apache.org
Subject: Re: High performance disk io

For us, the biggest killer is repair and compaction following repair. If you are running VNodes, you need to test the performance while running repair.
Re: any way to get the #writes/second, reads per second
We have a long-running script that wakes up every minute to get read and write counts through JMX. It calculates r/s and w/s and sends them to ganglia. We are thinking of using graphite, which comes with the kind of intelligence Tomàs mentioned, but it's just too big a change for our infrastructure.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Tuesday, May 14, 2013 4:37:14 AM
Subject: Re: any way to get the #writes/second, reads per second

Cool, thanks,
Dean

From: Tomàs Núnez tomas.nu...@groupalia.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, May 14, 2013 4:53 AM
To: user user@cassandra.apache.org
Subject: Re: any way to get the #writes/second, reads per second

Yes, there isn't any place in JMX with reads/second or writes/second, just CompletedTasks. I send this info to Graphite (http://graphite.wikidot.com/) and use the derivative function to get reads/minute. Munin also does the trick (https://github.com/tcurdt/jmx2munin). But you can't do that with cassandra itself; you need somewhere to make the calculations (cacti is also a good match). Hope that helps.

There is some more information about monitoring this kind of thing here: http://www.tomas.cat/blog/en/monitoring-cassandra-relevant-data-should-be-watched-and-how-send-it-graphite

2013/5/13 Hiller, Dean dean.hil...@nrel.gov

We are running a pretty consistent load on our cluster and added a new node to a 6 node cluster on Friday (QA worked great, but production not so much). One mistake that was made was starting up the new node and then disabling the firewall :( which allowed nodes to discover it BEFORE the node bootstrapped itself. We shut down the node and booted it up again, and it bootstrapped itself, streaming all the data in. After that though, all the nodes have really, really high load numbers now. We are still trying to figure out what is going on.

Is there any way to get the number of reads/second and writes/second through JMX or something? The only way I can see of doing this is manually calculating it by timing the read count and dividing by my manual stopwatch's start/stop times (time range).

Also, while my load is "load average: 20.31, 19.10, 19.72", what does a normal iostat look like? My iostat await time is 13.66 ms, which I think is kind of bad, but not bad enough to cause a load of 20.31?

Device:  rrqm/s  wrqm/s    r/s   w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.02    0.07  11.70  1.96  1353.67  702.88    150.58      0.19  13.66   3.61   4.93
sdb        0.00    0.02   0.11  0.46    20.72   97.54    206.70      0.00   1.33   0.67   0.04

Thanks,
Dean
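The calculation Wei's script performs (sample a cumulative JMX counter once a minute and turn the difference into r/s or w/s) can be sketched as below. This is only an illustration: the `Sample` type and `rate_per_second` helper are made-up names, fetching the counter from JMX is left out, and treating a shrinking counter as a node restart is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a cumulative counter, e.g. a completed-read count."""
    timestamp_s: float  # time of the reading, in seconds
    count: int          # cumulative operations reported at that time


def rate_per_second(prev: Sample, curr: Sample) -> float:
    """Turn two readings of a cumulative counter into operations per second."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.count - prev.count
    if delta < 0:
        # Assumption: the counter went backwards because the node restarted,
        # so everything counted since the restart happened in this interval.
        delta = curr.count
    return delta / elapsed


if __name__ == "__main__":
    # Two readings taken one minute apart.
    print(rate_per_second(Sample(0.0, 1000), Sample(60.0, 7000)))
```

Graphite's derivative function, which Tomàs mentions, does the same subtraction on the server side, so a script like this is only needed when the monitoring system cannot do it.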
Re: (unofficial) Community Poll for Production Operators : Repair
1) 1.1.6 on 5 nodes, 24CPU, 72 RAM
2) Local quorum (we only have one DC though). We do deletes through TTL.
3) Yes
4) Once a week, rolling repairs -pr using a cron job
5) It definitely has a negative impact on performance. Our data size is around 100G per node, and during repair it brings in an additional 60G - 80G of data and creates about 7K compaction tasks (we use LCS with an SSTable size of 10M, which was a mistake we made at the beginning). It takes more than a day for the compaction tasks to clear, and by then the next compaction starts. We had to set a client-side (Hector) timeout to deal with it, and the SLA is still under control for now. But we had to halt go-live for another cluster because of the unanticipated doubling of disk space during repair.

Per Dean's question about simulating a slow response: someone on IRC mentioned a trick of starting Cassandra with -f and then pressing ctrl-z, and it works for our test.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Tuesday, May 14, 2013 4:48:02 AM
Subject: Re: (unofficial) Community Poll for Production Operators : Repair

We had to roll out a fix in cassandra because a slow node was slowing down our cassandra clients in 1.2.2 for some reason. Every time we had a slow node, we found out fast as performance degraded. We tested this in QA and had the same issue. This means a repair made that node slow, which made our clients slow. With this fix, which I think one of our team is going to try to get back into cassandra, the slow node does not affect our clients anymore.

I am curious though: if someone else used the tc program to simulate linux packet delay on a single node, would your client's response time get much slower? We simulated a 500ms delay on the node to simulate the slow node…. It seems the coordinator node was incorrectly waiting for BOTH responses on CL_QUORUM instead of just one (as it was itself one as well), or something like that. (I don't know too much, as my colleague was the one who debugged this issue.)

Dean

From: Alain RODRIGUEZ arodr...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, May 14, 2013 1:42 AM
To: user@cassandra.apache.org
Subject: Re: (unofficial) Community Poll for Production Operators : Repair

Hi Rob,

1) 1.2.2 on 6 to 12 EC2 m1.xlarge
2) Quorum RW. Almost no deletes (just some TTL)
3) Yes
4) On each node once a week (rolling repairs using crontab)
5) The only behavior that is quite odd or unexplained to me is why a repair doesn't fix a counter mismatch between 2 nodes. I mean that when I read my counters with CL.One I get inconsistency (the counter value may change any time I read it, depending, I guess, on which node I read from). Reading with CL.Quorum fixes this bug, but the data is still wrong on some nodes. About performance: it's quite expensive to run a repair, but doing it in a low-load period and in a rolling fashion works quite well and has no impact on the service.

Hope this will help somehow. Let me know if you need more information.

Alain

2013/5/10 Robert Coli rc...@eventbrite.com

Hi! I have been wondering how Repair is actually used by operators. If people operating Cassandra in production could answer the following questions, I would greatly appreciate it.

1) What version of Cassandra do you run, on what hardware?
2) What consistency level do you write at? Do you do DELETEs?
3) Do you run a regularly scheduled repair?
4) If you answered yes to 3, what is the frequency of the repair?
5) What has been your subjective experience with the performance of repair? (Does it work as you would expect? Does its overhead have a significant impact on the performance of your cluster?)

Thanks!
=Rob
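Both Wei and Alain describe a weekly rolling `repair -pr` driven by cron. One way to stagger it so that each node repairs on a different night is sketched below. The host names, the 02:00 start, and the `weekly_repair_crontab` helper are all hypothetical; the thread does not say how either cluster actually lays out its schedule.

```python
def weekly_repair_crontab(hosts, hour=2, minute=0):
    """Return one crontab line per host, each on a different day of the week.

    Each line is meant to be installed in the crontab of the host it is keyed
    by. With more than seven hosts the days wrap around, so some nights will
    have more than one node repairing.
    """
    entries = {}
    for index, host in enumerate(hosts):
        day_of_week = index % 7  # cron convention: 0 = Sunday
        entries[host] = f"{minute} {hour} * * {day_of_week} nodetool repair -pr"
    return entries


if __name__ == "__main__":
    # Hypothetical five-node cluster.
    for host, line in weekly_repair_crontab(
            ["cass1", "cass2", "cass3", "cass4", "cass5"]).items():
        print(f"{host}: {line}")
```

Given Wei's report that post-repair compaction can take more than a day to clear, a one-night gap between nodes may still let the compaction backlogs overlap.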
Re: compaction throughput rate not even close to 16MB
Same here. We disabled the throttling, and our disk and CPU usage are both low ( 10%), yet LCS compaction still takes hours to finish after a repair. For this cluster, we don't delete any data, so we can rule out tombstones. Not sure what is holding compaction back. My observation is that for LCS compactions that involve a large number of SSTables (we set the SSTable size too small at 10M, so one compaction sometimes involves up to 10G of data = 1000 SSTables), the throughput is lower. So my theory is that opening and closing file handles has a substantial impact on the throughput. By the way, we are on SSD.

-Wei

From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Wednesday, April 24, 2013 1:37 PM
Subject: Re: compaction throughput rate not even close to 16MB

Thanks much!!! Better to hear at least one other person sees the same thing ;). Sometimes these posts just go silent.

Dean

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, April 24, 2013 2:33 PM
To: user@cassandra.apache.org
Subject: Re: compaction throughput rate not even close to 16MB

I have noticed the same. I think in the real world your compaction throughput is limited by other things. If I had to speculate, I would say that compaction can remove expired tombstones, but doing this requires bloom filter checks, etc. I think that setting is more important with multithreaded compaction and/or more compaction slots. In those cases it may actually throttle something.

On Wed, Apr 24, 2013 at 3:54 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

I was wondering about the compactionthroughput. I never see ours get even close to 16MB, and I thought this is supposed to throttle compaction, right? Ours is constantly less than 3MB/sec from looking at our logs, or do I have this totally wrong? How can I see the real throughput so that I can understand how to throttle it when I need to?

94,940,780 bytes to 95,346,024 (~100% of original) in 38,438ms = 2.365603MB/s. 2,350,114 total rows, 2,350,022 unique. Row merge counts were {1:2349930, 2:92, }

Thanks,
Dean
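To answer the "how can I see the real throughput" part from the log alone, the figure can be recomputed from the byte count and the duration in a line like the one Dean pasted. A minimal sketch, assuming the log line always has the "N bytes to M (...) in Tms" shape shown above; the regular expression and function name are illustrative and not taken from Cassandra.

```python
import re

# Matches e.g. "94,940,780 bytes to 95,346,024 (~100% of original) in 38,438ms"
COMPACTION_LINE = re.compile(r"([\d,]+) bytes to ([\d,]+) \(.*?\) in ([\d,]+)ms")


def compaction_throughput_mb_s(log_line: str) -> float:
    """Recompute MB/s from the bytes written and the elapsed milliseconds."""
    match = COMPACTION_LINE.search(log_line)
    if match is None:
        raise ValueError("not a compaction summary line")
    bytes_written = int(match.group(2).replace(",", ""))
    elapsed_ms = int(match.group(3).replace(",", ""))
    megabytes = bytes_written / (1024 * 1024)
    return megabytes / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    line = ("94,940,780 bytes to 95,346,024 (~100% of original) "
            "in 38,438ms = 2.365603MB/s.")
    print(round(compaction_throughput_mb_s(line), 6))
```

Run on Dean's line, this gives about 2.3656 MB/s, which agrees with the logged 2.365603MB/s when the written byte count is divided by 1,048,576. Summing this over all compaction lines in a time window would give the aggregate rate to compare against the 16MB throttle.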
move data from Cassandra 1.1.6 to 1.2.4
Hi,

We are trying to upgrade from 1.1.6 to 1.2.4, but it's not really a live upgrade. We are going to retire the old hardware and bring in a set of new hardware for 1.2.4. The old cluster has 5 nodes with RF = 3 and a total of 1TB of data. The new cluster will have 10 nodes with RF = 3, and we will use VNodes. What is the best way to bring the data from 1.1.6 to 1.2.4?

A couple of concerns:
* We also use LCS and plan to increase the SSTable size.
* We use RandomPartitioner. Should we stick with it rather than switch to Murmur3?

Thanks for your feedback.

-Wei
Re: move data from Cassandra 1.1.6 to 1.2.4
Hi Dean,

It's a slightly different case for us. We will have a set of new machines to replace the old ones, and we want to migrate the data over. I imagine doing something like:
* Let the new nodes (with VNodes) join the cluster.
* Decommission the old nodes (without VNodes).

Thanks.

-Wei

From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com
Sent: Tuesday, April 23, 2013 11:17 AM
Subject: Re: move data from Cassandra 1.1.6 to 1.2.4

We went from 1.1.4 to 1.2.2. In QA a rolling restart failed, but in both production and QA, bringing down the whole cluster, upgrading every node, and then bringing it back up worked fine. We left ours at RandomPartitioner and had LCS as well. We did not convert to VNodes at all. I don't know if it helps, but it is a similar case, I would think.

Dean

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Tuesday, April 23, 2013 12:11 PM
To: Cassandr usergroup user@cassandra.apache.org
Subject: move data from Cassandra 1.1.6 to 1.2.4

Hi,

We are trying to upgrade from 1.1.6 to 1.2.4, it's not really a live upgrade. We are going to retire the old hardware and bring in a set of new hardware for 1.2.4. For old cluster, we have 5 nodes with RF = 3, total of 1TB data. For new cluster, we will have 10 nodes with RF = 3. We will use VNodes. What is the best way to bring the data from 1.1.6 to 1.2.4?

A couple of concerns:
* We also use LCS and plan to increase SSTable size.
* We use randomPartitioner, we should stick with it, not to mess up with murmur3?

Thanks for your feedback.

-Wei
Re: How to make compaction run faster?
We have tried very hard to speed up LCS on 1.1.6 with no luck. It seems to be single threaded, and there is not much parallelism you can achieve. 1.2 does come with parallel LCS, which should help. One more thing to try is to enlarge the SSTable size, which will reduce the number of SSTables. It *might* help LCS.

-Wei

- Original Message -
From: Alexis Rodríguez arodrig...@inconcertcc.com
To: user@cassandra.apache.org
Sent: Thursday, April 18, 2013 11:03:13 AM
Subject: Re: How to make compaction run faster?

Jay,

await, according to iostat's man page, is the time it takes for a request to the disk to be served. You may try changing the I/O scheduler. I've read that noop is recommended for SSDs; you can check here: http://goo.gl/XMiIA

Regarding compaction, a week ago we had serious problems with compaction on a test machine, solved by changing from openjdk 1.6 to sun-jdk 1.6.

On Thu, Apr 18, 2013 at 2:08 PM, Jay Svc jaytechg...@gmail.com wrote:
> By the way, the compaction and the commit log disk latency are two separate problems I see. The important one is the compaction problem. How can I speed that up?
> Thanks,
> Jay
>
> On Thu, Apr 18, 2013 at 12:07 PM, Jay Svc jaytechg...@gmail.com wrote:
>> Looks like the formatting is a bit messed up. Please let me know if you want the same in a clean format.
>> Thanks,
>> Jay
>>
>> On Thu, Apr 18, 2013 at 12:05 PM, Jay Svc jaytechg...@gmail.com wrote:
>>> Hi Aaron, Alexis,
>>> Thanks for the reply. Please find some more details below.
>>>
>>> Core problem: compaction is taking a long time to finish, so it will affect my reads. I have spare CPU and memory and want to use them to speed up the compaction process.
>>>
>>> Parameters used:
>>> 1. SSTable size: 500MB (tried various sizes from 20MB to 1GB)
>>> 2. Compaction throughput mb per sec: 250MB (tried from 16MB to 640MB)
>>> 3. Concurrent write: 196 (tried from 32 to 296)
>>> 4. Concurrent compactors: 72 (tried from disabled up to 172)
>>> 5. Multithreaded compaction: true (tried both true and false)
>>> 6. Compaction strategy: LCS (tried STCS as well)
>>> 7. Memtable total space in mb: 4096 MB (tried default and some other params too)
>>> Note: I have tried almost every combination of these parameters.
>>>
>>> Observations: I ran a test for 1.15 hrs with writes at the rate of 21000 records/sec (total 60GB of data during the 1.15 hrs). After I stopped the test, compaction took an additional 1.30 hrs to finish, which reduced the SSTable count from 170 to 17.
>>> CPU (24 cores): almost 80% idle during the run
>>> JVM: 48G RAM, 8G heap (3G to 5G heap used)
>>> Pending writes: sometimes high spikes for a small amount of time, otherwise pretty flat
>>>
>>> Aaron, please find the iostat output below; sdb and dm-2 are the commitlog disks. This is the iostat of 3 different boxes in my cluster.
>>>
>>> -bash-4.1$ iostat -xkcd
>>> Linux 2.6.32-358.2.1.el6.x86_64 (edc-epod014-dl380-3) 04/18/2013 _x86_64_ (24 CPU)
>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>            1.20   1.11     0.59     0.01    0.00  97.09
>>> Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>> sda        0.03  416.56  9.00    7.08  1142.49  1694.55    352.88      0.07   4.08   0.57   0.92
>>> sdb        0.00  172.38  0.08    3.34    10.76   702.89    416.96      0.09  24.84   0.94   0.32
>>> dm-0       0.00    0.00  0.03    0.75     0.62     3.00      9.24      0.00   1.45   0.33   0.03
>>> dm-1       0.00    0.00  0.00    0.00     0.00     0.00      8.00      0.00   0.74   0.68   0.00
>>> dm-2       0.00    0.00  0.08  175.72    10.76   702.89      8.12      3.26  18.49   0.02   0.32
>>> dm-3       0.00    0.00  0.00    0.00     0.00     0.00      7.97      0.00   0.83   0.62   0.00
>>> dm-4       0.00    0.00  8.99  422.89  1141.87  1691.55     13.12      4.64  10.71   0.02   0.90
>>>
>>> -bash-4.1$ iostat -xkcd
>>> Linux 2.6.32-358.2.1.el6.x86_64 (ndc-epod014-dl380-1) 04/18/2013 _x86_64_ (24 CPU)
>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>            1.20   1.12     0.52     0.01    0.00  97.14
>>> Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svc
>>> sda        0.01  421.17  9.22    7.43  1167.81  1714.38    346.10      0.07   3.99  0.
>>> sdb        0.00  172.68  0.08    3.26    10.52   703.74    427.79      0.08  25.01  0.
>>> dm-0       0.00    0.00  0.04    1.04     0.89     4.16      9.34      0.00   2.58  0.
>>> dm-1       0.00    0.00  0.00    0.00     0.00     0.00      8.00      0.00   0.77  0.
>>> dm-2       0.00    0.00  0.08  175.93    10.52   703.74      8.12      3.13  17.78  0.
>>> dm-3       0.00    0.00  0.00    0.00     0.00     0.00      7.97      0.00   1.14  0.
>>> dm-4       0.00    0.00  9.19  427.55  1166.91  1710.21     13.18      4.67  10.65  0.
>>>
>>> -bash-4.1$ iostat -xkcd
>>> Linux 2.6.32-358.2.1.el6.x86_64 (edc-epod014-dl380-1) 04/18/2013 _x86_64_ (24 CPU)
>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>            1.15   1.13     0.52     0.01    0.00  97.19
>>> Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>> sda        0.02  429.97  9.28    7.29  1176.81  1749.00    353.12      0.07   4.10   0.55   0.91
>>> sdb        0.00  173.65  0.08    3.09    10.50   706.96    452.25      0.09  27.23   0.99   0.31
>>> dm-0       0.00    0.00  0.04    0.79     0.82     3.16      9.61      0.00   1.54   0.27   0.02
>>> dm-1       0.00    0.00  0.00    0.00     0.00     0.00      8.00      0.00   0.68   0.63   0.00
>>> dm-2       0.00    0.00  0.08  176.74    10.50   706.96      8.12      3.46  19.53   0.02   0.31
>>> dm-3       0.00    0.00  0.00    0.00     0.00     0.00      7.97      0.00   0.85   0.83   0.00
>>> dm-4       0.00    0.00  9.26  436.46  1175.98  1745.84     13.11
Re: lots of extra bytes on disk
Hi Ben,

If it's affordable, just blow away the node and bootstrap a replacement, or restore from a snapshot and repair.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Thursday, March 28, 2013 11:40:21 AM
Subject: Re: lots of extra bytes on disk

Oh, and since our LCS was 10MB per file it was easy to tell which files had not converted yet. Also, we ended up blowing away a CF on node 5 (of 6) and running a full repair on that CF, and afterwards it was at a normal size again as well.

Dean

On 3/28/13 12:35 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

We had a runaway STCS like this due to our own mistakes but were not sure how to clean it up. We went to LCS instead of STCS, and that seemed to bring it way back down, since the STCS had repeats and such between SSTables, which LCS mostly avoids. I can't help much more than that info though.

Dean

On 3/28/13 12:31 PM, Ben Chobot be...@instructure.com wrote:

Sorry to make it confusing. I didn't have snapshots on some nodes; I just made a snapshot on a node with this problem. So to be clear, on this one example node:
- Cassandra reports ~250GB of space used.
- In a CF data directory (before snapshots existed), du -sh showed ~550GB.
- After the snapshot, du in the same directory still showed ~550GB (they're hard links, so that's correct).
- du in the snapshot directory for that CF shows ~250GB, and ls shows ~50 fewer files.

On Mar 28, 2013, at 11:10 AM, Hiller, Dean wrote:

I am confused. I thought you said you don't have a snapshot. df/du reports space used by existing data AND the snapshot. Cassandra only reports on space used by actual data… if you move the snapshots, does df/du match what cassandra says?

Dean

On 3/28/13 12:05 PM, Ben Chobot be...@instructure.com wrote:

...though interestingly, the snapshots of these CFs have the right amount of data in them (i.e. it agrees with the live SSTable size reported by cassandra). Is it total insanity to remove the files from the data directory that are not included in the snapshot, so long as they were created before the snapshot?

On Mar 28, 2013, at 10:54 AM, Hiller, Dean wrote:

Have you cleaned up your snapshots… those take extra space and don't just go away unless you delete them.

Dean

On 3/28/13 11:46 AM, Ben Chobot be...@instructure.com wrote:

Are you also running 1.1.5? I'm wondering (ok, hoping) that this might be fixed if I upgrade.

On Mar 28, 2013, at 8:53 AM, Lanny Ripple wrote:

We occasionally (twice now on a 40 node cluster over the last 6-8 months) see this. My best guess is that Cassandra can somehow fail to mark an SSTable for cleanup. Forced GCs or reboots don't clear them out. We disable thrift and gossip; drain; snapshot; shut down; clear data/Keyspace/Table/*.db and restore (hard-linking back into place to avoid data transfer) from the just-created snapshot; restart.

On Mar 28, 2013, at 10:12 AM, Ben Chobot be...@instructure.com wrote:

Some of the cassandra nodes in my 1.1.5 cluster show a large discrepancy between what cassandra says the SSTables should sum up to and what df and du claim exists. During repairs this is almost always pretty bad, but post-repair compactions tend to bring those numbers to within a few percent of each other... usually. Sometimes they remain much further apart after compactions have finished. For instance, I'm looking at one node now that claims to have 205GB of SSTables but actually has 450GB of files living in that CF's data directory. No pending compactions, and the most recent compaction for this CF finished just a few hours ago. nodetool cleanup has no effect.

What could be causing these extra bytes, and how do I get them to go away? I'm ok with a few extra GB of unexplained data, but an extra 245GB (more than all the data this node is supposed to have!) is a little extreme.
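Before deciding anything about Ben's question of removing files that are not in the snapshot, it helps to see exactly which files those are and how much space they hold. The sketch below only lists them and deletes nothing. The function name and both directory arguments are placeholders, and matching files by name alone is an assumption that holds for the hard-linked snapshots described in this thread.

```python
from pathlib import Path


def files_missing_from_snapshot(data_dir, snapshot_dir):
    """List (name, size_in_bytes) for files in data_dir absent from snapshot_dir.

    Read-only: nothing is removed. Subdirectories (such as the snapshots
    directory itself) are ignored.
    """
    live = {p.name for p in Path(data_dir).iterdir() if p.is_file()}
    snapshotted = {p.name for p in Path(snapshot_dir).iterdir() if p.is_file()}
    extra = sorted(live - snapshotted)
    return [(name, (Path(data_dir) / name).stat().st_size) for name in extra]


if __name__ == "__main__":
    import sys
    # Usage: python missing.py <cf data directory> <snapshot directory>
    leftovers = files_missing_from_snapshot(sys.argv[1], sys.argv[2])
    for name, size in leftovers:
        print(f"{size:>15}  {name}")
    print(f"total bytes: {sum(size for _, size in leftovers)}")
```

If the total is close to the gap between du and what cassandra reports, that supports Lanny's guess that some SSTables were never marked for cleanup. The procedure he describes (drain, snapshot, shut down, clear, restore by hard link, restart) is the route reported as working in this thread.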
Re: bloom filter fp ratio of 0.98 with fp_chance of 0.01
sstables after changing the FP chance

Thanks!
Andras

---BeginMessage---

I'm still wondering about how to choose the size of the sstable under LCS. The default is 5MB, people usually configure it to 10MB, and now you configure it at 128MB. What are the benefits or disadvantages of a very small size (let's say 5MB) vs a big size (like 128MB)? This seems to be the biggest question about LCS, and it is still unanswered. Does anyone (committers maybe) know about it? It would help at least the 5 of us, but probably more people.

Alain

2013/3/8 Michael Theroux mthero...@yahoo.com

I've asked this myself in the past... I fairly arbitrarily chose 10MB based on Wei's experience.

-Mike

On Mar 8, 2013, at 1:50 PM, Hiller, Dean wrote:

+1 (I would love to know this info).

Dean

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Friday, March 8, 2013 11:11 AM
To: user@cassandra.apache.org
Subject: Re: Size Tiered - Leveled Compaction

I have the same question. We started with the default 5M, and the compaction after repair took too long on a 200G node, so we increased the size to 10M, sort of arbitrarily, since there is not much documentation around it. Our tech ops team still thinks there are too many files in one directory. To meet their guideline (I don't remember the exact number, but something in the range of 50K files), we will need to increase the size to around 50M. I think the latency of opening one file is not affected much by the number of files in one directory on a modern file system. But ls and other operations suffer.

Anyway, I asked about the side effects of a bigger SSTable on IRC, and someone mentioned that during a read, C* reads the whole SSTable from disk in order to access the row, which causes more disk IO than with a smaller SSTable. I don't know enough about the internals of Cassandra to say whether that is the case or not. If it is the case (with a question mark), is the SSTable or the row kept in memory? I hope someone can confirm the theory here. Otherwise I will have to dig into the source code to find out.

Another concern is repair: does it stream the whole SSTable or only part of it when a mismatch is detected? I have seen claims for both; can someone please confirm that as well?

The last thing is the effectiveness of parallel LCS on 1.2. On 1.1.X it takes quite some time for LCS compaction to finish after repair. Both CPU and disk utilisation are low during the compaction, which means LCS doesn't fully utilize the resources. It will make life easier if the issue is addressed in 1.2.

Bottom line is that there is not much documentation, guidance, or success stories around LCS, although it sounds beautiful on paper.

Thanks.

-Wei

From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org
Cc: Wei Zhu wz1...@yahoo.com
Sent: Friday, March 8, 2013 1:25 AM
Subject: Re: Size Tiered - Leveled Compaction

I'm still wondering about how to choose the size of the sstable under LCS. The default is 5MB, people usually configure it to 10MB, and now you configure it at 128MB. What are the benefits or drawbacks of a very small size (let's say 5MB) vs a big size (like 128MB)?

Alain

2013/3/8 Al Tobey a...@ooyala.com

We saw exactly the same thing as Wei Zhu: 100k tables in a directory causing all kinds of issues. We're running 128MiB SSTables with LCS and have disabled compaction throttling. 128MiB was chosen to get file counts under control and reduce the number of files C* has to manage and search. I just looked, and a ~250GiB node is using about 10,000 files, which is quite manageable. This configuration is running smoothly in production under mixed read/write load.

We're on RAID0 across 6 15k drives per machine. When we migrated data to this cluster we were pushing well over 26k inserts/s with CL_QUORUM. With compaction throttling enabled at any rate, it just couldn't keep up. With throttling off, it runs smoothly and does not appear to have an impact on our applications, so we always leave it off, even in EC2. An 8GiB heap is too small for this config on 1.1. YMMV.

-Al Tobey

On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu wz1...@yahoo.com wrote:

I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost
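As a rough check on the file counts mentioned in this thread, the number of files in a column family directory under LCS can be estimated from the data size and the SSTable size. The figure of five component files per SSTable is an assumption chosen because it reproduces Al Tobey's "about 10,000 files" on a ~250GiB node at 128MiB. The real number depends on version and settings such as compression.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def estimated_file_count(data_bytes, sstable_bytes, components_per_sstable=5):
    """Estimate files on disk: SSTable count times component files per SSTable.

    components_per_sstable=5 is an assumed value, not a documented constant.
    """
    sstables = math.ceil(data_bytes / sstable_bytes)
    return sstables * components_per_sstable


if __name__ == "__main__":
    # Al Tobey's case: ~250GiB per node with 128MiB SSTables.
    print(estimated_file_count(250 * GIB, 128 * MIB))
    # Wei's 200G node at the sizes discussed: 5M, 10M, 50M.
    for size_mib in (5, 10, 50):
        print(size_mib, estimated_file_count(200 * GIB, size_mib * MIB))
```

Under the same assumption, a 200G node at 10M comes out at roughly 100,000 files, which is in line with the "100k tables in a directory" that Al describes, and 50M brings it down to about 20,000.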
Re: nodetool repair hung?
Check nodetool tpstats and look for AntiEntropySessions/AntiEntropyStage. Grep the log and look for "repair" and "merkle tree".

- Original Message -
From: S C as...@outlook.com
To: user@cassandra.apache.org
Sent: Monday, March 25, 2013 2:55:30 PM
Subject: nodetool repair hung?

I am using Cassandra 1.1.5. nodetool repair is not coming back on the command line. Did it run successfully? Did it hang? How do you find out whether the repair was successful? I did not find anything in the logs. nodetool compactionstats and nodetool netstats are clean.

nodetool compactionstats
pending tasks: 0
Active compaction remaining time : n/a

nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Pool Name    Active   Pending   Completed
Commands        n/a         0   121103621
Responses       n/a         0   209564496
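The log check suggested in the reply above amounts to a case-insensitive search of the system log for repair-related keywords. A small sketch follows, for hosts where a script is handier than grep. The keyword list and function name are illustrative, and the log path in the usage example is an assumption about a typical install rather than something stated in this thread.

```python
import re
from typing import Iterable, Iterator

# Keywords from the advice above; matching is case-insensitive.
REPAIR_KEYWORDS = re.compile(r"repair|merkle", re.IGNORECASE)


def repair_related_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that mention repair or merkle trees."""
    for line in lines:
        if REPAIR_KEYWORDS.search(line):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Assumed default location; adjust to wherever system.log lives.
    with open("/var/log/cassandra/system.log", errors="replace") as log:
        for hit in repair_related_lines(log):
            print(hit)
```

If the last matching line is old and tpstats shows nothing active or pending for the anti-entropy pools, that suggests the repair is no longer making progress.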
Re: High disk I/O during reads
According to your cfstats, read latency is over 100 ms, which is really, really slow. I am seeing reads of less than 3ms on my cluster, which is on SSD. Can you also check nodetool cfhistograms? It tells you more about the number of SSTables involved and the read/write latency. Sometimes the average doesn't tell you the whole story. Also check your nodetool tpstats: are there a lot of dropped reads?

-Wei

- Original Message -
From: Jon Scarborough j...@fifth-aeon.net
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our case. Most reads would probably only need to look at a few SSTables; a few might need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine even the more expensive use cases would involve something like this:
1) Check the index for each SSTable to determine if part of the row is there.
2) Look at the endpoints of the slice to determine if the data in a particular SSTable is relevant to the query.
3) Read the chunks of those SSTables, working backwards from the end of the slice until enough columns have been read to satisfy the limit clause in the query.

So I would have guessed that even the more expensive queries on wide rows typically wouldn't need to read more than a few hundred KB from disk to do all that. Seems like I'm missing something major.
Here's the complete CF definition, including compression settings: CREATE COLUMNFAMILY conversation_text_message ( conversation_key bigint PRIMARY KEY ) WITH comment='' AND comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND read_repair_chance=0.10 AND gc_grace_seconds=864000 AND default_validation=text AND min_compaction_threshold=4 AND max_compaction_threshold=32 AND replicate_on_write=True AND compaction_strategy_class='SizeTieredCompactionStrategy' AND compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor'; Much thanks for any additional ideas. -Jon On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Did you mean to ask are 'all' your keys spread across all SSTables? I am guessing at your intention. I mean I would very well hope my keys are spread across all sstables or otherwise that sstable should not be there as he has no keys in it ;). And I know we had HUGE disk size from the duplication in our sstables on size-tiered compaction….we never ran a major compaction but after we switched to LCS, we went from 300G to some 120G or something like that which was nice. We only have 300 data point posts / second so not an extreme write load on 6 nodes as well though these posts causes read to check authorization and such of our system. Dean From: Kanwar Sangha kan...@mavenir.com mailto: kan...@mavenir.com Reply-To: user@cassandra.apache.org mailto: user@cassandra.apache.org user@cassandra.apache.org mailto: user@cassandra.apache.org Date: Friday, March 22, 2013 8:38 AM To: user@cassandra.apache.org mailto: user@cassandra.apache.org user@cassandra.apache.org mailto: user@cassandra.apache.org Subject: RE: High disk I/O during reads Are your Keys spread across all SSTables ? That will cause every sstable read which will increase the I/O. 
What compaction are you using?

From: zod...@fifth-aeon.net [mailto: zod...@fifth-aeon.net] On Behalf Of Jon Scarborough Sent: 21 March 2013 23:00 To: user@cassandra.apache.org Subject: High disk I/O during reads

Hello, We've had a 5-node C* cluster (version 1.1.0) running for several months. Up until now we've mostly been writing data, but now we're starting to service more read traffic. We're seeing far more disk I/O to service these reads than I would have anticipated. The CF being queried consists of chat messages. Each row represents a conversation between two people. Each column represents a message. The column key is composite, consisting of the message date and a few other bits of information. The CF is using compression. The query is looking for a maximum of 50 messages between two dates, in reverse order. Usually the two dates used as endpoints are 30 days ago and the current time. The query in Astyanax looks like this:

ColumnList<ConversationTextMessageKey> result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
    .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
    .getKey(conversationKey)
    .withColumnRange(
        textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
        textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
        true, maxMessages)
    .execute()
    .getResult();

We're currently servicing around 30 of
Re: cannot start Cassandra on Windows7
It's there: http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning#node-init-config It's a long document. You need to look at cassandra.yaml and cassandra-env.sh and make sure you understand the settings there. By the way, did DataStax just facelift their documentation web site? It looks nice. -Wei

- Original Message - From: Marina ppi...@yahoo.com To: user@cassandra.apache.org Sent: Friday, March 22, 2013 9:18:46 AM Subject: Re: cannot start Cassandra on Windows7

Jabbar Azam ajazam at gmail.com writes: Oops, I also had opscenter installed on my PC. My changes:

log4j-server.properties file
log4j.appender.R.File=c:/var/log/cassandra/system.log

cassandra.yaml file
# directories where Cassandra should store data on disk.
data_file_directories:
    - c:/var/lib/cassandra/data
# commit log
commitlog_directory: c:/var/lib/cassandra/commitlog
# saved caches
saved_caches_directory: c:/var/lib/cassandra/saved_caches

I also added an environment variable for windows called CASSANDRA_HOME. I needed to do this for one of my colleagues and now it's documented ;)

Thanks, Jabbar, Victor, Yes, after I made similar changes I was able to start Cassandra too. Would be nice if these instructions were included with the main Cassandra documentation/WIKI :) Thanks! Marina

On 22 March 2013 15:47, Jabbar Azam ajazam at gmail.com wrote: Viktor, you're right. I didn't get any errors on my windows console but cassandra.yaml and log4j-server.properties need modifying.

On 22 March 2013 15:44, Viktor Jevdokimov Viktor.Jevdokimov at adform.com wrote: You NEED to edit cassandra.yaml and log4j-server.properties paths before starting on Windows. There're a LOT of things to learn for starters. Google for Cassandra on Windows. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: Viktor.Jevdokimov at adform.com Phone: +370 5 212 3063 Fax: +370 5 261 0453 J.
Jasinskio 16C, LT-01112 Vilnius, Lithuania Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.

-Original Message- From: Marina [mailto:ppine7 at yahoo.com] Sent: Friday, March 22, 2013 17:21 To: user at cassandra.apache.org Subject: cannot start Cassandra on Windows7

Hi, I have downloaded apache-cassandra-1.2.3-bin.tar.gz and unzipped it on my Windows7 machine (I did not find a Windows-specific distributable...). Then I tried to start Cassandra as follows and got an error:

C:\Marina\Tools\apache-cassandra-1.2.3\bin>cassandra.bat -f
Starting Cassandra Server
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Couldn't figure out log4j configuration: log4j-server.properties
        at org.apache.cassandra.service.CassandraDaemon.initLog4j(CassandraDaemon.java:81)
        at org.apache.cassandra.service.CassandraDaemon.<clinit>(CassandraDaemon.java:57)
Could not find the main class: org.apache.cassandra.service.CassandraDaemon. Program will exit.
C:\Marina\Tools\apache-cassandra-1.2.3\bin>

It looks similar to the Cassandra issue that was already fixed: https://issues.apache.org/jira/browse/CASSANDRA-2383 However, I am still getting this error. I am an Administrator on my machine, and have access to all files in the apache-cassandra-1.2.3\conf dir, including the log4j ones. Do I need to configure anything else on Windows? I did not find any Windows-specific installation/setup/startup instructions. If there are such documents somewhere, please let me know!

Thanks, Marina

-- Thanks Jabbar Azam
-- Thanks Jabbar Azam
Re: Stream fails during repair, two nodes out-of-memory
compaction needs some disk I/O. Slowing down our compaction will improve overall system performance. Of course, you don't want to go too slow and fall too far behind. -Wei

- Original Message - From: Dane Miller d...@optimalsocial.com To: user@cassandra.apache.org Cc: Wei Zhu wz1...@yahoo.com Sent: Friday, March 22, 2013 4:12:56 PM Subject: Re: Stream fails during repair, two nodes out-of-memory

On Thu, Mar 21, 2013 at 10:28 AM, aaron morton aa...@thelastpickle.com wrote: heap of 1867M is kind of small. According to the discussion on this list, it's advisable to have m1.xlarge. +1 In cassandra-env.sh set MAX_HEAP_SIZE to 4GB and HEAP_NEWSIZE to 400M. In the yaml file set in_memory_compaction_limit_in_mb to 32, compaction_throughput_mb_per_sec to 8, and concurrent_compactors to 2. This will slow down compaction a lot. You may want to restore some of these settings once you have things stable. You have an underpowered box for what you are trying to do.

Thanks very much for the info. Have made the changes and am retrying. I'd like to understand, why does it help to slow compaction? It does seem like the cluster is underpowered to handle our application's full write load plus repairs, but it operates fine otherwise.

On Wed, Mar 20, 2013 at 8:47 PM, Wei Zhu wz1...@yahoo.com wrote: It's clear you are out of memory. How big is your data size?

120 GB per node, of which 50% is actively written/updated, and 50% is read-mostly. Dane
Re: Stream fails during repair, two nodes out-of-memory
;) ). One obvious reason is administrating a 24 node cluster does add person-time overhead. Another reason includes less impact of maintenance activities such as repair, as these activites have significant CPU overhead. Doubling the cluster size would, in theory, halve the time for this overhead, but would still impact performance during that time. Going to xlarge would lessen the impact of these activities on operations. Anything else? Thanks, -Mike On Mar 14, 2013, at 9:27 AM, aaron morton wrote: Because of this I have an unstable cluster and have no other choice than use Amazon EC2 xLarge instances when we would rather use twice more EC2 Large nodes. m1.xlarge is a MUCH better choice than m1.large. You get more ram and better IO and less steal. Using half as many m1.xlarge is the way to go. My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max 8 GB (crashing the node). How is it crashing ? Are you getting too much GC or running OOM ? Are you using the default GC configuration ? Is cassandra logging a lot of GC warnings ? If you are running OOM then something has to change. Maybe bloom filters, maybe caches. Enable the GC logging in cassandra-env.sh to check how low a CMS compaction get's the heap, or use some other tool. That will give an idea of how much memory you are using. Here is some background on what is kept on heap in pre 1.2 http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 13/03/2013, at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote: Here is the JIRA I submitted regarding the ancestor. 
https://issues.apache.org/jira/browse/CASSANDRA-5342 -Wei - Original Message - From: Wei Zhu wz1...@yahoo.com To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:35:29 AM Subject: Re: About the heap Hi Dean, The index_interval is controlling the sampling of the SSTable to speed up the lookup of the keys in the SSTable. Here is the code: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478 To increase the interval meaning, taking less samples, less memory, slower lookup for read. I did do a heap dump on my production system which caused about 10 seconds pause of the node. I found something interesting, for LCS, it could involve thousands of SSTables for one compaction, the ancestors are recorded in case something goes wrong during the compaction. But those are never removed after the compaction is done. In our case, it takes about 1G of heap memory to store that. I am going to submit a JIRA for that. Here is the culprit: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58 Enjoy looking at Cassandra code:) -Wei - Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:11:14 AM Subject: Re: About the heap Going to 1.2.2 helped us quite a bit as well as turning on LCS from STCS which gave us smaller bloomfilters. As far as key cache. There is an entry in cassandra.yaml called index_interval set to 128. I am not sure if that is related to key_cache. I think it is. By turning that to 512 or maybe even 1024, you will consume less ram there as well though I ran this test in QA and my key cache size stayed the same so I am really not sure(I am actually checking out cassandra code now to dig a little deeper into this property. 
Dean From: Alain RODRIGUEZ arodr...@gmail.commailto:arodr...@gmail.com Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Wednesday, March 13, 2013 10:11 AM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: About the heap Hi, I would like to know everything that is in the heap. We are here speaking of C*1.1.6 Theory : - Memtable (1024 MB) - Key Cache (100 MB) - Row Cache (disabled, and serialized with JNA activated anyway, so should be off-heap) - BloomFilters (about 1,03 GB - from cfstats, adding all the Bloom Filter Space Used and considering they are showed in Bytes - 1103765112) - Anything else ? So my heap should be fluctuating between 1,15 GB and 2.15 GB and growing slowly (from the new BF of my new data). My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max 8 GB (crashing the node). Because of this I have an unstable cluster and have no other choice than use Amazon EC2 xLarge instances when we would rather use twice more EC2 Large nodes. What am I missing ? Practice : Is there a way not inducing any load and easy
Re: How to configure linux service for Cassandra
Are you looking for something like this? http://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-services-chkconfig.html Thanks. -Wei

- Original Message - From: Jason Kushmaul | WDA jason.kushm...@wda.com To: user@cassandra.apache.org Sent: Tuesday, March 19, 2013 5:58:37 AM Subject: RE: How to configure linux service for Cassandra

I'm not sure about the CentOS version, but you could make use of the hard work that DataStax has done with their community edition RPMs: an init script is installed for you.

-Original Message- From: Roshan [mailto:codeva...@gmail.com] Sent: Tuesday, March 19, 2013 7:10 AM To: cassandra-u...@incubator.apache.org Subject: How to configure linux service for Cassandra

Hi, I want to start Cassandra as a service. At the moment it is starting as a background task. Cassandra version: 1.0.11 OS: CentOS 5.X Any help is much appreciated. Thanks.
Re: Recovering from a faulty cassandra node
Hi Dean, If you are not using vnodes and are trying to replace a node, set the new token to the old token - 1, not + 1. The reason is that token assignment goes clockwise along the ring. If you set the new token to old token - 1, the new node will take over all the data of the old node except for the one token that was assigned to the old node. If you set the new token to old token + 1, the new node will only stream the data of one token. So as a good practice, don't use 0 as a node token; start with 100. It's easier to go down from 100 than from 0 (where you need to calculate 2^127 - 1). Hope I didn't confuse you. -Wei

- Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Tuesday, March 19, 2013 8:25:25 AM Subject: Re: Recovering from a faulty cassandra node

I have not done this yet, but from all that I have read your best option is to follow the replace node documentation, in which I believe you need to

1. Have the token be the same BUT add 1 to it so it doesn't think it's the same computer
2. Have the bootstrap option set or something so streaming takes effect.

I would however test all of that in QA to make sure it works, and if you have QUORUM reads/writes a good part of that test would be to take node X down after your node Y is back in the cluster to make sure reads/writes are working on the node you fixed…..you just need to make sure node X shares one of the token ranges of node Y AND your writes/reads are in that token range. Dean

From: Jabbar Azam aja...@gmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, March 19, 2013 8:51 AM To: user@cassandra.apache.org Subject: Recovering from a faulty cassandra node

Hello, I am using Cassandra 1.2.2 on a 4 node test cluster with vnodes. I waited for over a week to insert lots of data into the cluster. Towards the end of the process one of the nodes had a hardware fault. I have fixed the hardware fault but the file system on that node is corrupt, so I'll have to reinstall the OS and Cassandra. I can think of two ways of reintegrating the host into the cluster:

1) shrink the cluster to three nodes and add the node into the cluster
2) add the node into the cluster without shrinking

I'm not sure of the best approach to take and I'm not sure how to achieve each step. Can anybody help? -- Thanks Jabbar Azam
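The "old token - 1" rule from Wei's reply, including the wrap-around below 0, can be written down as a small sketch. It assumes a single-token (non-vnode) setup and a ring of 2^127 positions numbered 0 to 2^127 - 1, following the figure quoted in the message; replacement_token and RING_SIZE are names made up for this illustration.

```python
RING_SIZE = 2 ** 127  # assumption: tokens run from 0 to 2**127 - 1


def replacement_token(old_token, ring_size=RING_SIZE):
    """Token just before old_token on the ring, wrapping around below 0."""
    if not 0 <= old_token < ring_size:
        raise ValueError("token outside the ring")
    return (old_token - 1) % ring_size


print(replacement_token(100))                  # 99
print(replacement_token(0) == 2 ** 127 - 1)    # True
```

The second call shows why starting a ring at 100 instead of 0 is more convenient by hand: stepping back from 0 lands on the largest token of the ring.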
Re: Truncate behaviour
There is a setting in the cassandra.yaml file which controls that.

# Whether or not a snapshot is taken of the data before keyspace truncation
# or dropping of column families. The STRONGLY advised default of true
# should be used to provide data safety. If you set this flag to false, you will
# lose data on truncation or drop.
auto_snapshot: true

- Original Message - From: Víctor Hugo Oliveira Molinar vhmoli...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, March 19, 2013 11:50:35 AM Subject: Truncate behaviour

Hello guys! I'm researching the behaviour of truncate operations in Cassandra. Reading the official wiki page ( http://wiki.apache.org/cassandra/API ) we can understand it as: Removes all the rows from the given column family. And reading the DataStax page ( http://www.datastax.com/docs/1.0/references/cql/TRUNCATE ) we can understand it as: A TRUNCATE statement results in the immediate, irreversible removal of all data in the named column family. But I think there is an important point missing about truncate operations. At least in version 1.2.0, whenever I run a truncate operation, C* automatically creates a snapshot of the column family, resulting in fake free disk space. I'm intentionally saying 'fake free disk space' because I only figured it out when the machine's disk usage was high.

- Is creating a snapshot of each CF before a truncate operation a safety behaviour of C*?
- In my scenario I need to purge my column family data every day. Based on the docs, I thought that truncate could handle it. But it doesn't. And since I don't want to manually delete those snapshots, I'd like to know if there is a safe and practical way to perform a daily purge of this CF data.

Thanks in advance!
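To see how much space the automatic snapshots are holding, something like the following sketch can help. It assumes the on-disk layout where each column family directory contains a snapshots subdirectory; snapshot_usage is a name invented for this example, and the data directory you pass in depends on your data_file_directories setting.

```python
import os


def snapshot_usage(data_dir):
    """Map each 'snapshots' directory under data_dir to its total size in bytes."""
    usage = {}
    for root, dirs, _files in os.walk(data_dir):
        if "snapshots" in dirs:
            snap_dir = os.path.join(root, "snapshots")
            total = 0
            for sub_root, _sub_dirs, sub_files in os.walk(snap_dir):
                for name in sub_files:
                    total += os.path.getsize(os.path.join(sub_root, name))
            usage[snap_dir] = total
            dirs.remove("snapshots")  # already counted, do not descend again
    return usage


# Example use (the path is an assumption, adjust to your installation):
# for path, size in sorted(snapshot_usage("/var/lib/cassandra/data").items()):
#     print(f"{size / 1024 ** 2:10.1f} MB  {path}")
```

Snapshot files are hard links to SSTables, so they only cost extra space once the live files are gone, which is exactly the situation after a truncate. Removing them with nodetool clearsnapshot after the daily truncate is one option; setting auto_snapshot to false is the other, at the price of the safety net described in the yaml comment.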
Re: 13k pending compaction tasks but ZERO running?
Did you restart the node? As far as I can tell, compactions start a few minutes after restarting. Did you see a file called $CFName.json ($CFName is your cf name) in your data directory? -Wei

- Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Thursday, March 14, 2013 8:36:00 AM Subject: Re: 13k pending compaction tasks but ZERO running?

Duh me. I forgot to mention I ran nodetool compact k cf and it was done in 45 seconds and I still had a 36G file (darn, I am usually better about putting the detail in my emails...I thought I added that). I also went into JMX and then did it there. I ran it again and it took 15 seconds. My script is as follows so that I know the times...it's too bad nodetool doesn't log times for all the tools it has. If I have time, I should just do a pull request on that.

#!/bin/bash
date >> compacttimes.txt
nodetool compact databus5 nreldata >> compacttimes.txt
date >> compacttimes.txt

Thanks, Dean

On 3/14/13 9:26 AM, Michael Theroux mthero...@yahoo.com wrote: Hi Dean, I saw the same behavior when we switched from STCS to LCS on a couple of our tables. Not sure why it doesn't proceed immediately (I pinged the list, but didn't get any feedback). However, running nodetool compact keyspace table got things moving for me. -Mike

On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote: How do I get my node to run through the 13k pending compaction tasks? I had to use iptables to take the ring out of the cluster for now and he is my only node still on STCS. In cassandra-cli, it shows LCS but on disk, I see a 36Gig file (i.e. must be STCS still). How can I get the 13k pending tasks to start running?

nodetool compactionstats
...
pending tasks: 13793
Active compaction remaining time : n/a

Thanks, Dean
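Dean's timing script can also be done as a small wrapper that records the elapsed time directly. This is a sketch, not something nodetool provides: run_timed is a made-up helper, and the log file name simply mirrors the compacttimes.txt used in the message.

```python
import subprocess
import time
from datetime import datetime


def run_timed(command, log_path):
    """Run command and append start time, output, end time and duration to log_path."""
    started = datetime.now()
    t0 = time.monotonic()
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    elapsed = time.monotonic() - t0
    with open(log_path, "a") as log:
        log.write(f"{started.isoformat()} start {' '.join(command)}\n")
        log.write(result.stdout)
        log.write(
            f"{datetime.now().isoformat()} end rc={result.returncode} "
            f"elapsed={elapsed:.1f}s\n"
        )
    return result.returncode, elapsed


# Example use, with the keyspace and column family from the message:
# run_timed(["nodetool", "compact", "databus5", "nreldata"], "compacttimes.txt")
```

A compaction that returns in seconds while thousands of tasks stay pending, as described above, is then easy to spot in the log.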
Re: 13k pending compaction tasks but ZERO running?
No problem. Back to the old trick: if it doesn't work, restart :)

From: Hiller, Dean dean.hil...@nrel.gov To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Thursday, March 14, 2013 9:53 AM Subject: Re: 13k pending compaction tasks but ZERO running?

Ah, you are a lifesaver. I was so used to keeping the nodes always up. That worked. It is finally taking effect. Thanks, Dean

On 3/14/13 10:47 AM, Wei Zhu wz1...@yahoo.com wrote: Did you restart the node? As far as I can tell, compactions start a few minutes after restarting. Did you see a file called $CFName.json ($CFName is your cf name) in your data directory? -Wei

- Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Thursday, March 14, 2013 8:36:00 AM Subject: Re: 13k pending compaction tasks but ZERO running?

Duh me. I forgot to mention I ran nodetool compact k cf and it was done in 45 seconds and I still had a 36G file (darn, I am usually better about putting the detail in my emails...I thought I added that). I also went into JMX and then did it there. I ran it again and it took 15 seconds. My script is as follows so that I know the times...it's too bad nodetool doesn't log times for all the tools it has. If I have time, I should just do a pull request on that.

#!/bin/bash
date >> compacttimes.txt
nodetool compact databus5 nreldata >> compacttimes.txt
date >> compacttimes.txt

Thanks, Dean

On 3/14/13 9:26 AM, Michael Theroux mthero...@yahoo.com wrote: Hi Dean, I saw the same behavior when we switched from STCS to LCS on a couple of our tables. Not sure why it doesn't proceed immediately (I pinged the list, but didn't get any feedback). However, running nodetool compact keyspace table got things moving for me. -Mike

On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote: How do I get my node to run through the 13k pending compaction tasks? I had to use iptables to take the ring out of the cluster for now and he is my only node still on STCS. In cassandra-cli, it shows LCS but on disk, I see a 36Gig file (i.e. must be STCS still). How can I get the 13k pending tasks to start running?

nodetool compactionstats
...
pending tasks: 13793
Active compaction remaining time : n/a

Thanks, Dean
Re: About the heap
Hi Dean, The index_interval controls the sampling of the SSTable index, which speeds up the lookup of keys in the SSTable. Here is the code: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478 Increasing the interval means taking fewer samples: less memory, but slower lookups on reads. I did do a heap dump on my production system, which caused about a 10 second pause of the node. I found something interesting: with LCS, one compaction can involve thousands of SSTables, and the ancestors are recorded in case something goes wrong during the compaction. But those are never removed after the compaction is done. In our case, it takes about 1G of heap memory to store that. I am going to submit a JIRA for that. Here is the culprit: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58 Enjoy looking at Cassandra code :) -Wei

- Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:11:14 AM Subject: Re: About the heap

Going to 1.2.2 helped us quite a bit, as did switching from STCS to LCS, which gave us smaller bloom filters. As far as key cache goes, there is an entry in cassandra.yaml called index_interval set to 128. I am not sure if that is related to key_cache. I think it is. By turning that up to 512 or maybe even 1024, you will consume less RAM there as well, though I ran this test in QA and my key cache size stayed the same, so I am really not sure (I am actually checking out the Cassandra code now to dig a little deeper into this property).
Dean

From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, March 13, 2013 10:11 AM To: user@cassandra.apache.org Subject: About the heap

Hi, I would like to know everything that is in the heap. We are speaking here of C* 1.1.6.

Theory:
- Memtable (1024 MB)
- Key Cache (100 MB)
- Row Cache (disabled, and serialized with JNA activated anyway, so should be off-heap)
- BloomFilters (about 1.03 GB, from cfstats, adding up all the Bloom Filter Space Used values and assuming they are shown in bytes: 1103765112)
- Anything else?

So my heap should be fluctuating between 1.15 GB and 2.15 GB and growing slowly (from the new BF of my new data). My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max 8 GB (crashing the node). Because of this I have an unstable cluster and have no other choice than to use Amazon EC2 xLarge instances when we would rather use twice as many EC2 Large nodes. What am I missing?

Practice: Is there a way to dump the heap that is easy to do and does not induce any load, so I can analyse it with MAT (or anything else that you could advise)?

Alain
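Alain's back-of-the-envelope estimate can be reproduced in a few lines. This sketch only adds up the three components listed in the message, using binary units (1 GB = 1024^3 bytes), which is an assumption; with these units the range comes out slightly below the 1.15 to 2.15 GB quoted.

```python
MB = 1024 ** 2
GB = 1024 ** 3

memtable_bytes = 1024 * MB          # memtable space from the message
key_cache_bytes = 100 * MB          # key cache from the message
bloom_filter_bytes = 1103765112     # sum of "Bloom Filter Space Used" from cfstats


def heap_estimate(memtable_fill):
    """Expected heap use in bytes for a memtable fill ratio between 0.0 and 1.0."""
    return key_cache_bytes + bloom_filter_bytes + memtable_fill * memtable_bytes


low = heap_estimate(0.0) / GB
high = heap_estimate(1.0) / GB
print(f"bloom filters: {bloom_filter_bytes / GB:.2f} GB")
print(f"expected heap: {low:.2f} GB to {high:.2f} GB")
```

Either way the estimate stays near 2 GB at most, so the observed 3 to 8 GB has to come from something not on the list, which is what the replies in this thread go on to discuss (index samples and the SSTable ancestor metadata).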
Re: About the heap
It's not the BloomFilter. Quoting the wiki: Cassandra will read through sstable index files on start-up, doing what is known as index sampling. This is used to keep a subset (currently and by default, 1 out of 100) of keys and their on-disk location in the index, in memory. See ArchitectureInternals. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue. http://wiki.apache.org/cassandra/LargeDataSetConsiderations -Wei

- Original Message - From: Alain RODRIGUEZ arodr...@gmail.com To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:28:28 AM Subject: Re: About the heap

"called index_interval set to 128": I think this is for BloomFilters actually.

2013/3/13 Hiller, Dean dean.hil...@nrel.gov Going to 1.2.2 helped us quite a bit, as did switching from STCS to LCS, which gave us smaller bloom filters. As far as key cache goes, there is an entry in cassandra.yaml called index_interval set to 128. I am not sure if that is related to key_cache. I think it is. By turning that up to 512 or maybe even 1024, you will consume less RAM there as well, though I ran this test in QA and my key cache size stayed the same, so I am really not sure (I am actually checking out the Cassandra code now to dig a little deeper into this property). Dean

From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, March 13, 2013 10:11 AM To: user@cassandra.apache.org Subject: About the heap

Hi, I would like to know everything that is in the heap.
We are here speaking of C*1.1.6 Theory : - Memtable (1024 MB) - Key Cache (100 MB) - Row Cache (disabled, and serialized with JNA activated anyway, so should be off-heap) - BloomFilters (about 1,03 GB - from cfstats, adding all the Bloom Filter Space Used and considering they are showed in Bytes - 1103765112) - Anything else ? So my heap should be fluctuating between 1,15 GB and 2.15 GB and growing slowly (from the new BF of my new data). My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max 8 GB (crashing the node). Because of this I have an unstable cluster and have no other choice than use Amazon EC2 xLarge instances when we would rather use twice more EC2 Large nodes. What am I missing ? Practice : Is there a way not inducing any load and easy to do to dump the heap to analyse it with MAT (or anything else that you could advice) ? Alain
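The trade-off behind index sampling (fewer samples means less memory but a longer scan per lookup) can be shown with a toy model. This is not Cassandra's implementation: sample_index and lookup are invented for the illustration, and plain sorted strings stand in for the on-disk index entries.

```python
import bisect


def sample_index(sorted_keys, index_interval=128):
    """Keep every index_interval-th key together with its position."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), index_interval)]


def lookup(sorted_keys, samples, key):
    """Jump to the nearest sample at or before key, then scan at most one interval."""
    sample_keys = [k for k, _ in samples]
    slot = bisect.bisect_right(sample_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first sampled key
    start = samples[slot][1]
    end = samples[slot + 1][1] if slot + 1 < len(samples) else len(sorted_keys)
    for pos in range(start, end):
        if sorted_keys[pos] == key:
            return pos
    return None


keys = [f"key{n:06d}" for n in range(10_000)]
for interval in (128, 512, 1024):
    samples = sample_index(keys, interval)
    # Fewer samples are held in memory as the interval grows.
    print(interval, len(samples), lookup(keys, samples, "key004242"))
```

With 10,000 keys the sample count drops from 79 to 20 to 10 as the interval goes from 128 to 512 to 1024, while each lookup may have to scan up to one full interval of entries.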
Re: repair hangs
Do you see anything related to merkle tree in your log? Also run nodetool compactionstats; during merkle tree calculation you will see a validation task there. -Wei

- Original Message - From: Dane Miller d...@optimalsocial.com To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 10:54:50 AM Subject: repair hangs

Hi, On one of my nodes, nodetool repair -pr has been running for 48 hours and appears to be hung, with no output and no AntiEntropy messages in system.log for 40+ hours. Load, cpu, etc are all near zero. There are no other repair jobs running in my cluster. What's the recommended way to deal with a hung repair job? Is it the symptom of a larger problem? More info follows...

On the node where the repair is running/hung, nodetool tpstats shows 1 Active and 1 Pending AntiEntropySessions. nodetool netstats reports: Not sending any streams. Not receiving any streams.

I created this cluster by copying and restoring snapshots from another cluster. The new cluster has the same number of nodes and same tokens as the original. However, the rack assignment is different: the new cluster uses a single rack, the original cluster uses multiple racks. The replication strategy is SimpleStrategy for all keyspaces.

Details:
6 node cluster
cassandra 1.2.2
RandomPartitioner, EC2Snitch
Ubuntu 12.04 x86_64
EC2 m1.large

Thanks, Dane
Re: About the heap
Here is the JIRA I submitted regarding the ancestor. https://issues.apache.org/jira/browse/CASSANDRA-5342 -Wei - Original Message - From: Wei Zhu wz1...@yahoo.com To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:35:29 AM Subject: Re: About the heap Hi Dean, The index_interval is controlling the sampling of the SSTable to speed up the lookup of the keys in the SSTable. Here is the code: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478 To increase the interval meaning, taking less samples, less memory, slower lookup for read. I did do a heap dump on my production system which caused about 10 seconds pause of the node. I found something interesting, for LCS, it could involve thousands of SSTables for one compaction, the ancestors are recorded in case something goes wrong during the compaction. But those are never removed after the compaction is done. In our case, it takes about 1G of heap memory to store that. I am going to submit a JIRA for that. Here is the culprit: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58 Enjoy looking at Cassandra code:) -Wei - Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:11:14 AM Subject: Re: About the heap Going to 1.2.2 helped us quite a bit as well as turning on LCS from STCS which gave us smaller bloomfilters. As far as key cache. There is an entry in cassandra.yaml called index_interval set to 128. I am not sure if that is related to key_cache. I think it is. By turning that to 512 or maybe even 1024, you will consume less ram there as well though I ran this test in QA and my key cache size stayed the same so I am really not sure(I am actually checking out cassandra code now to dig a little deeper into this property. 
Dean From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, March 13, 2013 10:11 AM To: user@cassandra.apache.org Subject: About the heap
Hi, I would like to know everything that is in the heap. We are speaking here of C* 1.1.6.
Theory:
- Memtable (1024 MB)
- Key Cache (100 MB)
- Row Cache (disabled, and serialized with JNA activated anyway, so it should be off-heap)
- BloomFilters (about 1.03 GB, from cfstats, adding all the Bloom Filter Space Used values and assuming they are shown in bytes: 1103765112)
- Anything else?
So my heap should be fluctuating between 1.15 GB and 2.15 GB and growing slowly (from the new BF of my new data). My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max of 8 GB (crashing the node). Because of this I have an unstable cluster and have no choice but to use Amazon EC2 xLarge instances when we would rather use twice as many EC2 Large nodes. What am I missing?
Practice: Is there an easy way, one that induces no load, to dump the heap and analyse it with MAT (or anything else that you could advise)? Alain
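For reference, the heap estimate in the message above can be redone as a small calculation. This is only a sketch built from the figures quoted in the thread (memtable limit, key cache size, summed bloom filter bytes); the `index_sample_count` helper and any key count passed to it are illustrative assumptions, based on the description earlier in the thread that index_interval controls how often SSTable keys are sampled. It is not a model of everything Cassandra keeps on heap.

```python
# Rough on-heap budget from the figures quoted in the message above.
MIB = 1024 ** 2
GIB = 1024 ** 3

memtable_mib = 1024               # memtable limit quoted in the message
key_cache_mib = 100               # key cache size quoted in the message
bloom_filter_bytes = 1103765112   # summed "Bloom Filter Space Used" from cfstats

bloom_filter_mib = bloom_filter_bytes / MIB

# Lower bound: memtables just flushed (empty). Upper bound: memtables full.
low_gib = (key_cache_mib + bloom_filter_mib) * MIB / GIB
high_gib = (memtable_mib + key_cache_mib + bloom_filter_mib) * MIB / GIB


def index_sample_count(total_keys, index_interval=128):
    """Approximate number of sampled index entries: one per index_interval keys.

    total_keys is a hypothetical input; the thread does not give a key count.
    """
    return total_keys // index_interval


print(f"bloom filters: {bloom_filter_mib:.0f} MiB")
print(f"expected heap: {low_gib:.2f} to {high_gib:.2f} GiB")
```

The gap between this estimate and the observed 3-8 GB is the part the thread is trying to explain; raising index_interval from 128 to 512 would cut the sample count in this sketch by a factor of four, but how much heap that frees depends on key sizes.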
Re: repair hangs
My guess would be that there was some exception during the repair and your session was aborted. Here is the code that does the repair: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java Look for the logger.info calls and compare them with your log file; that should give you a rough idea of the stage at which the repair died. -Wei - Original Message - From: Dane Miller d...@optimalsocial.com To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com Sent: Wednesday, March 13, 2013 12:32:20 PM Subject: Re: repair hangs On Wed, Mar 13, 2013 at 11:44 AM, Wei Zhu wz1...@yahoo.com wrote: Do you see anything related to merkle tree in your log? Also do a nodetool compactionstats, during merkle tree calculation, you will see validation there.
The last mention of merkle is 2 days old. compactionstats are:
$ nodetool compactionstats
pending tasks: 7
Active compaction remaining time : n/a
Does this help explain anything? Dane
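One way to follow the suggestion above (compare the log with the logger.info calls) is to pull out only the repair-related lines and look at the last one. The following is a minimal sketch; the keyword pattern is an assumption about what shows up in system.log, not a list taken from the Cassandra source, so adjust it to the messages your version actually writes.

```python
import re

# Keywords are an assumption; tune them to the messages your version logs.
REPAIR_PATTERN = re.compile(r"AntiEntropy|merkle|repair", re.IGNORECASE)


def repair_lines(lines):
    """Return, in order, the log lines that mention repair-related activity."""
    return [line.rstrip("\n") for line in lines if REPAIR_PATTERN.search(line)]


def last_repair_line(lines):
    """Return the most recent repair-related line, or None if there is none."""
    matches = repair_lines(lines)
    return matches[-1] if matches else None
```

Usage would be something like `last_repair_line(open("/path/to/system.log", errors="replace"))`, where the path is whatever your installation uses. The last matching line gives a rough idea of how far the session got before it went quiet.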
Re: Size Tiered - Leveled Compaction
I have the same question. We started with the default 5M, and the compaction after repair took too long on a 200G node, so we increased the size to 10M, somewhat arbitrarily, since there is not much documentation around it. Our tech ops team still thinks there are too many files in one directory. To meet their guideline (I don't remember the exact number, but something in the range of 50K files), we would need to increase the size to around 50M. I think the latency of opening one file is not affected much by the number of files in a directory on a modern file system, but ls and other operations suffer. Anyway, I asked about the side effects of bigger SSTables on IRC, and someone mentioned that during a read C* reads the whole SSTable from disk in order to access the row, which causes more disk IO than with smaller SSTables. I don't know enough about Cassandra internals to say whether that is the case. If it is, is the SSTable or the row kept in memory? I hope someone can confirm the theory here; otherwise I will have to dig into the source code to find out. Another concern is repair: does it stream the whole SSTable or only part of it when a mismatch is detected? I have seen both claimed; can someone please confirm that as well? The last thing is the effectiveness of parallel LCS in 1.2. With LCS on 1.1.X it takes quite some time for compaction to finish after repair. Both CPU and disk utilization are low during the compaction, which means LCS doesn't fully utilize the available resources. It would make life easier if that issue is addressed in 1.2. The bottom line is that there is not much documentation, guidance, or success stories around LCS, although it sounds beautiful on paper. Thanks. 
-Wei From: Alain RODRIGUEZ arodr...@gmail.com To: user@cassandra.apache.org Cc: Wei Zhu wz1...@yahoo.com Sent: Friday, March 8, 2013 1:25 AM Subject: Re: Size Tiered - Leveled Compaction I'm still wondering about how to chose the size of the sstable under LCS. Defaul is 5MB, people use to configure it to 10MB and now you configure it at 128MB. What are the benefits or inconveniants of a very small size (let's say 5 MB) vs big size (like 128MB) ? Alain 2013/3/8 Al Tobey a...@ooyala.com We saw the exactly the same thing as Wei Zhu, 100k tables in a directory causing all kinds of issues. We're running 128MiB ssTables with LCS and have disabled compaction throttling. 128MiB was chosen to get file counts under control and reduce the number of files C* has to manage search. I just looked and a ~250GiB node is using about 10,000 files, which is quite manageable. This configuration is running smoothly in production under mixed read/write load. We're on RAID0 across 6 15k drives per machine. When we migrated data to this cluster we were pushing well over 26k/s+ inserts with CL_QUORUM. With compaction throttling enabled at any rate it just couldn't keep up. With throttling off, it runs smoothly and does not appear to have an impact on our applications, so we always leave it off, even in EC2. An 8GiB heap is too small for this config on 1.1. YMMV. -Al Tobey On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu wz1...@yahoo.com wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/seconds for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30 G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTable, we see 20 SSTable lookup for one read.. (We tried everything we can and couldn't speed it up. 
I think it's single threaded and it's not recommended to turn on multithread compaction. We even tried that, it didn't help )There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works:) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/seconds. I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable, default is 5M which is kind of small for 200G (all in one CF) data set, and we are on SSD. It more than 150K files in one directory. (200G/5M = 40K SSTable and each SSTable creates 4 files on disk) You might want to watch that and decide the SSTable size. By the way, there is no concept of Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re
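The file-count arithmetic in this thread (200G/5M = 40K SSTables, 4 files each) is easy to replay for other SSTable sizes. The sketch below assumes 4 on-disk files per SSTable, as stated in the message for 1.1; newer versions write more components per SSTable, so treat that parameter as something to check against your own data directory.

```python
def lcs_file_count(data_gb, sstable_mb, files_per_sstable=4):
    """Estimate the number of files in a CF directory under LCS.

    files_per_sstable=4 follows the figure given in the thread for 1.1;
    it is an assumption for any other version.
    """
    sstables = data_gb * 1024 / sstable_mb
    return int(sstables * files_per_sstable)


# 200 GB in one column family, for the SSTable sizes discussed in the thread.
for size_mb in (5, 10, 50, 128):
    print(f"{size_mb:>4} MB -> {lcs_file_count(200, size_mb)} files")
```

With these assumptions the default 5 MB size lands above 150K files for a 200 GB node, which matches what was reported, and 50 MB brings it comfortably under a 50K-file guideline.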
Re: should I file a bug report on this or is this normal?
It seems to be normal to explode data size during repair. For our case, we have a node around 200G with RF =3, during repair, it goes to as high as 300G. We are using LCS, it creates more than 5000 compaction tasks and takes more than a day to finish. We are on 1.1.6 There is parallel LCS feature on 1.2, it is supposed to speed up the LCS. Let us know how it goes for you since you are using LCS on 1.2 Also there are a few JIRAs related to this issue: https://issues.apache.org/jira/browse/CASSANDRA-2698 https://issues.apache.org/jira/browse/CASSANDRA-3721 Thanks. -Wei - Original Message - From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Wednesday, March 6, 2013 8:29:16 AM Subject: Re: should I file a bug report on this or is this normal? 15. Size of nreldata is now 220K ….it has exploded in size!! This may be explained by fragmentation in the sstables, which compaction would eventually resolve. During repair the data came from multiple nodes and created multiple sstables for each CF. Streaming copies part of an SSTable on the source and creates an SSTable on the destination. This pattern is different to all writes for a CF going to the same sstable when flushed. To compare apples to apples run a major compaction after the initial data load, and after the repair. 1. Why is the bloomfilter for level 5 a total of 3856 bytes for 29118(large to small) bytes of data while in the initial data it was 2192 bytes for 43038(small to large) bytes of data? The size of the BF depends on the number of rows and the false positive rate. Not the size of the -Data.db component on disk. 2. Why is there 3 levels? With such a small set of data, I would think it would flush one data file like the original data but instead there is 3 files. See above. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/03/2013, at 6:40 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I ran a pretty solid QA test(cleaned data from scratch) on version 1.2.2 My test was as so 1. Start up 4 node cassandra cluster 2. Populate with initial test data (no other data is added to system after this point!!!) 3. Run nodetool drain on every node(move stuff from commit log to sstables) 4. Stop and start cassandra cluster to have it running again 5. Get size of nreldata CF folder is 128kB 6. Go to node 3, run snapshot and mv snapshots directory OUT of nreldata 7. Get size of nreldata CF folder is 128kB 8. On node 3, run nodetool drain 9. Get size of nreldataCF folder is still 128kB 10. Stop cassandra node 11. Rm keyspace/nreldata/*.db 12. Size of nreldata CF is 8kb(odd of an empty folder but ok) 13. Start cassandra 14. Nodetool repair databus5 nreldata 15. Size of nreldata is now 220K ….it has exploded in size!! I ran this QA test as we see data size explosion in production as well(I can't be 100% sure if this is the same thing though as above is such a small data set). Would leveled compaction be a bit more stable in terms of size ratios and such. QUESTIONS 1. Why is the bloomfilter for level 5 a total of 3856 bytes for 29118(large to small) bytes of data while in the initial data it was 2192 bytes for 43038(small to large) bytes of data? 2. Why is there 3 levels? With such a small set of data, I would think it would flush one data file like the original data but instead there is 3 files. My files after repair have levels 5, 6, and 7. My files before deletion of the CF have just level 1. After repair files are -rw-rw-r--. 1 cassandra cassandra 54 Mar 6 07:18 databus5-nreldata-ib-5-CompressionInfo.db -rw-rw-r--. 1 cassandra cassandra 29118 Mar 6 07:18 databus5-nreldata-ib-5-Data.db -rw-rw-r--. 1 cassandra cassandra 3856 Mar 6 07:18 databus5-nreldata-ib-5-Filter.db -rw-rw-r--. 
1 cassandra cassandra 37000 Mar 6 07:18 databus5-nreldata-ib-5-Index.db -rw-rw-r--. 1 cassandra cassandra 4772 Mar 6 07:18 databus5-nreldata-ib-5-Statistics.db -rw-rw-r--. 1 cassandra cassandra 383 Mar 6 07:18 databus5-nreldata-ib-5-Summary.db -rw-rw-r--. 1 cassandra cassandra 79 Mar 6 07:18 databus5-nreldata-ib-5-TOC.txt -rw-rw-r--. 1 cassandra cassandra 46 Mar 6 07:18 databus5-nreldata-ib-6-CompressionInfo.db -rw-rw-r--. 1 cassandra cassandra 14271 Mar 6 07:18 databus5-nreldata-ib-6-Data.db -rw-rw-r--. 1 cassandra cassandra 816 Mar 6 07:18 databus5-nreldata-ib-6-Filter.db -rw-rw-r--. 1 cassandra cassandra 18248 Mar 6 07:18 databus5-nreldata-ib-6-Index.db -rw-rw-r--. 1 cassandra cassandra 4756 Mar 6 07:18 databus5-nreldata-ib-6-Statistics.db -rw-rw-r--. 1 cassandra cassandra 230 Mar 6 07:18 databus5-nreldata-ib-6-Summary.db -rw-rw-r--. 1 cassandra cassandra 79 Mar 6 07:18 databus5-nreldata-ib-6-TOC.txt -rw-rw-r--. 1 cassandra cassandra 46 Mar 6 07:18 databus5-nreldata-ib-7-CompressionInfo.db -rw-rw-r--. 1 cassandra cassandra 14271 Mar 6 07:18
Re: Write latency spikes
If you are tight about your SLA, try set socketTimeout from Hector with small number so that it can retry faster given the assumption that your write is idempotent. Regarding your write latency, don't have much insight. We see spike on the reads due to GC/compaction etc. But not write latency. Thanks. -Wei - Original Message - From: Jouni Hartikainen jouni.hartikai...@reaktor.fi To: user@cassandra.apache.org Sent: Wednesday, March 6, 2013 10:44:49 PM Subject: Write latency spikes Hi all, I'm experiencing strange latency spikes when writing and trying to figure out what could cause them. My setup: - 3 nodes, writing at CL.ONE using Hector client, no reads - Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no updates, RF 3 - 2 CFs have small data (row count 2000, row size 500kB, column count/row 15 000) - 1 CF has lots of binary data split into ~60kB columns (row count 550 000, row sizes 2MB, column count/row 40) - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB (bytes) / second - data is time series using timestamp as column key - Cassandra 1.2.2 with 256 vnodes on each machine - Key cache at default 100MB, no row cache - 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 1.6 (tried also 1.7), 4GB JVM heap, JNA enabled - all nodes in the same DC, 1Gb network, sub ms latencies between nodes cassandra.yaml: http://pastebin.com/MSr2prpb cfstats: http://pastebin.com/Ax5vPUcY example cfhistograms: http://pastebin.com/qYSL1MX3 example proxy histograms: http://pastebin.com/X3AGGEjh With this setup I usually get quite nice write latencies of less than 20ms, but sometimes (~once in a every few minutes) latencies momentarily spike to more than 300ms maxing out at ~2.5 seconds. Spikes are short ( 1 s) and happen on all nodes (but not at the same time). Even if avg latencies are very good, these spikes cause us headaches due to our SLA. 
While investigating I have learned the following:
- No evident GC pressure (nothing in C* logs, GC logging showing constantly 30ms collection pauses)
- No I/O bounds (disks provide ~1GB/s linear write and are mostly idle apart from memtable flushes every ~11s)
- No relation between spikes and compaction
- No queuing in memtable FlushWriter, no blocked memtable flushes
- Nothing alarming in logs
- No timeouts, no errors on the client side
- Each client (3 separate machines) experiences the latency spikes simultaneously, which points to the cause being in C*, not in the client
- CPU load < 10% (< 20% while compacting)
- Latencies measured both from the client and observed using nodetool cfhistograms
Now I'm running out of ideas about what might cause the spikes, as I have understood that there are really not that many places on the write path that could block. Any ideas? -Jouni
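The reply at the top of this thread suggests a short client-side socket timeout plus a fast retry, on the assumption that the write is idempotent. A generic sketch of that pattern follows. It does not use the Hector API; `write_fn` is a hypothetical callable standing in for one client write that raises TimeoutError when the socket timeout expires, and the default timeout and attempt count are arbitrary.

```python
def write_with_retry(write_fn, attempts=3, timeout_s=0.05):
    """Retry an idempotent write until it succeeds or attempts run out.

    write_fn is a hypothetical callable: write_fn(timeout_s) performs one
    write and raises TimeoutError if the client-side timeout expires.
    Returns (result, attempt_number) for the attempt that succeeded.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return write_fn(timeout_s), attempt
        except TimeoutError as exc:
            # Safe to repeat only because the write is assumed idempotent.
            last_error = exc
    raise last_error
```

This only hides a spike from the caller if the retry lands on a node that is not stalled at that moment, which fits the observation that the spikes do not hit all nodes at the same time. It does nothing about the cause of the spikes.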
Re: Bloom filters and LCS
Where did you read that bloom filters are off for LCS on 1.1.9? These are the two issues I can find regarding this matter: https://issues.apache.org/jira/browse/CASSANDRA-4876 https://issues.apache.org/jira/browse/CASSANDRA-5029 It looks like in 1.2 it defaults to 0.1; I am not sure about 1.1.X. -Wei - Original Message - From: Michael Theroux mthero...@yahoo.com To: user@cassandra.apache.org Sent: Thursday, March 7, 2013 1:18:38 PM Subject: Bloom filters and LCS
Hello, (Hopefully) quick question. We are running Cassandra 1.1.9. I recently converted some tables from Size Tiered to Leveled Compaction. The amount of space for Bloom Filters on these tables went down tremendously (which is expected, LCS in 1.1.9 does not use bloom filters). However, although it's far less, it's still using a number of megabytes. Why is it not zero?
Column Family:
SSTable count: 526
Space used (live): 7251063348
Space used (total): 7251063348
Number of Keys (estimate): 23895552
Memtable Columns Count: 45719
Memtable Data Size: 21207173
Memtable Switch Count: 579
Read Count: 21773431
Read Latency: 4.155 ms.
Write Count: 16183367
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Positives: 2442
Bloom Filter False Ratio: 0.00245
Bloom Filter Space Used: 44674656
Compacted row minimum size: 73
Compacted row maximum size: 105778
Compacted row mean size: 1104
Thanks, -Mike
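A quick sanity check on the cfstats output above is to work out how many bloom filter bits are being spent per key. The sketch below uses the two figures from the output and the textbook formula for an optimally configured Bloom filter; that formula is an assumption for illustration and is not necessarily how Cassandra sizes its filters, and the key count in cfstats is itself an estimate.

```python
import math

keys = 23895552          # "Number of Keys (estimate)" from the cfstats output
bloom_bytes = 44674656   # "Bloom Filter Space Used" from the cfstats output

bits_per_key = bloom_bytes * 8 / keys


def textbook_fp_rate(bits):
    """False positive rate of an optimal Bloom filter with `bits` bits per key."""
    return math.exp(-bits * math.log(2) ** 2)


def textbook_bits_per_key(fp_rate):
    """Bits per key an optimal Bloom filter needs for a target false positive rate."""
    return -math.log(fp_rate) / math.log(2) ** 2


print(f"bits per key: {bits_per_key:.1f}")
print(f"bits per key for a 0.1 target: {textbook_bits_per_key(0.1):.1f}")
```

Roughly 15 bits per key is well above what a 0.1 false positive target would need under the textbook formula (under 5 bits per key), which is at least consistent with the filters still being in use on this node rather than switched off.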
Re: Cassandra instead of memcached
It also depends on you SLA, it should work for 99% of the time. But one GC/flush/compact could screw things up big time if you have tight SLA. -Wei From: Drew Kutcharian d...@venarc.com To: user@cassandra.apache.org Sent: Wednesday, March 6, 2013 9:32 AM Subject: Re: Cassandra instead of memcached I think the dataset should fit in memory easily. The main purpose of this would be as a store for an API rate limiting/accounting system. I think ebay guys are using C* too for the same reason. Initially we were thinking of using Hazelcast or memcahed. But Hazelcast (at least the community edition) has Java gc issues with big heaps and the problem with memcached is lack of a reliable distribution (you lose a node, you need to rehash everything), so I figured why not just use C*. On Mar 6, 2013, at 9:08 AM, Edward Capriolo edlinuxg...@gmail.com wrote: If your writing much more data then RAM cassandra will not work as fast as memcache. Cassandra is not magical, if all of your data fits in memory it is going to be fast, if most of your data fits in memory it can still be fast. However if you plan on having much more data then disk you need to think about more RAM and OR SSD disks. We do not use c* as an in-memory store. However for many of our datasets we do not have a separate caching tier. In those cases cassandra is both our database and our in-memory store if you want to use those terms :) On Wed, Mar 6, 2013 at 12:02 PM, Drew Kutcharian d...@venarc.com wrote: Thanks guys, this is what I was looking for. @Edward. I definitely like crazy ideas ;), I think the only issue here is that C* is a disk space hug, so not sure if that would be feasible since free RAM is not as abundant as disk. BTW, I watched your presentation, are you guys still using C* as in-memory store? On Mar 6, 2013, at 7:44 AM, Edward Capriolo edlinuxg...@gmail.com wrote: http://www.slideshare.net/edwardcapriolo/cassandra-as-memcache Read at ONE. READ_REPAIR_CHANCE as low as possible. 
Use short TTL and short GC_GRACE. Make the in memory memtable size as high as possible to avoid flushing and compacting. Optionally turn off commit log. You can use cassandra like memcache but it is not a memcache replacement. Cassandra persists writes and compacts SSTables, memcache only has to keep data in memory. If you want to try a crazy idea. try putting your persistent data on a ram disk! Not data/system however! On Wed, Mar 6, 2013 at 2:45 AM, aaron morton aa...@thelastpickle.com wrote: consider disabling durable_writes in the KS config to remove writing to the commit log. That will speed things up for you. Note that you risk losing data is cassandra crashes or is not shut down with nodetool drain. Even if you set the gc_grace to 0, deletes will still need to be committed to disk. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com/ On 5/03/2013, at 9:51 AM, Drew Kutcharian d...@venarc.com wrote: Thanks Ben, that article was actually the reason I started thinking about removing memcached. I wanted to see what would be the optimum config to use C* as an in-memory store. -- Drew On Mar 5, 2013, at 2:39 AM, Ben Bromhead b...@instaclustr.com wrote: Check out http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html Netflix used Cassandra with SSDs and were able to drop their memcache layer. Mind you they were not using it purely as an in memory KV store. Ben Instaclustr | www.instaclustr.com | @instaclustr On 05/03/2013, at 4:33 PM, Drew Kutcharian d...@venarc.com wrote: Hi Guys, I'm thinking about using Cassandra as an in-memory key/value store instead of memcached for a new project (just to get rid of a dependency if possible). I was thinking about setting the replication factor to 1, enabling off-heap row-cache and setting gc_grace_period to zero for the CF that will be used for the key/value store. Has anyone tried this? Any comments? Thanks, Drew
Re: Poor read latency
According to this: https://issues.apache.org/jira/browse/CASSANDRA-5029 Bloom filter is still on by default for LCS in 1.2.X Thanks. -Wei From: Hiller, Dean dean.hil...@nrel.gov To: user@cassandra.apache.org user@cassandra.apache.org Sent: Monday, March 4, 2013 10:42 AM Subject: Re: Poor read latency Recommended settings are 8G RAM and your memory grows with the number of rows through index samples(configured in cassandra.yaml as samples per row something…look for the word index). Also, bloomfilters grow with RAM if using size tiered compaction. We are actually trying to switch to leveled compaction in 1.2.2 as I think the default is no bloomfilters as LCS does not really need them I think since 90% of rows are in highest tier(but this just works better for certain type profiles like very heavy read vs. the number of writes). Later, Dean From: Tom Martin tompo...@gmail.commailto:tompo...@gmail.com Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Monday, March 4, 2013 11:20 AM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: Re: Poor read latency Yeah, I just checked and the heap size 0.75 warning has been appearing. nodetool info reports: Heap Memory (MB) : 563.88 / 1014.00 Heap Memory (MB) : 646.01 / 1014.00 Heap Memory (MB) : 639.71 / 1014.00 We have plenty of free memory on each instance. Do we need bigger instances or should we just configure each node to have a bigger max heap? On Mon, Mar 4, 2013 at 6:10 PM, Hiller, Dean dean.hil...@nrel.govmailto:dean.hil...@nrel.gov wrote: What is nodetool info say for your memory? (we hit that one with memory near the max and it slowed down our system big time…still working on resolving it too). Do any logs have the hit 0.75, running compaction OR worse hit 0.85 running compaction….you get that if the above is the case typically. 
Dean From: Tom Martin tompo...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, March 4, 2013 10:31 AM To: user@cassandra.apache.org Subject: Poor read latency
Hi all, We have a small (3 node) cassandra cluster on aws. We have a replication factor of 3, a read level of local_quorum and are using the ephemeral disk. We're getting pretty poor read performance and quite high read latency in cfstats. For example:
Column Family: AgentHotel
SSTable count: 4
Space used (live): 829021175
Space used (total): 829021175
Number of Keys (estimate): 2148352
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 67204
Read Latency: 23.813 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 50
Bloom Filter False Ratio: 0.00201
Bloom Filter Space Used: 7635472
Compacted row minimum size: 259
Compacted row maximum size: 4768
Compacted row mean size: 873
For comparison we have a similar setup in another cluster for an old project (hosted on rackspace) where we're getting sub 1ms read latencies. We are using multigets on the client (Hector) but are only requesting ~40 rows per request on average. I feel like we should reasonably expect better performance but perhaps I'm mistaken. Is there anything super obvious we should be checking out?
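To put the nodetool info figures from this thread next to the 0.75 and 0.85 ratios mentioned in the replies, the heap usage can be expressed as a fraction of the maximum. This is a small sketch using only the three readings posted above; the threshold names are placeholders, since the thread only refers to them by number.

```python
# Ratios mentioned in the thread's log warnings; the names are placeholders.
FIRST_THRESHOLD = 0.75
SECOND_THRESHOLD = 0.85

# (used MB, max MB) for the three nodes, as posted from nodetool info.
heaps_mb = [(563.88, 1014.00), (646.01, 1014.00), (639.71, 1014.00)]


def heap_fraction(used_mb, max_mb):
    """Fraction of the maximum heap currently in use."""
    return used_mb / max_mb


fractions = [heap_fraction(used, cap) for used, cap in heaps_mb]
for frac in fractions:
    flag = "over 0.75" if frac >= FIRST_THRESHOLD else "below 0.75"
    print(f"{frac:.2f} ({flag})")
```

These are point-in-time readings, so a node showing 0.56 to 0.64 here can still cross 0.75 between samples, which would explain the warning appearing in the logs. With a 1 GB maximum heap there is very little headroom either way.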
Re: what size file for LCS is best for 300-500G per node?
We have 200G and ended up going with 10M. The compaction after repair takes a day to finish. Try running a repair and see how it goes. -Wei - Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Monday, March 4, 2013 10:52:27 AM Subject: what size file for LCS is best for 300-500G per node? Should we really be going with 5MB when it compresses to 3MB? That seems to be on the small side, right? We have ulimit cranked up so many files shouldn't be an issue, but maybe we should go to 10MB or 100MB or something in between? Does anyone have any experience with changing the LCS sizes? I did read somewhere that startup times when opening 100,000 files could be slow, which implies that a larger size, and so fewer files, might be better? Thanks, Dean
Re: Mutation dropped
Thanks Aaron for the great information as always. I just checked cfhistograms and only a handful of read latency are bigger than 100ms, but for proxyhistograms there are 10 times more are greater than 100ms. We are using QUORUM for reading with RF=3, and I understand coordinator needs to get the digest from other nodes and read repair on the miss match etc. But is it normal to see the latency from proxyhistograms to go beyond 100ms? Is there anyway to improve that? We are tracking the metrics from Client side and we see the 95th percentile response time averages at 40ms which is a bit high. Our 50th percentile was great under 3ms. Any suggestion is very much appreciated. Thanks. -Wei - Original Message - From: aaron morton aa...@thelastpickle.com To: Cassandra User user@cassandra.apache.org Sent: Thursday, February 21, 2013 9:20:49 AM Subject: Re: Mutation dropped What does rpc_timeout control? Only the reads/writes? Yes. like data stream, streaming_socket_timeout_in_ms in the yaml merkle tree request? Either no time out or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, and the time it takes the coordinator to do it's thing. You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is the reasonable value for roc_timeout? The default value of 10 seconds are way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. 
-Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No it is how long a node will wait for a response from other nodes before raising a TimedOutException if less than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or the same param is used to control that since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to it's commit log. Testing with (and running in prod) RF3 and CL QUROUM is a more real world scenario. 
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL =1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean other node is not able to persist the replicated data ? Is there some
Re: Mutation dropped
What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is the reasonable value for roc_timeout? The default value of 10 seconds are way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ?No it is how long a node will wait for a response from other nodes before raising a TimedOutException if less than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ?There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or the same param is used to control that since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped? Is this logic correct? Node A and B with RF=2, CL=1. Load balanced between the two.
--  Address   Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.x  746.78 GB  256     100.0%            dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
UN  10.x.x.x  880.77 GB  256     100.0%            95d59054-be99-455f-90d1-f43981d3d778  rack1
Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data? Is there some timeout associated with replicated data persistence? Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node? RF = 2, CL = 1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
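The point made in this thread, that a write can succeed for the client while a replica drops the mutation, comes down to counting acknowledgements against the consistency level. Below is a deliberately simplified sketch of that rule; the function names are made up for illustration and this is not how the coordinator is implemented.

```python
def quorum(rf):
    """Number of replicas that make up a quorum for replication factor rf."""
    return rf // 2 + 1


def request_outcome(acks, required):
    """Outcome seen by the client once rpc_timeout has passed.

    acks: replicas that responded in time.
    required: replicas the consistency level asks for.
    """
    return "success" if acks >= required else "TimedOutException"


# RF=2, CL=ONE as in the load test: one replica answers, the other drops
# the mutation, and the client still sees a success.
print(request_outcome(acks=1, required=1))

# RF=3, CL=QUORUM: two answers are needed, so a single slow replica is
# tolerated but two are not.
print(request_outcome(acks=2, required=quorum(3)))
print(request_outcome(acks=1, required=quorum(3)))
```

The dropped mutation in the first case is the inconsistency that read repair or anti-entropy repair is expected to fix later, as the wiki text quoted above says.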
Re: cassandra vs. mongodb quick question(good additional info)
From my limited experience with Mongo, it seems that Mongo only performs when the whole data set is in the memory which makes me wonder how the 40TB data works.. - Original Message - From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:02:56 AM Subject: Re: cassandra vs. mongodb quick question(good additional info) The 40 TB use case you heard about is probably one 40TB mysql machine that someone migrated to mongo so it would be web scale Cassandra is NOT good with drives that big, get a blade center or a high density chassis. On Mon, Feb 18, 2013 at 8:00 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I thought about this more, and even with a 10Gbit network, it would take 40 days to bring up a replacement node if mongodb did truly have a 42T / node like I had heard. I wrote the below email to the person I heard this from going back to basics which really puts some perspective on it….(and a lot of people don't even have a 10Gbit network like we do) Nodes are hooked up by a 10G network at most right now where that is 10gigabit. We are talking about 10Terabytes on disk per node recently. Google 10 gigabit in gigabytes gives me 1.25 gigabytes/second (yes I could have divided by 8 in my head but eh…course when I saw the number, I went duh) So trying to transfer 10 Terabytes or 10,000 Gigabytes to a node that we are bringing online to replace a dead node would take approximately 5 days??? This means no one else is using the bandwidth too ;). 10,000Gigabytes * 1 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days. This is more likely 11 days if we only use 50% of the network. So bringing a new node up to speed is more like 11 days once it is crashed. I think this is the main reason the 1Terabyte exists to begin with, right? From an ops perspective, this could sound like a nightmare scenario of waiting 10 days…..maybe it is livable though. Either way, I thought it would be good to share the numbers. 
ALSO, that is assuming the bus with its 10 disks can keep up with 10G. Can it? What is the limit of throughput on a bus per second on the computers we have, as on Wikipedia there is a huge variance? What is the rate of the disks too (multiplied by 10 of course)? Will they keep up with a 10G rate for bringing a new node online? This all comes into play even more so when you want to double the size of your cluster, of course, as all nodes have to transfer half of what they have to all the new nodes that come online (Cassandra actually has a very data center/rack aware topology to transfer data correctly so as not to use up all bandwidth unnecessarily… I am not sure MongoDB has that). Anyways, just food for thought. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Monday, February 18, 2013 1:39 PM To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no Subject: Re: cassandra vs. mongodb quick question My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations then go with it.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity: When using compression, does this affect this one way or another? Is 300G the (compressed) SSTable size, or the total size of data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it take
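The 10 Gbit back-of-the-envelope estimate earlier in this thread can be re-run with explicit unit conversions. The sketch below is only an illustration; the function name and its `utilization` parameter are made up for this example, and it models nothing but the raw link (no disk, compaction, or streaming throttle).

```python
def transfer_time_hours(data_gigabytes, link_gigabits_per_s, utilization=1.0):
    """Hours needed to move data_gigabytes over a network link."""
    gigabytes_per_s = link_gigabits_per_s / 8 * utilization  # 10 Gbit/s -> 1.25 GB/s
    seconds = data_gigabytes / gigabytes_per_s
    return seconds / 3600  # 3600 seconds per hour


full = transfer_time_hours(10_000, 10)                   # 10 TB at line rate
half = transfer_time_hours(10_000, 10, utilization=0.5)  # 10 TB at 50% of the link
print(f"{full:.2f} h at line rate, {half:.2f} h at 50% utilization")
```

Worked this way, 10,000 GB at 1.25 GB/s is 8,000 seconds, roughly 2.2 hours at line rate. The 5.55 figure quoted in the thread equals 8,000 / 60 / 24, which is one factor of 60 short of a seconds-to-days conversion. Replacement times measured in days, as Aaron reports for 1 TB nodes, would then have to come from limits other than the raw link, which is the direction the follow-up questions about bus and disk throughput point in.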
Re: Long running nodetool repair
It should not take that long. For my 200G node, it takes about an hour to calculate the Merkle tree and then stream the data. By the way, how do you know the repair is not done? If you run nodetool tpstats, it should give you the AntiEntropy session info (active/pending/completed etc.). While calculating the Merkle tree, you can see the progress from nodetool compactionstats. While streaming data, you can see the progress from nodetool netstats. You can also grep the log for Merkle and repair. - Original Message - From: Haithem Jarraya haithem.jarr...@struq.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 1:29:19 AM Subject: Long running nodetool repair Hi, I am new to Cassandra and I am not sure if this is the normal behavior, but nodetool repair runs for too long even for a small dataset per node. As I am writing, I started a nodetool repair last night at 18:41 and now it's 9:18 and it's still running; the size of my data is only ~500 MB per node. We have a 3 node cluster in DC1 with RF 3, a 1 node cluster in DC2 with RF 1, and a 1 node cluster in DC3 with RF 1, running Cassandra 1.2.1 with 256 vnodes. In the Cassandra logs I do not see AntiEntropy entries anymore, only CompactionTask and FlushWriter. Is this normal behaviour for nodetool repair? Does the running time grow linearly with the size of the data? Any help or direction will be much appreciated. Thanks, H
Re: Size Tiered - Leveled Compaction
We doubled the SSTable size to 10M. It still generates a lot of SSTables and we don't see much difference in the read latency. We are able to finish the compactions after repair within several hours. We will increase the SSTable size again if we feel the number of SSTables hurts the performance. - Original Message - From: Mike mthero...@yahoo.com To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered - Leveled Compaction Hello Wei, First thanks for this response. Out of curiosity, what SSTable size did you choose for your usecase, and what made you decide on that number? Thanks, -Mike On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 writes/second for 6 days), the first repair is painful since there is quite a lot of data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithreaded compaction. We even tried that; it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 writes/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That is more than 150K files in one directory (200G/5M = 40K SSTables and each SSTable creates 4 files on disk). You might want to watch that and decide the SSTable size.
By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered to Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks become available, but no compactions got executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks didn't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks got run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the alter table CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
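Wei's file-count arithmetic (200G/5M = 40K SSTables, 4 files each) can be turned into a small calculator for comparing SSTable sizes. This is a rough sketch: the function name is hypothetical, it uses 1 GB = 1000 MB as the thread's arithmetic does, and the 4-files-per-SSTable figure is taken from the message above rather than from any Cassandra API.

```python
def lcs_file_estimate(data_gb, sstable_mb, files_per_sstable=4):
    """Estimate SSTable count and on-disk file count for one column family."""
    sstables = data_gb * 1000 // sstable_mb  # 1 GB taken as 1000 MB, as in the thread
    return sstables, sstables * files_per_sstable


for size_mb in (5, 10, 100):
    sstables, files = lcs_file_estimate(200, size_mb)
    print(f"{size_mb:>3} MB SSTables: {sstables} SSTables, {files} files")
```

With the 5 MB default this gives 40,000 SSTables and 160,000 files, in line with the "more than 150K files" observation; doubling the size to 10 MB halves both numbers.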
Re: heap usage
We have 250G of data, running with an 8GB heap, and one of the nodes went OOM during repair. I checked the bloom filter; it is only 200M. Not sure how the memory is used; maybe take a memory dump and examine that. - Original Message - From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Friday, February 15, 2013 8:16:23 AM Subject: Re: heap usage It is not going to be true for long that LCS does not require bloom filters. https://issues.apache.org/jira/browse/CASSANDRA-5029 Apparently, without bloom filters there are issues. On Fri, Feb 15, 2013 at 7:29 AM, Blake Manders bl...@crosspixel.net wrote: You probably want to look at your bloom filters. Be forewarned though, they're difficult to change; changes to bloom filter settings only apply to new SSTables, so they might not be noticeable until a few compactions have taken place. If that is your issue, and your usage model fits it, a good alternative to the slow propagation of higher miss rates is to switch to LCS (which doesn't use bloom filters), which won't require you to make the jump to 1.2. On Fri, Feb 15, 2013 at 4:06 AM, Reik Schatz reik.sch...@gmail.com wrote: Hi, recently we are hitting some OOM: Java heap space, so I was investigating how the heap is used in Cassandra 1.2+. We use the calculated 4G heap. Our cluster is 6 nodes, around 750 GB data and a replication factor of 3. Row cache is disabled. All key cache and memtable settings are left at default. Is the primary key index kept in heap memory? We have a bunch of keyspaces and column families. Thanks, Rik -- Blake Manders | CTO Cross Pixel, Inc. | 494 8th Ave, Penthouse | NYC 10001 Website: crosspixel.net Twitter: twitter.com/CrossPix
Re: Cluster not accepting insert while one node is down
From the exception, it looks like Astyanax didn't even try to call Cassandra. My guess would be that Astyanax is token aware: it detects the node is down and doesn't even try. If you use Hector, it might try to write since it's not token aware. But as Bryan said, it will eventually fail. I guess hinted handoff won't help since the write doesn't satisfy CL.ONE. From: Bryan Talbot btal...@aeriagames.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:30 AM Subject: Re: Cluster not accepting insert while one node is down Generally data isn't written to whatever node the client connects to. In your case, a row is written to one of the nodes based on the hash of the row key. If that one replica node is down, it won't matter which coordinator node you attempt a write with CL.ONE: the write will fail. If you want the write to succeed, you could do any one of: write with CL.ANY, increase RF to 2+, write using a row key that hashes to an UP node. -Bryan On Thu, Feb 14, 2013 at 2:06 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: I will let committers or anyone that has knowledge of Cassandra internals answer this. From what I understand, you should be able to insert data on any up node with your configuration... Alain 2013/2/14 Traian Fratean traian.frat...@gmail.com You're right regarding data availability on that node. And my config, being the default one, is not suited for a cluster. What I don't get is that my 67 node was down and I was trying to insert in the 66 node, as can be seen from the stacktrace. Long story short: when node 67 was down I could not insert into any machine in the cluster. Not what I was expecting. Thank you for the reply! Traian. 2013/2/14 Alain RODRIGUEZ arodr...@gmail.com Hi Traian, There is your problem. You are using RF=1, meaning that each node is responsible for its range, and nothing more. So when a node goes down, do the math, you just can't read 1/5 of your data.
This is very cool for performance since each node owns its own part of the data and any write or read needs to reach only one node, but it reintroduces a SPOF, and having none is a main point of using C*. So you have poor availability and poor consistency. A usual configuration with 5 nodes would be RF=3 and both CL (RW) = QUORUM. This will replicate your data to 2 nodes + the natural endpoint (total of 3/5 nodes owning any data) and any read or write would need to reach at least 2 nodes before being considered successful, ensuring strong consistency. This configuration allows you to shut down a node (crash or configuration update/rolling restart) without degrading the service (at least allowing you to reach any data) but at the cost of more data on each node. Alain 2013/2/14 Traian Fratean traian.frat...@gmail.com I am using defaults for both RF and CL. As the keyspace was created using cassandra-cli the default RF should be 1 as I get it from below: [default@TestSpace] describe; Keyspace: TestSpace: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [datacenter1:1] As for the CL, it is the Astyanax default, which is 1 for both reads and writes. Traian. 2013/2/13 Alain RODRIGUEZ arodr...@gmail.com We probably need more info like the RF of your cluster and CL of your reads and writes. Maybe you could also tell us if you use vnodes or not. I heard that Astyanax was not running very smoothly on 1.2.0, but a bit better on 1.2.1. Yet, Netflix didn't release a version of Astyanax for C* 1.2. Alain 2013/2/13 Traian Fratean traian.frat...@gmail.com Hi, I have a cluster of 5 nodes running Cassandra 1.2.0. I have a Java client with Astyanax 1.56.21. When a node (10.60.15.67, different from the one in the stacktrace below) went down I get TokenRangeOfflineException and no other data gets inserted into any other node from the cluster. Am I having a configuration issue or is this supposed to happen?
com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor.trackError(CountingConnectionPoolMonitor.java:81) - com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160, latency=2057(2057), attempts=1]UnavailableException() com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160, latency=2057(2057), attempts=1]UnavailableException() at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:27) at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$1.execute(ThriftSyncConnectionFactoryImpl.java:140) at
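The replication arithmetic discussed in this thread (RF=1 with CL.ONE versus RF=3 with QUORUM) can be checked with a few lines. This is a minimal sketch with made-up function names; it assumes quorum is floor(RF/2) + 1, which matches the "at least 2 nodes" figure Alain gives for RF=3, and it is not Cassandra or Astyanax code.

```python
def quorum(replication_factor):
    """Replicas that must answer for a QUORUM request."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor, required_acks):
    """How many replicas of a row can be down while requests still succeed."""
    return replication_factor - required_acks


def overlapping_reads_and_writes(replication_factor, read_acks, write_acks):
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


for rf, acks, label in [(1, 1, "RF=1, CL.ONE"), (3, quorum(3), "RF=3, QUORUM")]:
    print(label, "needs", acks, "replica(s), tolerates",
          tolerated_failures(rf, acks), "replica(s) down")
```

With RF=1 no replica of a given row may be down, which is why writes for the token range owned by the dead node fail no matter which coordinator receives them.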
Re: Size Tiered - Leveled Compaction
I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 writes/second for 6 days), the first repair is painful since there is quite a lot of data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded and it's not recommended to turn on multithreaded compaction. We even tried that; it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 writes/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That is more than 150K files in one directory (200G/5M = 40K SSTables and each SSTable creates 4 files on disk). You might want to watch that and decide the SSTable size. By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory and it tells you the SSTable distribution among different levels. -Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help.
Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered to Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks become available, but no compactions got executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks didn't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks got run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). Couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the alter table CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
Re: Why do Datastax docs recommend Java 6?
Does anyone have first hand experience with the Zing JVM, which is claimed to be pauseless? How do they charge, per CPU? Thanks -Wei From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Wednesday, February 6, 2013 7:07 AM Subject: Re: Why do Datastax docs recommend Java 6? Oracle already did this once. It was called JRockit :) http://www.oracle.com/technetwork/middleware/jrockit/overview/index.html Typically Oracle acquires the technology and then the bits are merged with the standard JVM. On Wed, Feb 6, 2013 at 2:13 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: I would prefer Oracle to own Azul's Zing JVM over any other (GC) to provide it for free for anyone :) Best regards / Pagarbiai Viktor Jevdokimov From: jef...@gmail.com [mailto:jef...@gmail.com] Sent: Wednesday, February 06, 2013 02:23 To: user@cassandra.apache.org Subject: Re: Why do Datastax docs recommend Java 6? Oracle now owns the Sun HotSpot team, which is inarguably the highest powered Java VM team in the world. It's still really the epicenter of all Java VM development.
From: "Ilya Grebnov" i...@metricshub.com Date: Tue, 5 Feb 2013 14:09:33 -0800 To: user@cassandra.apache.org ReplyTo: user@cassandra.apache.org Subject: RE: Why do Datastax docs recommend Java 6? Also, what is the particular reason to use Oracle JDK over OpenJDK? Sorry, I could not find this information online. Thanks, Ilya From: Michael Kjellman [mailto:mkjell...@barracuda.com] Sent: Tuesday, February 05, 2013 7:29 AM To: user@cassandra.apache.org Subject: Re: Why do Datastax docs recommend Java 6? There have been tons of threads/convos on this. In the early days of Java 7 it was pretty unstable and there was pretty much no convincing reason to use Java 7 over Java 6. Now that Java 7 has stabilized and Java 6 is EOL it's a reasonable decision to use Java 7 and we do it in production with no issues to speak of. That being said there was one potential situation we've seen as a community where bootstrapping a new node was using 3x more CPU and getting significantly less throughput. However, reproducing this consistently never happened AFAIK. I think Datastax won't update their docs until more people use Java 7 in production and prove it doesn't cause any additional bugs/performance issues. For now I'd say it's a safe bet to use Java 7 with vanilla C* 1.2.1. I hope this helps! Best, Michael From: Baron Schwartz ba...@xaprb.com Reply-To: "user@cassandra.apache.org" user@cassandra.apache.org Date: Tuesday, February 5, 2013 7:21 AM To: "user@cassandra.apache.org" user@cassandra.apache.org Subject: Why do Datastax docs recommend Java 6? The Datastax docs repeatedly say (e.g. http://www.datastax.com/docs/1.2/install/install_jre) that Java 7 is not recommended, but they don't say why. It would be helpful to know this. Does anyone know? The same documentation is referenced from the Cassandra wiki, for example, http://wiki.apache.org/cassandra/GettingStarted - Baron
Re: Estimating write throughput with LeveledCompactionStrategy
I have been struggling with LCS myself. I observed that higher level compaction (from level 4 to 5) involves many more SSTables than compaction from a lower level. One compaction could take an hour or more. By the way, did you set your SSTable size to 100M? Thanks. -Wei From: Ивaн Cобoлeв sobol...@gmail.com To: user@cassandra.apache.org Sent: Wednesday, February 6, 2013 2:42 AM Subject: Estimating write throughput with LeveledCompactionStrategy Dear Community, Could anyone please give me a hand with understanding what I am missing while trying to model how LeveledCompactionStrategy works: https://docs.google.com/spreadsheet/ccc?key=0AvNacZ0w52BydDQ3N2ZPSks2OHR1dlFmMVV4d1E2eEE#gid=0 Logs mostly contain something like this: INFO [CompactionExecutor:2235] 2013-02-06 02:32:29,758 CompactionTask.java (line 221) Compacted to [chunks-hf-285962-Data.db,chunks-hf-285963-Data.db,chunks-hf-285964-Data.db,chunks-hf-285965-Data.db,chunks-hf-285966-Data.db,chunks-hf-285967-Data.db,chunks-hf-285968-Data.db,chunks-hf-285969-Data.db,chunks-hf-285970-Data.db,chunks-hf-285971-Data.db,chunks-hf-285972-Data.db,chunks-hf-285973-Data.db,chunks-hf-285974-Data.db,chunks-hf-285975-Data.db,chunks-hf-285976-Data.db,chunks-hf-285977-Data.db,chunks-hf-285978-Data.db,chunks-hf-285979-Data.db,chunks-hf-285980-Data.db,]. 2,255,863,073 to 1,908,460,931 (~84% of original) bytes for 36,868 keys at 14.965795MB/s. Time: 121,614ms. Thus the spreadsheet is parameterized with a throughput of 15MB/s and a survivor ratio of 0.9. 1) The projected result actually differs from what I observe - what am I missing? 2) Are there any metrics on write throughput with LCS per node anyone could possibly share? Thank you very much in advance, Ivan
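The two model inputs Ivan mentions (throughput and survivor ratio) can be derived directly from the figures in the quoted log line. The sketch below only redoes that arithmetic; the variable names are illustrative, and the observation about units is an inference from the numbers, not a statement about how CompactionTask computes its output.

```python
# Figures copied from the "Compacted to ..." log line in the message above.
bytes_in = 2_255_863_073
bytes_out = 1_908_460_931
elapsed_ms = 121_614

survivor_ratio = bytes_out / bytes_in
mib_per_s = bytes_out / (elapsed_ms / 1000) / (1024 * 1024)

print(f"survivor ratio: {survivor_ratio:.3f}")
print(f"throughput:     {mib_per_s:.4f} MiB/s of output")
```

The logged 14.965795MB/s is consistent with output bytes divided by elapsed time, in units of 2^20 bytes. For this particular compaction the survivor ratio is about 0.846 rather than the 0.9 used in the spreadsheet, which may account for part of the gap between the projection and the observation.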
Re: Cassandra pending compaction tasks keeps increasing
That must be it. Yes, it happens to be the seed. I should have tried rebuild. Instead I did repair and now I am sitting here waiting for the compaction to finish... Thanks. -Wei From: Derek Williams de...@fyrie.net To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Friday, February 1, 2013 1:56 PM Subject: Re: Cassandra pending compaction tasks keeps increasing Did the node list itself as a seed node in cassandra.yaml? Unless something has changed, a node that considers itself a seed will not auto bootstrap. Although I haven't tried it, I think running 'nodetool rebuild' will cause it to stream in the data it needs without doing a repair. On Wed, Jan 30, 2013 at 9:30 PM, Wei Zhu wz1...@yahoo.com wrote: Some updates: Since we still have not fully turned on the system, we did something crazy today. We tried to treat the node as a dead one (my boss wants us to practice replacing a dead node before going to full production) and bootstrap it. Here is what we did: * drain the node * check nodetool on other nodes, and this node is marked down (the token for this node is 100) * clear the data, commit log, saved cache * change initial_token from 100 to 99 in the yaml file * start the node * check nodetool, the down node of 100 disappeared by itself (!!) and new node with token 99 showed up * checked log, see the message saying bootstrap completed. But only a couple of MB streamed. * nodetool movetoken 98 * nodetool, see the node with token 98 comes up. * check log, see the message saying bootstrap completed. But still only a couple of MB streamed. The only reason I can think of is that the new node has the same IP as the dead node we tried to replace? Will that cause the symptom of no data streamed from other nodes? Other nodes still think the node had all the data? We had to do nodetool repair -pr to bring in the data. After 3 hours, 150G transferred. And no surprise, pending compaction tasks are now at 30K.
There are about 30K SSTables transferred and I guess all of them need to be compacted since we use LCS. My concern is that even if we did nothing wrong, replacing a dead node will cause such a huge backlog of pending compactions. It might take a week to clear that off. And since we have RF = 3, we still need to bring in the data for the other two replicas because we use -pr for nodetool repair. Will it take about 3 weeks to fully replace a 200G node using LCS? We tried everything we could to speed up the compaction, with no luck. The only thing I can think of is to increase the default size of the SSTable, so fewer compactions will be needed. Can I just change it in yaml and restart C* and it will correct itself? Any side effects? Since we are using SSD, a somewhat bigger SSTable won't slow down reads too much; I suppose that is the main concern with a bigger SSTable size? I think 1.2 comes with parallel LCS which should help the situation. But we are not going to upgrade for a little while. Did I miss anything? It might not be practical to use LCS for a 200G node? But if we use size-tiered compaction, we need to have at least 400G for the HD... Although SSD is cheap now, it is still hard to convince the management. Three replicas + double the disk for compaction? That is 6 times the real data size! Sorry for the long email. Any suggestion or advice? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: Cassandra User user@cassandra.apache.org Sent: Tuesday, January 29, 2013 12:59:42 PM Subject: Re: Cassandra pending compaction tasks keeps increasing * Will try it tomorrow. Do I need to restart server to change the log level? You can set it via JMX, and supposedly log4j is configured to watch the config file. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 29/01/2013, at 9:36 PM, Wei Zhu wz1...@yahoo.com wrote: Thanks for the reply. Here is some information: Do you have wide rows?
Are you seeing logging about Compacting wide rows ? * I don't see any log about wide rows Are you seeing GC activity logged or seeing CPU steal on a VM ? * There is some GC, but CPU general is under 20%. We have heap size of 8G, RAM is at 72G. Have you tried disabling multithreaded_compaction ? * By default, it's disabled. We enabled it, but doesn't see much difference. Even a little slower with it's enabled. Is it bad to enable it? We have SSD, according to comment in yaml, it should help while using SSD. Are you using Key Caches ? Have you tried disabling compaction_preheat_key_cache? * We have fairly big Key caches, we set as 10% of Heap which is 800M. Yes, compaction_preheat_key_cache is disabled. Can you enabled DEBUG level logging and make them available ? * Will try it tomorrow. Do I need to restart server to change
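Wei's "6 times the real data size" estimate for size-tiered compaction is a simple product, sketched below. The function name is hypothetical, and the 2x headroom is the figure from the thread (400G of disk for a 200G node); headroom for leveled compaction is left as a parameter because the thread does not give one.

```python
def disk_budget_gb(unique_data_gb, replication_factor, compaction_headroom):
    """Raw disk to provision cluster-wide for a given amount of unique data."""
    return unique_data_gb * replication_factor * compaction_headroom


# 100 GB of unique data, RF=3, size-tiered compaction with 2x headroom.
print(disk_budget_gb(100, 3, 2.0))
```

The same function with `replication_factor=1` gives the per-node view: 200G of data on a node with 2x headroom needs 400G of disk.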
General question regarding bootstrap and nodetool repair
Hi, After messing around with my Cassandra cluster recently, I think I need some basic understanding of how things work behind the scenes regarding data streaming. Let's say we have a three node cluster with RF = 3. If node 3 for some reason dies and I want to replace it with a new node with the same (maybe minus one) range, how is the data streamed during the bootstrap? From what I observed, node 3 has replicas for its primary range on nodes 4 and 5. So it streams the data from them and starts to compact them. Also, node 3 holds replicas for the primary range of node 2, so it streams data from node 2 and node 4. Similarly, it holds replicas for node 1, so data is streamed from node 1 and node 2. So during bootstrapping, it basically gets the data from all the replicas (2 copies each), so it will require double the disk space in order to hold the data? Over time, those SSTables will be compacted and the redundant data will be removed? Is that true? If we issue nodetool repair -pr on node 3, apart from streaming data from nodes 4 and 5 to node 3, we also see data streamed between nodes 4 and 5 since they hold the replicas. But I don't see any log about Merkle tree calculation on nodes 4 and 5. Just wondering how they know what data to stream in order to repair nodes 4 and 5? Thanks. -Wei
Re: General question regarding bootstrap and nodetool repair
I decided to dig into the source code. It looks like in the case of nodetool repair, if the current node sees a difference between the remote nodes based on the Merkle tree calculation, it will start a streamrepair session to ask the remote nodes to stream data between each other. But I am still not sure about my first question regarding the bootstrap, anyone? Thanks. -Wei From: Wei Zhu wz1...@yahoo.com To: Cassandr usergroup user@cassandra.apache.org Sent: Thursday, January 31, 2013 10:50 AM Subject: General question regarding bootstrap and nodetool repair Hi, After messing around with my Cassandra cluster recently, I think I need some basic understanding of how things work behind the scenes regarding data streaming. Let's say we have a three node cluster with RF = 3. If node 3 for some reason dies and I want to replace it with a new node with the same (maybe minus one) range, how is the data streamed during the bootstrap? From what I observed, node 3 has replicas for its primary range on nodes 4 and 5. So it streams the data from them and starts to compact them. Also, node 3 holds replicas for the primary range of node 2, so it streams data from node 2 and node 4. Similarly, it holds replicas for node 1, so data is streamed from node 1 and node 2. So during bootstrapping, it basically gets the data from all the replicas (2 copies each), so it will require double the disk space in order to hold the data? Over time, those SSTables will be compacted and the redundant data will be removed? Is that true? If we issue nodetool repair -pr on node 3, apart from streaming data from nodes 4 and 5 to node 3, we also see data streamed between nodes 4 and 5 since they hold the replicas. But I don't see any log about Merkle tree calculation on nodes 4 and 5. Just wondering how they know what data to stream in order to repair nodes 4 and 5? Thanks. -Wei
Re: General question regarding bootstrap and nodetool repair
Thanks Rob. I think you are right. Here is what I found:

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L140

It sorts the endpoints by proximity, and in

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L171

it fetches the data from only one source. That answers my question. So we will have to run repair after the bootstrap to ensure consistency.

Thanks.
-Wei

From: Rob Coli rc...@palominodb.com
To: user@cassandra.apache.org
Sent: Thursday, January 31, 2013 1:50 PM
Subject: Re: General question regarding bootstrap and nodetool repair

On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:

But I am still not sure about my first question regarding the bootstrap. Anyone?

As I understand it, bootstrap occurs from a single replica. Which replica is chosen is based on some internal estimation of which is closest/least loaded/etc. But only from a single replica, so in RF=3, in order to be consistent with both you still have to run a repair.

=Rob

--
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
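For anyone following along, the "sort by proximity, then fetch from one source" behaviour described above can be sketched like this. It is only an illustration, not the RangeStreamer code: pick_sources and the proximity scores are hypothetical, with a lower score standing in for whatever the snitch considers closer.

```python
def pick_sources(range_replicas, proximity):
    """For each range, choose the single closest replica to stream from.

    range_replicas: {range_name: [endpoint, ...]} replicas that hold the range
    proximity:      {endpoint: score}, lower means closer (hypothetical scores)
    """
    plan = {}
    for token_range, endpoints in range_replicas.items():
        ordered = sorted(endpoints, key=lambda ep: proximity[ep])
        plan[token_range] = ordered[0]  # only one source per range
    return plan


if __name__ == "__main__":
    # Ranges the replacement for node 3 needs, using the replica layout from the thread.
    ranges = {
        "primary range of node 3": ["node4", "node5"],
        "primary range of node 2": ["node2", "node4"],
        "primary range of node 1": ["node1", "node2"],
    }
    scores = {"node1": 3, "node2": 1, "node4": 2, "node5": 4}
    for token_range, source in pick_sources(ranges, scores).items():
        print(f"{token_range}: stream from {source}")
```

Since each range arrives from one replica only, the new node does not need double the disk space for bootstrap, but it can miss writes that only the other replicas have, which is why the repair afterwards is still needed.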
Re: General question regarding bootstrap and nodetool repair
One more question, though. I tried to replace a node with a new node that has the same IP. Here is what we did:

* drain the node
* check nodetool on the other nodes; this node is marked down (the token for this node is 100)
* clear the data, commit log, and saved cache on the down node
* change initial_token from 100 to 99 in the yaml file
* start the node
* check nodetool; the down node with token 100 disappeared by itself (!!) and a new node with token 99 showed up
* check the log; it has a message saying bootstrap completed, but only a couple of MB were streamed
* nodetool movetoken 98
* check nodetool; the node with token 98 comes up
* check the log; it has a message saying bootstrap completed, but still only a couple of MB were streamed

The only reason I can think of is that the new node has the same IP as the dead node we tried to replace. After reading the bootstrap code, that shouldn't be the case. Is it a bug? Has anyone tried to replace a dead node with one that has the same IP?

Thanks.
-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, January 31, 2013 3:14:59 PM
Subject: Re: General question regarding bootstrap and nodetool repair
Re: Cassandra pending compaction tasks keeps increasing
Some updates:

Since we still have not fully turned on the system, we did something crazy today. We tried to treat the node as a dead one (my boss wants us to practice replacing a dead node before going to full production) and bootstrap it. Here is what we did:

* drain the node
* check nodetool on the other nodes; this node is marked down (the token for this node is 100)
* clear the data, commit log, and saved cache
* change initial_token from 100 to 99 in the yaml file
* start the node
* check nodetool; the down node with token 100 disappeared by itself (!!) and a new node with token 99 showed up
* check the log; it has a message saying bootstrap completed, but only a couple of MB were streamed
* nodetool movetoken 98
* check nodetool; the node with token 98 comes up
* check the log; it has a message saying bootstrap completed, but still only a couple of MB were streamed

The only reason I can think of is that the new node has the same IP as the dead node we tried to replace. Will that cause the symptom of no data being streamed from other nodes? Do the other nodes still think the node has all the data?

We had to run nodetool repair -pr to bring in the data. After 3 hours, 150G had been transferred. And, no surprise, pending compaction tasks are now at 30K. About 30K SSTables were transferred, and I guess all of them need to be compacted since we use LCS.

My concern is that, even if we did nothing wrong, replacing a dead node causes such a huge backlog of pending compactions. It might take a week to clear. And since we have RF = 3, we still need to bring in the data for the other two replicas, because we used -pr for nodetool repair. So it will take about 3 weeks to fully replace a 200G node using LCS? We tried everything we could to speed up the compaction, with no luck. The only thing I can think of is to increase the default SSTable size so that fewer compactions are needed. Can I just change it in the yaml and restart C*, and it will correct itself? Any side effects? Since we are using SSD, a somewhat bigger SSTable won't slow down reads too much; I suppose that is the main concern with a bigger SSTable size?

I think 1.2 comes with parallel leveled compaction, which should help the situation. But we are not going to upgrade for a little while.

Did I miss anything? Might it not be practical to use LCS for a 200G node? But if we use size-tiered compaction, we need at least 400G of disk... Although SSD is cheap now, it is still hard to convince the management. Three replicas plus double the disk for compaction? That is 6 times the real data size!

Sorry for the long email. Any suggestion or advice?

Thanks.
-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: Cassandra User user@cassandra.apache.org
Sent: Tuesday, January 29, 2013 12:59:42 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

* Will try it tomorrow. Do I need to restart server to change the log level?

You can set it via JMX, and supposedly log4j is configured to watch the config file.

Cheers

- Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 29/01/2013, at 9:36 PM, Wei Zhu wz1...@yahoo.com wrote:
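The sizing figures in the update above (about 30K SSTables for 150G, and 6 times the real data size for size-tiered compaction with RF = 3) can be checked with a few lines of arithmetic. This is just a back-of-the-envelope sketch; the helper names are made up, and it assumes 1G = 1024M and that size-tiered compaction needs roughly 2x headroom, as stated in the thread.

```python
def sstable_count(data_gb, sstable_mb):
    """Approximate number of SSTables needed to hold data_gb at a fixed SSTable size."""
    return data_gb * 1024 // sstable_mb


def disk_multiplier(replication_factor, compaction_headroom):
    """Cluster-wide disk needed per unit of real data."""
    return replication_factor * compaction_headroom


if __name__ == "__main__":
    print(sstable_count(150, 5))    # 30720, i.e. the "about 30K" seen after the repair
    print(sstable_count(150, 10))   # 15360, doubling the SSTable size halves the count
    print(disk_multiplier(3, 2))    # 6, three replicas times double the disk
```

So a bigger SSTable size does cut the number of files in proportion, though whether it cuts the time to clear the backlog depends on how fast each compaction actually runs.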
Re: Cassandra pending compaction tasks keeps increasing
Thanks for the reply. Here is some information:

Do you have wide rows? Are you seeing logging about Compacting wide rows?
* I don't see any log about wide rows.

Are you seeing GC activity logged or seeing CPU steal on a VM?
* There is some GC, but CPU in general is under 20%. We have a heap size of 8G; RAM is 72G.

Have you tried disabling multithreaded_compaction?
* By default, it's disabled. We enabled it but didn't see much difference; it was even a little slower with it enabled. Is it bad to enable it? We have SSD, and according to the comment in the yaml it should help when using SSD.

Are you using Key Caches? Have you tried disabling compaction_preheat_key_cache?
* We have fairly big key caches; we set them to 10% of the heap, which is 800M. Yes, compaction_preheat_key_cache is disabled.

Can you enable DEBUG level logging and make the logs available?
* Will try it tomorrow. Do I need to restart the server to change the log level?

-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Monday, January 28, 2013 11:31:42 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

* Why nodetool repair increases the data size that much? It's not likely that much data needs to be repaired. Will that happen for all the subsequent repair?

Repair only detects differences in entire rows. If you have very wide rows then small differences in rows can result in a large amount of streaming. Streaming creates new SSTables on the receiving side, which then need to be compacted. So repair often results in compaction doing its thing for a while.

* How to make LCS run faster? After almost a day, the LCS tasks only dropped by 1000. I am afraid it will never catch up. We set

This is going to be tricky to diagnose, sorry for asking silly questions...

Do you have wide rows? Are you seeing logging about Compacting wide rows?
Are you seeing GC activity logged or seeing CPU steal on a VM?
Have you tried disabling multithreaded_compaction?
Are you using Key Caches? Have you tried disabling compaction_preheat_key_cache?
Can you enable DEBUG level logging and make the logs available?

Cheers

- Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 29/01/2013, at 8:59 AM, Derek Williams de...@fyrie.net wrote:

I could be wrong about this, but when repair is run, it isn't just values that are streamed between nodes, it's entire sstables. This causes a lot of duplicate data to be written which was already correct on the node, which needs to be compacted away.

As for speeding it up, no idea.

On Mon, Jan 28, 2013 at 12:16 PM, Wei Zhu wz1...@yahoo.com wrote:
Re: Cassandra pending compaction tasks keeps increasing
Any thoughts?

Thanks.
-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Friday, January 25, 2013 10:09:37 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing
Re: Cassandra pending compaction tasks keeps increasing
Two fundamental questions:

* Why did nodetool repair bring in so much data? A lot of SSTables were created, and disk space almost doubled.
* Why does leveled compaction run so slowly? We turned off throttling completely and don't see much utilization of the SSD or CPU. One example: 0.7MB/s on SSD? That is insane. Anything I can do to speed it up?

    1,837,023,925 to 1,836,694,446 (~99% of original) bytes for 1,686,604 keys at 0.717223MB/s. Time: 2,442,208ms.

Thanks.
-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Monday, January 28, 2013 11:16:47 AM
Subject: Re: Cassandra pending compaction tasks keeps increasing
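The rate quoted in a "Compacted to" log line can be reproduced from the byte count and elapsed time on the same line, which is a quick way to sanity-check numbers like the 0.7MB/s above. A small sketch follows; it assumes the rate is output bytes over wall-clock time with 1MB = 1024 * 1024 bytes, which matches the figures in this thread, and the function names are made up.

```python
def compaction_rate_mb_s(output_bytes, elapsed_ms):
    """Throughput of one compaction in MB/s (1 MB = 1024 * 1024 bytes)."""
    return output_bytes / (1024 * 1024) / (elapsed_ms / 1000)


def hours_to_rewrite(total_gb, rate_mb_s):
    """Hours needed to rewrite total_gb once at a steady rate (1 GB = 1024 MB)."""
    return total_gb * 1024 / rate_mb_s / 3600


if __name__ == "__main__":
    slow = compaction_rate_mb_s(1_836_694_446, 2_442_208)
    print(f"{slow:.6f} MB/s")
    print(f"{hours_to_rewrite(150, slow):.1f} hours to rewrite 150G once at that rate")
```

At that rate a single pass over 150G takes roughly two and a half days, and leveled compaction rewrites data more than once as it moves up the levels, so a backlog measured in days is consistent with the log line.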
Re: Cassandra pending compaction tasks keeps increasing
To recap the problem: 1.1.6 on SSD, 5 nodes, RF = 3, one CF only.

After the data load, all 5 nodes initially had very even data sizes (135G each). I ran nodetool repair -pr on node 1, which has replicas on node 2 and node 3 since we set RF = 3. It appears that a huge amount of data got transferred. Node 1 now has 220G; nodes 2 and 3 have around 170G. Pending LCS tasks on node 1 are at 15K, and nodes 2 and 3 have around 7K each.

Questions:

* Why does nodetool repair increase the data size that much? It's not likely that that much data needs to be repaired. Will that happen for all subsequent repairs?
* How can we make LCS run faster? After almost a day, the LCS tasks only dropped by 1000. I am afraid it will never catch up. We set
    * compaction_throughput_mb_per_sec = 500
    * multithreaded_compaction: true
  Both disk and CPU utilization are less than 10%. I understand LCS is single threaded; any chance to speed it up?
* We use the default SSTable size of 5M. Will increasing the SSTable size help? What will happen if I change the setting after the data is loaded?

Any suggestion is very much appreciated.

-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, January 24, 2013 11:46:04 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

I believe I am running into this one:

https://issues.apache.org/jira/browse/CASSANDRA-4765

By the way, I am using 1.1.6 (I thought I was using 1.1.7), and this one is fixed in 1.1.7.

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, January 24, 2013 11:18:59 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

Thanks Derek, in cassandra-env.sh it says

    # reduce the per-thread stack size to minimize the impact of Thrift
    # thread-per-client. (Best practice is for client connections to
    # be pooled anyway.) Only do so on Linux where it is known to be
    # supported.
    # u34 and greater need 180k
    JVM_OPTS="$JVM_OPTS -Xss180k"

What value should I use? Java defaults at 400K? Maybe try that first.

Thanks.
-Wei

- Original Message -
From: Derek Williams de...@fyrie.net
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Thursday, January 24, 2013 11:06:00 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

Increasing the stack size in cassandra-env.sh should help you get past the stack overflow. Doesn't help with your original problem though.

On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu wz1...@yahoo.com wrote:

Well, even after restart, it throws the same exception. I am basically stuck. Any suggestion to clear the pending compaction tasks? Below is the end of the stack trace:

    at com.google.common.collect.Sets$1.iterator(Sets.java:578)
    at com.google.common.collect.Sets$1.iterator(Sets.java:578)
    at com.google.common.collect.Sets$1.iterator(Sets.java:578)
    at com.google.common.collect.Sets$1.iterator(Sets.java:578)
    at com.google.common.collect.Sets$3.iterator(Sets.java:667)
    at com.google.common.collect.Sets$3.size(Sets.java:670)
    at com.google.common.collect.Iterables.size(Iterables.java:80)
    at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557)
    at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69)
    at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105)
    at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
    at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Any suggestion is very much appreciated.

-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, January 24, 2013 10:55:07 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing
Does setstreamthroughput also throttle the network traffic caused by nodetool repair?
In the yaml, there is the following setting:

    # Throttles all outbound streaming file transfers on this node to the
    # given total throughput in Mbps. This is necessary because Cassandra does
    # mostly sequential IO when streaming data during bootstrap or repair, which
    # can lead to saturating the network connection and degrading rpc performance.
    # When unset, the default is 400 Mbps or 50 MB/s.
    # stream_throughput_outbound_megabits_per_sec: 400

Is this the same value that gets set if I call nodetool setstreamthroughput? Should I call it on all the nodes in the cluster? Will that throttle the network traffic caused by nodetool repair?

Thanks.
-Wei
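The units in that comment are easy to mix up (megabits in the setting, megabytes in most monitoring), so here is a small conversion sketch. It is only arithmetic, not anything taken from Cassandra; the function names are made up and it assumes 1G = 1024M.

```python
def megabits_to_megabytes_per_sec(mbps):
    """Convert a throttle in megabits/s to megabytes/s."""
    return mbps / 8


def hours_to_stream(data_gb, mbps):
    """Hours to stream data_gb if the throttle is the only limit (1 GB = 1024 MB)."""
    return data_gb * 1024 / megabits_to_megabytes_per_sec(mbps) / 3600


def effective_mbps(data_gb, hours):
    """Average rate in megabits/s actually achieved for an observed transfer."""
    return data_gb * 1024 / (hours * 3600) * 8


if __name__ == "__main__":
    print(megabits_to_megabytes_per_sec(400))      # 50.0, as the yaml comment says
    print(round(hours_to_stream(150, 400), 2))     # time for 150G at the default cap
    print(round(effective_mbps(150, 3), 1))        # the 150G in 3 hours seen in the other thread
```

By this arithmetic, 150G in 3 hours is well under the default 400 Mbps cap, so in that case the throttle was probably not the limiting factor.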
Re: Cassandra pending compaction tasks keeps increasing
Do you mean 90% of the reads should come from 1 SSTable?

By the way, after I finished the data migration, I ran nodetool repair -pr on one of the nodes. Before the repair, all the nodes had the same disk space usage. After I ran the repair, the disk space for that node jumped from 135G to 220G, and there are more than 15000 pending compaction tasks. After a while, Cassandra started to throw exceptions like the one below and stopped compacting. I had to restart the node. By the way, we are using 1.1.7. Something doesn't seem right.

    INFO [CompactionExecutor:108804] 2013-01-24 22:23:10,427 CompactionTask.java (line 109) Compacting [SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-753782-Data.db')]
    INFO [CompactionExecutor:108804] 2013-01-24 22:23:11,610 CompactionTask.java (line 221) Compacted to [/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754996-Data.db,]. 5,259,403 to 5,259,403 (~100% of original) bytes for 1,983 keys at 4.268730MB/s. Time: 1,175ms.
    INFO [CompactionExecutor:108805] 2013-01-24 22:23:11,617 CompactionTask.java (line 109) Compacting [SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754880-Data.db')]
    INFO [CompactionExecutor:108805] 2013-01-24 22:23:12,828 CompactionTask.java (line 221) Compacted to [/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754997-Data.db,]. 5,272,746 to 5,272,746 (~100% of original) bytes for 1,941 keys at 4.152339MB/s. Time: 1,211ms.
    ERROR [CompactionExecutor:108806] 2013-01-24 22:23:13,048 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:108806,1,main]
    java.lang.StackOverflowError
        at java.util.AbstractList$Itr.hasNext(Unknown Source)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
        at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
        at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
        at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
        at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
        at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Wednesday, January 23, 2013 2:40:45 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

The histogram does not look right to me, too many SSTables for an LCS CF.

It's a symptom, not a cause. If LCS is catching up though it should be more like the distribution in the linked article.

Cheers

- Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 23/01/2013, at 10:57 AM, Jim Cistaro jcist...@netflix.com wrote:

What version are you using? Are you seeing any compaction related assertions in the logs?

Might be https://issues.apache.org/jira/browse/CASSANDRA-4411

We experienced this problem of the count only decreasing to a certain number and then stopping. If you are idle, it should go to 0. I have not seen it overestimate for zero, only for non-zero amounts.

As for timeouts etc, you will need to look at things like nodetool tpstats to see if you have pending transactions queueing up.

Jc

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Tuesday, January 22, 2013 12:56 PM
To: user@cassandra.apache.org user@cassandra.apache.org
Subject: Re: Cassandra pending compaction tasks keeps increasing

Thanks Aaron and Jim for your reply. The data import is done. We have about 135G on each node and it's about 28K SSTables. For normal operation, we only have about 90 writes per second, but when I ran nodetool compactionstats, it remains at 9 and hardly changes. I guess it's just an estimated number.

When I ran histogram,

    Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
    1       2644      0              0             0         18660057
    2       8204      0              0             0         9824270
    3       11198     0              0             0         6968475
    4       4269      6              0             0         5510745
    5       517       29             0             0         4595205

You can see about half of the reads result in 3 SSTables. The majority of read latencies are under 5ms; only a dozen are over 10ms. We haven't fully turned on reads yet, only 60 reads per second. We saw about 20 read timeouts during the past 12 hours. Not a single warning in the Cassandra log.

Is it normal for Cassandra to time out some requests? We set the rpc timeout to 1s; it shouldn't time out any of them?

Thanks.
-Wei

From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Monday, January 21, 2013 12:21 AM
Subject: Re: Cassandra pending compaction tasks keeps increasing

The main guarantee LCS gives you is that most reads will only touch 1
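To put numbers on the SSTables column of the histogram quoted above, the shares can be computed directly. This is a sketch under the assumption that Offset is the number of SSTables touched by a read and the SSTables column is how many reads fell in that bucket; the function name is made up.

```python
def sstable_read_distribution(histogram):
    """Return (sstables_touched, share, cumulative_share) for each bucket."""
    total = sum(histogram.values())
    running = 0
    rows = []
    for touched in sorted(histogram):
        running += histogram[touched]
        rows.append((touched, histogram[touched] / total, running / total))
    return rows


if __name__ == "__main__":
    # SSTables column from the histogram in the thread.
    reads = {1: 2644, 2: 8204, 3: 11198, 4: 4269, 5: 517}
    for touched, share, cumulative in sstable_read_distribution(reads):
        print(f"{touched} SSTables: {share:6.1%} of reads, {cumulative:6.1%} cumulative")
```

Read this way, only about one read in ten is served from a single SSTable, which is far from the "most reads touch 1 SSTable" behaviour expected once LCS has caught up.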
Re: Cassandra pending compaction tasks keeps increasing
Thanks Derek. In cassandra-env.sh, it says:

# reduce the per-thread stack size to minimize the impact of Thrift
# thread-per-client. (Best practice is for client connections to
# be pooled anyway.) Only do so on Linux where it is known to be
# supported.
# u34 and greater need 180k
JVM_OPTS="$JVM_OPTS -Xss180k"

What value should I use? Java defaults to 400K? Maybe I should try that first.

Thanks.
-Wei

- Original Message -
From: Derek Williams de...@fyrie.net
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Thursday, January 24, 2013 11:06:00 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

Increasing the stack size in cassandra-env.sh should help you get past the stack overflow. Doesn't help with your original problem though.

On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu wz1...@yahoo.com wrote:

Well, even after a restart, it throws the same exception. I am basically stuck. Any suggestion to clear the pending compaction tasks? Below is the end of the stack trace:

at com.google.common.collect.Sets$1.iterator(Sets.java:578)
at com.google.common.collect.Sets$1.iterator(Sets.java:578)
at com.google.common.collect.Sets$1.iterator(Sets.java:578)
at com.google.common.collect.Sets$1.iterator(Sets.java:578)
at com.google.common.collect.Sets$3.iterator(Sets.java:667)
at com.google.common.collect.Sets$3.size(Sets.java:670)
at com.google.common.collect.Iterables.size(Iterables.java:80)
at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557)
at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69)
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105)
at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Any suggestion is very much appreciated.
-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, January 24, 2013 10:55:07 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

Do you mean 90% of the reads should come from 1 SSTable?

By the way, after I finished the data migration, I ran nodetool repair -pr on one of the nodes. Before the repair, all the nodes had the same disk space usage. After the repair, the disk space for that node jumped from 135G to 220G, and there were more than 15000 pending compaction tasks. After a while, Cassandra started to throw the exception below and stopped compacting. I had to restart the node. By the way, we are using 1.1.7. Something doesn't seem right.

INFO [CompactionExecutor:108804] 2013-01-24 22:23:10,427 CompactionTask.java (line 109) Compacting [SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-753782-Data.db')]
INFO [CompactionExecutor:108804] 2013-01-24 22:23:11,610 CompactionTask.java (line 221) Compacted to [/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754996-Data.db,]. 5,259,403 to 5,259,403 (~100% of original) bytes for 1,983 keys at 4.268730MB/s. Time: 1,175ms.
INFO [CompactionExecutor:108805] 2013-01-24 22:23:11,617 CompactionTask.java (line 109) Compacting [SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754880-Data.db')]
INFO [CompactionExecutor:108805] 2013-01-24 22:23:12,828 CompactionTask.java (line 221) Compacted to [/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754997-Data.db,]. 5,272,746 to 5,272,746 (~100% of original) bytes for 1,941 keys at 4.152339MB/s. Time: 1,211ms.
ERROR [CompactionExecutor:108806] 2013-01-24 22:23:13,048
Re: Is this how to read the output of nodetool cfhistograms?
I agree that Cassandra cfhistograms output is probably the most bizarre metric display I have ever come across, although it's extremely useful. I believe the offset is actually the value being tracked (the x-axis on a traditional histogram) and the number under each column is how many times that value has been recorded (the y-axis on a traditional histogram). Your write latencies are 17, 20 and 24 (microseconds?): 3 writes took 17, 7 writes took 20 and 19 writes took 24.

Correct me if I am wrong.

Thanks.
-Wei

From: Brian Tarbox tar...@cabotresearch.com
To: user@cassandra.apache.org
Sent: Tuesday, January 22, 2013 7:27 AM
Subject: Re: Is this how to read the output of nodetool cfhistograms?

Indeed, but how many Cassandra users have the good fortune to stumble across that page? Just saying that the explanation of the very powerful nodetool commands should be more front and center.

Brian

On Tue, Jan 22, 2013 at 10:03 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

This was described in good detail here: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

On Tue, Jan 22, 2013 at 9:41 AM, Brian Tarbox tar...@cabotresearch.com wrote:

Thank you! Since this is a very non-standard way to display data it might be worth a better explanation in the various online documentation sets. Thank you again.

Brian

On Tue, Jan 22, 2013 at 9:19 AM, Mina Naguib mina.nag...@adgear.com wrote:

On 2013-01-22, at 8:59 AM, Brian Tarbox tar...@cabotresearch.com wrote:

The output of this command seems to make no sense unless I think of it as 5 completely separate histograms that just happen to be displayed together. Using this example output should I read it as: my reads all took either 1 or 2 sstable. And separately, I had write latencies of 3,7,19. And separately I had read latencies of 2, 8,69, etc? In other words...each row isn't really a row...i.e. on those 16033 reads from a single SSTable I didn't have 0 write latency, 0 read latency, 0 row size and 0 column count. Is that right?

Correct. A number in any of the metric columns is a count value bucketed in the offset on that row. There are no relationships between other columns on the same row. So your first row says 16033 reads were satisfied by 1 sstable. The other metrics (for example, latency of these reads) are reflected in the histogram under Read Latency, under various other bucketed offsets.

Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
1       16033     0              0             0         0
2       303       0              0             0         1
3       0         0              0             0         0
4       0         0              0             0         0
5       0         0              0             0         0
6       0         0              0             0         0
7       0         0              0             0         0
8       0         0              2             0         0
10      0         0              0             0         6261
12      0         0              2             0         117
14      0         0              8             0         0
17      0         3              69            0         255
20      0         7              163           0         0
24      0         19             1369          0         0
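To make the "five separate histograms" reading concrete, here is a minimal Python sketch that splits the sample output from this thread into one histogram per metric. The rows are copied from the table above (rows where every count is zero are left out); the function and variable names are made up for illustration and are not part of nodetool.

```python
# Rows copied from the sample cfhistograms output in this thread.
# Each tuple: (offset, sstables, write_latency, read_latency, row_size, column_count)
ROWS = [
    (1, 16033, 0, 0, 0, 0),
    (2, 303, 0, 0, 0, 1),
    (8, 0, 0, 2, 0, 0),
    (10, 0, 0, 0, 0, 6261),
    (12, 0, 0, 2, 0, 117),
    (14, 0, 0, 8, 0, 0),
    (17, 0, 3, 69, 0, 255),
    (20, 0, 7, 163, 0, 0),
    (24, 0, 19, 1369, 0, 0),
]
METRICS = ("sstables", "write_latency", "read_latency", "row_size", "column_count")


def split_histograms(rows):
    """Return one {offset: count} dict per metric, dropping empty buckets.

    The offset is the bucketed value (x-axis); the number in a column is how
    many times that value was recorded (y-axis). Columns are independent.
    """
    result = {name: {} for name in METRICS}
    for offset, *counts in rows:
        for name, count in zip(METRICS, counts):
            if count:
                result[name][offset] = count
    return result


hists = split_histograms(ROWS)
for name in METRICS:
    print(name, hists[name])
```

Read this way, the SSTables histogram says 16033 reads touched 1 SSTable and 303 touched 2, while the write latency histogram independently says 3 writes fell in the 17 bucket, 7 in the 20 bucket and 19 in the 24 bucket.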
Re: Cassandra pending compaction tasks keeps increasing
Thanks Aaron and Jim for your reply. The data import is done. We have about 135G on each node, in about 28K SSTables. For normal operation we only have about 90 writes per second, but when I run nodetool compactionstats, the pending count remains at 9 and hardly changes. I guess it's just an estimated number.

When I ran the histogram:

Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
1       2644      0              0             0         18660057
2       8204      0              0             0         9824270
3       11198     0              0             0         6968475
4       4269      6              0             0         5510745
5       517       29             0             0         4595205

You can see that about 40% of the reads touch 3 SSTables. The majority of read latencies are under 5ms, only a dozen are over 10ms. We haven't fully turned on reads yet, only 60 reads per second. We saw about 20 read timeouts during the past 12 hours. Not a single warning in the Cassandra log.

Is it normal for Cassandra to time out some requests? We set the rpc timeout to 1s, so it shouldn't time out any of them?

Thanks.
-Wei

From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Monday, January 21, 2013 12:21 AM
Subject: Re: Cassandra pending compaction tasks keeps increasing

The main guarantee LCS gives you is that most reads will only touch 1 SSTable http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

If compaction is falling behind this may not hold.

nodetool cfhistograms tells you how many SSTables were read from for reads. It's a recent histogram that resets each time you read from it.

Also, parallel levelled compaction in 1.2 http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2

Cheers
- Aaron Morton
Freelance Cassandra Developer, New Zealand
@aaronmorton
http://www.thelastpickle.com

On 20/01/2013, at 7:49 AM, Jim Cistaro jcist...@netflix.com wrote:

1) In addition to iostat, dstat is a good tool to see what kind of disk throughput you are getting. That would be one thing to monitor.
2) For LCS, we also see pending compactions skyrocket. During load, LCS will create a lot of small sstables which will queue up for compaction.
3) For us the biggest concern is not how high the pending count gets, but how often it gets back down near zero. If your load is something you can do in segments or pause, then you can see how fast the cluster recovers on the compactions.
4) One thing which we tune per cluster is the size of the files. Increasing this from 5MB can sometimes improve things. But I forget if we have ever changed this after starting data load.

Is your cluster receiving read traffic during this data migration? If so, I would say that read latency is your best measure. If the high number of SSTables waiting to compact is not hurting your reads, then you are probably ok. Since you are on SSD, there is a good chance the compactions are not hurting you.

As for compactionthroughput, we set ours high for SSD. You usually won't use it all because the compactions are usually single threaded. Dstat will help you measure this.

I hope this helps,
jc

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Friday, January 18, 2013 12:10 PM
To: Cassandr usergroup user@cassandra.apache.org
Subject: Cassandra pending compaction tasks keeps increasing

Hi,

When I run nodetool compactionstats I see the number of pending tasks going up steadily. I tried to increase the compaction throughput by using nodetool setcompactionthroughput. I even tried the extreme of setting it to 0 to disable the throttling.

I checked iostat, and we have SSDs for data. The disk util is less than 5%, which means it's not I/O bound. CPU is also less than 10%.

We are using leveled compaction and are in the process of migrating data. We have 4500 writes per second and very few reads. We have about 70G of data now, which will grow to 150G when the migration finishes. We only have one CF, and right now the number of SSTables is around 15000. Write latency is still under 0.1ms.

Is there anything to be concerned about? Or anything I can do to reduce the number of pending compactions?

Thanks.
-Wei
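As a quick check of the SSTables column Wei posted earlier in this thread, the counts can be turned into shares of reads. This is only a small Python sketch over the numbers from that message; it says nothing about why the distribution looks this way.

```python
# SSTables column from Wei's cfhistograms output: {sstables_touched: reads}
sstables_per_read = {1: 2644, 2: 8204, 3: 11198, 4: 4269, 5: 517}

total = sum(sstables_per_read.values())
shares = {n: count / total for n, count in sstables_per_read.items()}

for n, share in shares.items():
    print(f"{n} SSTable(s): {share:.1%}")

# Fraction of reads that needed more than one SSTable.
multi = 1 - shares[1]
print(f"more than one SSTable: {multi:.1%}")
```

On these numbers, a bit over 40% of reads touched three SSTables and just under 10% were served from a single SSTable, which is far from the "most reads touch one SSTable" behaviour Aaron describes for LCS that has caught up.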
Cassandra pending compaction tasks keeps increasing
Hi,

When I run nodetool compactionstats I see the number of pending tasks going up steadily. I tried to increase the compaction throughput by using nodetool setcompactionthroughput. I even tried the extreme of setting it to 0 to disable the throttling.

I checked iostat, and we have SSDs for data. The disk util is less than 5%, which means it's not I/O bound. CPU is also less than 10%.

We are using leveled compaction and are in the process of migrating data. We have 4500 writes per second and very few reads. We have about 70G of data now, which will grow to 150G when the migration finishes. We only have one CF, and right now the number of SSTables is around 15000. Write latency is still under 0.1ms.

Is there anything to be concerned about? Or anything I can do to reduce the number of pending compactions?

Thanks.
-Wei
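One way to put a number on "keeps going up" is to write down the pending count at regular intervals and look at the trend. The Python sketch below assumes the readings are collected by hand (or by a script of your own) from nodetool compactionstats; the function names and the sample readings are made up for illustration, and the pending count itself is only an estimate.

```python
def drain_rate(samples):
    """Average change in pending tasks per second between first and last sample.

    samples: list of (seconds, pending_tasks) pairs, oldest first.
    A negative result means the backlog is shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (p1 - p0) / (t1 - t0)


def seconds_to_zero(samples):
    """Rough time until pending reaches 0, or None if it is not shrinking."""
    rate = drain_rate(samples)
    if rate >= 0:
        return None
    return samples[-1][1] / -rate


# Hypothetical readings taken every 5 minutes while the load is paused.
readings = [(0, 15000), (300, 14400), (600, 13800)]
print(drain_rate(readings))
print(seconds_to_zero(readings))
```

If the rate stays positive even when writes are paused, the backlog is not draining at all, which is a different situation from compaction simply lagging behind a heavy load.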
Re: How many BATCH inserts in to many?
Another potential issue is what happens when some of the mutations fail. Are atomic batches in 1.2 designed to resolve this? http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2

-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Sunday, January 13, 2013 7:57:56 PM
Subject: Re: How many BATCH inserts in to many?

With regard to a large number of records in a batch mutation there are some potential issues. Each row becomes a task in the write thread pool on each replica. If a single client sends 1,000 rows in a mutation it will take time for the (default) 32 threads in the write pool to work through the mutations. While they are doing this other clients / requests will appear to be starved / stalled.

There are also issues with the max message size in thrift and cql over thrift.

IMHO as a rule of thumb don't go over a few hundred if you have a high number of concurrent writers.

Cheers
- Aaron Morton
Freelance Cassandra Developer, New Zealand
@aaronmorton
http://www.thelastpickle.com

On 14/01/2013, at 12:56 AM, Radim Kolar h...@filez.com wrote:

Do not use Cassandra for implementing a queueing system with high throughput. It does not scale because of tombstone management. Use HornetQ; it's an amazingly fast broker, but it has quite slow persistence if you want to create queues significantly larger than your memory and use selectors for searching for specific messages in them. My point is that for implementing a queue, a message broker is what you want.
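Aaron's rule of thumb (no more than a few hundred rows per batch under many concurrent writers) can be applied on the client by splitting a large set of mutations into smaller batches before sending. Below is a minimal Python sketch of just the splitting step; the limit of 200 is an arbitrary illustration of "a few hundred", and how each batch is actually sent depends on your client library.

```python
from itertools import islice


def chunked(mutations, max_rows=200):
    """Yield lists of at most max_rows items from any iterable of mutations."""
    if max_rows < 1:
        raise ValueError("max_rows must be positive")
    it = iter(mutations)
    while True:
        batch = list(islice(it, max_rows))
        if not batch:
            return
        yield batch


# 1,000 made-up (row key, columns) pairs become five batches of 200.
rows = [("key%d" % i, {"col": i}) for i in range(1000)]
sizes = [len(batch) for batch in chunked(rows, max_rows=200)]
print(sizes)
```

Smaller batches keep any single client from monopolising the write thread pool, at the cost of more round trips.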
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
I tried to register and got the following page, and I haven't received the email yet. I registered 10 minutes ago.

Thank you for registering to attend: Is My App a Good Fit for Apache Cassandra? Details about this webinar have also been sent to your email, including a link to the webinar's URL. Webinar Description: Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra as he examines the types of applications that are suited to be built on top of Cassandra. Eric will talk about the key considerations for designing and deploying your application on Apache Cassandra.

How come it's saying "Is My App a Good Fit for Apache Cassandra?", which was the previous webinar?

Thanks.
-Wei

From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Thursday, December 13, 2012 7:23 AM
Subject: Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

It should be good stuff. Brian eats this stuff for lunch.

On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote:

FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in.

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
Never mind, the email arrived after 15 minutes or so...

From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, December 13, 2012 10:06 AM
Subject: Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

I tried to register and got the following page, and I haven't received the email yet. I registered 10 minutes ago.

Thank you for registering to attend: Is My App a Good Fit for Apache Cassandra? Details about this webinar have also been sent to your email, including a link to the webinar's URL. Webinar Description: Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra as he examines the types of applications that are suited to be built on top of Cassandra. Eric will talk about the key considerations for designing and deploying your application on Apache Cassandra.

How come it's saying "Is My App a Good Fit for Apache Cassandra?", which was the previous webinar?

Thanks.
-Wei

From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Thursday, December 13, 2012 7:23 AM
Subject: Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

It should be good stuff. Brian eats this stuff for lunch.

On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote:

FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in.

-brian

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
multiget_slice SlicePredicate
I know it's probably not a good idea to use multiget, but for my use case it's the only choice. I have a question regarding the SlicePredicate argument of multiget_slice.

The SlicePredicate takes slice_range, which takes start, end and range. I suppose start and end apply to each individual row. How about range: is it a cumulative column count across all the rows, or does it apply to each individual row? If I set range to 100, is it 100 columns per row, or in total?

Thanks for your reply,
-Wei

multiget_slice
* map<string,list<ColumnOrSuperColumn>> multiget_slice(list<binary> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
Re: multiget_slice SlicePredicate
Well, I'm not sure how parallel multiget is. Someone said it sends requests to the different nodes in parallel, and on each node they are executed sequentially. I haven't bothered looking into the source code yet. Does anyone know for sure?

I am using Hector; I just copied the thrift definition from the Cassandra site for reference. You are right, the count is for each individual row.

Thanks.
-Wei

From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com
Sent: Monday, December 10, 2012 1:13 PM
Subject: Re: multiget_slice SlicePredicate

What's wrong with multiget…parallel performance is great from multiple disks and so usually that is a good thing.

Also, something looks wrong: since you have list<binary> keys, I would expect the Map to be Map<binary, list<ColumnOrSuperColumn>>. Are you sure you have that correct? If you set range to 100, it should be 100 columns for each row, but it never hurts to run the code and verify.

Later,
Dean
PlayOrm Developer

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Monday, December 10, 2012 2:07 PM
To: Cassandr usergroup user@cassandra.apache.org
Subject: multiget_slice SlicePredicate

I know it's probably not a good idea to use multiget, but for my use case it's the only choice. I have a question regarding the SlicePredicate argument of multiget_slice.

The SlicePredicate takes slice_range, which takes start, end and range. I suppose start and end apply to each individual row. How about range: is it a cumulative column count across all the rows, or does it apply to each individual row? If I set range to 100, is it 100 columns per row, or in total?

Thanks for your reply,
-Wei

multiget_slice
* map<string,list<ColumnOrSuperColumn>> multiget_slice(list<binary> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
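The per-row behaviour confirmed above can be illustrated with a toy model in Python. This is not the Thrift client or Hector: the functions are hypothetical stand-ins, plain string ordering replaces the column comparator, and the limit that the thread calls "range" is named count here.

```python
def slice_row(columns, start="", finish="", count=100):
    """Apply a slice (start, finish, count) to a single row.

    columns: dict of column name -> value. Empty start/finish means unbounded.
    """
    names = sorted(columns)
    selected = [
        name for name in names
        if (not start or name >= start) and (not finish or name <= finish)
    ]
    return [(name, columns[name]) for name in selected[:count]]


def toy_multiget_slice(rows, keys, start="", finish="", count=100):
    """Toy stand-in for multiget_slice: the same slice is applied to every key."""
    return {key: slice_row(rows.get(key, {}), start, finish, count) for key in keys}


rows = {
    "alice": {"c%02d" % i: i for i in range(5)},
    "bob": {"c%02d" % i: i for i in range(3)},
}
result = toy_multiget_slice(rows, ["alice", "bob"], count=2)
print({key: len(cols) for key, cols in result.items()})
```

With a limit of 2, each row returns 2 columns, so the call returns 4 columns in total. A limit of 100 across N keys can therefore return up to 100 × N columns, which is worth keeping in mind for wide rows.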
Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
I think Aaron meant 300-400GB instead of 300-400MB.

Thanks.
-Wei

- Original Message -
From: Wade L Poziombka wade.l.poziom...@intel.com
To: user@cassandra.apache.org
Sent: Thursday, December 6, 2012 6:53:53 AM
Subject: RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

"Having so much data on each node is a potential bad day."

Is this discussed somewhere in the Cassandra documentation (limits, practices etc)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up on big data usage of Cassandra, meaning terabyte size databases. I do get your point about the amount of time required to recover a downed node. But this 300-400MB business is interesting to me.

Thanks in advance.
Wade

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, December 05, 2012 9:23 PM
To: user@cassandra.apache.org
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!

I would recommend having up to 300MB to 400MB per node on a regular HDD with 1GB networking.

But on the 3rd node, we suspect major compaction didn't actually finish its job…

The file list looks odd. Check the time stamps on the files. You should not have files older than when compaction started.

8GB heap

The default is 4GB max nowadays.

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?

I cannot answer that.

2) Should we restart with leveled compaction next year?

I would run some tests to see how it works for your workload.

4) Should we consider increasing the cluster capacity?

IMHO yes. You may also want to do some experiments with turning compression on if it is not already enabled.
Having so much data on each node is a potential bad day. If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node?

With RF 3, if you lose a node you have lost your redundancy. It's important to have a plan about how to get it back and how long it may take.

Hope that helps.
- Aaron Morton
Freelance Cassandra Developer, New Zealand
@aaronmorton
http://www.thelastpickle.com

On 6/12/2012, at 3:40 AM, Alexandru Sicoe adsi...@gmail.com wrote:

Hi guys,

Sorry for the late follow-up but I waited to run major compactions on all 3 nodes at a time before replying with my findings.

Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected!

But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than the rest, after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy.

The situation is maybe not so dramatic for us because in less than 2 weeks we will have a down time till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron) and 8GB heap (as suggested by Alain, thanks).

Questions:

1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below?
[Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporary extra disk space].

2) Should we restart with leveled compaction next year?
[Note: Aaron was right, we have 1 week rows which get deleted after 1 month which means older rows end up in big files => to free up space with SizeTiered we will have no choice but run major compactions which we don't know if they will work provided that we get at ~1TB / node / 1 month. You can see we are at the limit!]

3) In case we keep SizeTiered:
- How can we improve the performance of our major compactions? (we left all config parameters as default). Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compactions?
- Do we still need to run regular repair operations as well? Do these also do a major compaction or are they completely separate operations?
[Note:
Rename cluster
Hi,

I am trying to rename a cluster by following the instructions on the Wiki:

Cassandra says ClusterName mismatch: oldClusterName != newClusterName and refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, you can:

Perform these steps on each node:
1. Start the cassandra-cli connected locally to this node.
2. Run the following:
   1. use system;
   2. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('new cluster name');
   3. exit;
3. Run nodetool flush on this node.
4. Update the cassandra.yaml file for the cluster_name as the same as 2b).
5. Restart the node.

Once all nodes have had this operation performed and been restarted, nodetool ring should show all nodes as UP.

I get the following error:

Connected to: Test Cluster on 10.200.128.151/9160
Welcome to Cassandra CLI version 1.1.6
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use system;
Authenticated to keyspace: system
[default@system] set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('General Services Cluster');
system keyspace is not user-modifiable.
InvalidRequestException(why:system keyspace is not user-modifiable.)
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15974)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:797)
at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:781)
at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:909)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:222)
at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:346)

I had to remove the data directory in order to change the cluster name. Luckily it's my testing box, so no harm done. Just wondering what has been changed so that the modification is no longer allowed through the cli? What is the way to change the cluster name now without wiping out all the data?

Thanks.
-Wei
Re: Java high-level client
We are using Hector now. What is the major advantage of astyanax over Hector? Thanks. -Wei From: Andrey Ilinykh ailin...@gmail.com To: user@cassandra.apache.org Sent: Wednesday, November 28, 2012 9:37 AM Subject: Re: Java high-level client +1 On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com wrote: Netflix has a great client https://github.com/Netflix/astyanax
Re: Java high-level client
Astyanax was the son of Hector, who was Cassandra's brother in Greek mythology. So the son is doing better than the father :)

-Wei

From: Michael Kjellman mkjell...@barracuda.com
To: user@cassandra.apache.org
Sent: Wednesday, November 28, 2012 11:51 AM
Subject: Re: Java high-level client

Lots of example code, a nice API and good performance are the first things that come to mind as to why I like Astyanax better than Hector.

From: Andrey Ilinykh ailin...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, November 28, 2012 11:49 AM
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Subject: Re: Java high-level client

First of all, it is backed by Netflix. They have used it in production for a long time, so it is pretty solid. Also they have a nice tool (Priam) which makes Cassandra cloud (AWS) friendly. This is important for us.

Andrey

On Wed, Nov 28, 2012 at 11:53 AM, Wei Zhu wz1...@yahoo.com wrote:

We are using Hector now. What is the major advantage of Astyanax over Hector?

Thanks.
-Wei

From: Andrey Ilinykh ailin...@gmail.com
To: user@cassandra.apache.org
Sent: Wednesday, November 28, 2012 9:37 AM
Subject: Re: Java high-level client

+1

On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com wrote:

Netflix has a great client https://github.com/Netflix/astyanax

--
'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook
Re: Java high-level client
FYI, we are using Hector 1.0-5, which comes with cassandra-thrift 1.0.9 and libthrift 0.6.1. It works with Cassandra 1.1.6.

Totally agree it's a pain to deal with different versions of libthrift. We use Scribe for logging, and it's a bit messy over there.

Thanks.
-Wei

From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Tuesday, November 27, 2012 8:57 AM
Subject: Re: Java high-level client

Hector does not require an outdated version of thrift; you are likely using an outdated version of Hector.

Here is the long and short of it: if the thrift API changes then Hector can have compatibility issues. This happens from time to time. The main methods like get() and insert() have remained the same, but the CFMetaData objects have changed. (This causes the incompatible class stuff you are seeing.)

CQL has a different version of the same problem: the CQL syntax is versioned. For example, if you try to execute a CQL3 query as a CQL2 query it will likely fail.

In the end your code still has to be version aware. With Hector you get a compile time problem, with pure CQL you get a runtime problem.

I have always had the opinion the project should have shipped Hector with Cassandra; this would have made it obvious what version is likely to work.

The new CQL transport client is not being shipped with Cassandra either, so you will still have to match up the versions. Although they should be largely compatible, some time in the near or far future one of the clients probably won't work with one of the servers.

Edward

On Tue, Nov 27, 2012 at 11:10 AM, Michael Kjellman mkjell...@barracuda.com wrote:

Netflix has a great client https://github.com/Netflix/astyanax

On 11/27/12 7:40 AM, Peter Lin wool...@gmail.com wrote:

I use hector-client master, which is pretty stable right now. It uses the latest thrift, so you can use Hector with thrift 0.9.0. That's assuming you don't mind using the active development branch.

On Tue, Nov 27, 2012 at 10:36 AM, Carsten Schnober schno...@ids-mannheim.de wrote:

Hi,

I'm aware that this has been a frequent question, but answers are still hard to find: what's an appropriate Java high-level client?

I actually believe that the lack of a single maintained Java API that is packaged with Cassandra is quite an issue. The way the situation is right now, new users have to pick more or less randomly one of the available options from the Cassandra Wiki and find a suitable solution for their individual requirements through trial implementations. This can cause a lot of wasted time (and frustration).

Personally, I've played with Hector before figuring out that it seems to require an outdated Thrift version. Downgrading to Thrift 0.6 is not an option for me though because I use Thrift 0.9.0 in other classes of the same project. So I've had a look at Kundera and at Easy-Cassandra. Both seem to lack real documentation beyond the examples available in their Github repositories, right?

Can more experienced users recommend either one of the two or some of the other options listed at the Cassandra Wiki? I know that this strongly depends on individual requirements, but all I need are simple requests for very basic queries. So I would like to emphasize the importance of clear documentation and a stable and well-maintained API.

Any hints?

Thanks!
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
Re: row cache re-fill very slow
Last time I checked, it took about 120 seconds to load up 21125 keys with a total of about 500M in memory (we have pretty wide rows :). So it's about 4 MB/sec. Just curious Andras, how can you manage such a big row cache (10-15GB currently)? By recommendation, you will have 10% of your heap as row cache, so your heap is over 100G?? The largest heap DataStax recommends is 8GB and it seems to be a hardcoded limit in cassandra-env.sh ( # calculate 1/4 ram and cap to 8192MB). Does your GC hold up with such a big heap? In my experience, full GC could take over 20 seconds for such a big heap. Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Monday, November 19, 2012 1:00 PM Subject: Re: row cache re-fill very slow i was just wondering if anyone else is experiencing very slow ( ~ 3.5 MB/sec ) re-fill of the row cache at start up. It was mentioned the other day. What version are you on ? Do you know how many rows were loaded ? When complete it will log a message with the pattern completed loading (%d ms; %d keys) row cache for %s.%s How is the saved row cache file processed? In Version 1.1, after the SSTables have been opened the keys in the saved row cache are read one at a time and the whole row read into memory. This is a single threaded operation. In 1.2 reading the saved cache is still single threaded, but reading the rows goes through the read thread pool so is in parallel. In both cases I do not believe the cache is stored in token (or key) order. ( Admittedly whatever is going on is still much more preferable to starting with a cold row cache ) row_cache_keys_to_save in yaml may help you find a happy half way point. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/11/2012, at 3:17 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, i was just wondering if anyone else is experiencing very slow ( ~ 3.5 MB/sec ) re-fill of the row cache at start up. 
We operate with a large row cache ( 10-15GB currently ) and we already measure startup times in hours :-) How is the saved row cache file processed? Are the cached row keys simply iterated over and their respective rows read from SSTables - possibly creating random reads with small enough sstable files, if the keys were not stored in a manner optimised for a quick re-fill ? - or is there a smarter algorithm ( i.e. scan through one sstable at a time, filter rows that should be in row cache ) at work and this operation is purely disk i/o bound ? ( Admittedly whatever is going on is still much more preferable to starting with a cold row cache ) thanks! Andras Andras Szerdahelyi Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A M: +32 493 05 50 88 | Skype: sandrew84
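The refill rate quoted in Wei's reply can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes "500M" means about 500 MB of row data, takes the 120 seconds and 21125 keys from the message, and uses a made-up helper name.

```python
def refill_rate(total_mb, seconds, keys):
    """Return (MB per second, keys per second, average KB per row)."""
    mb_per_sec = total_mb / seconds
    keys_per_sec = keys / seconds
    avg_row_kb = total_mb * 1024 / keys
    return mb_per_sec, keys_per_sec, avg_row_kb


# Figures from the thread: ~500 MB, 120 s, 21125 keys.
mb_s, keys_s, row_kb = refill_rate(total_mb=500, seconds=120, keys=21125)
print(f"{mb_s:.1f} MB/s, {keys_s:.0f} keys/s, ~{row_kb:.0f} KB per row")
```

Under these assumptions the rate comes out close to the "about 4 MB/sec" Wei reports, which is in the same range as the ~3.5 MB/sec Andras observed.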
Re: unable to read saved rowcache from disk
Just curious why do you think row key will take 300 byte? If the row key is Long type, doesn't it take 8 bytes? In his case, the rowCache was 500M with 1.6M rows, so the row data is 300B. Did I miss something? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Thursday, November 15, 2012 12:15 PM Subject: Re: unable to read saved rowcache from disk For a row cache of 1,650,000: 16 byte token 300 byte row key ? and row data ? multiply by a java fudge factor or 5 or 10. Trying delete the saved cache and restarting. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/11/2012, at 8:20 PM, Wz1975 wz1...@yahoo.com wrote: Before shut down, you saw rowcache has 500m, 1.6m rows, each row average 300B, so 700k row should be a little over 200m, unless it is reading more, maybe tombstone? Or the rows on disk have grown for some reason, but row cache was not updated? Could be something else eats up the memory. You may profile memory and see who consumes the memory. Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: 3G, other jvm parameters are unchanged. On Thu, Nov 15, 2012 at 2:40 PM, Wz1975 wz1...@yahoo.com wrote: How big is your heap? Did you change the jvm parameter? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: add a counter and print out myself On Thu, Nov 15, 2012 at 1:51 PM, Wz1975 wz1...@yahoo.com wrote: Curious where did you see this? Thanks. 
-Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: OOM at deserializing 747321th row On Thu, Nov 15, 2012 at 9:08 AM, Manu Zhang owenzhang1...@gmail.com wrote: oh, as for the number of rows, it's 165. How long would you expect it to be read back? On Thu, Nov 15, 2012 at 3:57 AM, Wei Zhu wz1...@yahoo.com wrote: Good information Edward. For my case, we have good size of RAM (76G) and the heap is 8G. So I set the row cache to be 800M as recommended. Our column is kind of big, so the hit ratio for row cache is around 20%, so according to datastax, might just turn the row cache altogether. Anyway, for restart, it took about 2 minutes to load the row cache INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125keys) row cache for XXX.f2 Just for comparison, our key is long, the disk usage for row cache is 253K. (it only stores key when row cache is saved to disk, so 253KB/ 8bytes = 31625number of keys). It's about right... So for 15MB, there could be a lot of narrow rows. (if the key is Long, could be more than 1M rows) Thanks. -Wei From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, November 13, 2012 11:13 PM Subject: Re: unable to read saved rowcache from disk http://wiki.apache.org/cassandra/LargeDataSetConsiderations A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. 
On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks) Assuming a row cache 15MB and the average row is 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries. Unless the source table was very large and you can only do a small number / reads/sec. On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... what do you mean? I think it's only 15MB, which is not big. On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verify they saved row cache by re reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provieded by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I
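Aaron's back-of-the-envelope estimate earlier in this thread (token + row key + row data, times a Java fudge factor) can be written out as below. This is a sketch, not a measurement: the 8-byte key follows Wei's assumption of a Long key, the 300-byte row is the average quoted in the thread, and the function name is hypothetical. It also does not model where the memory lives, which depends on the cache provider.

```python
def row_cache_memory_bytes(rows, token_bytes, key_bytes, row_bytes, fudge):
    """Rough in-memory size of a row cache: per-entry bytes times a fudge factor."""
    return rows * (token_bytes + key_bytes + row_bytes) * fudge


rows = 1_650_000  # row count quoted by Aaron
for fudge in (5, 10):
    est = row_cache_memory_bytes(rows, token_bytes=16, key_bytes=8,
                                 row_bytes=300, fudge=fudge)
    print(f"fudge x{fudge}: ~{est / 2**30:.1f} GiB")
```

With these inputs the estimate is in the same range as the 3G heap Manu mentions, which would be consistent with running out of memory partway through deserialization.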
Re: unable to read saved rowcache from disk
Good information Edward. For my case, we have good size of RAM (76G) and the heap is 8G. So I set the row cache to be 800M as recommended. Our column is kind of big, so the hit ratio for row cache is around 20%, so according to datastax, might just turn the row cache altogether. Anyway, for restart, it took about 2 minutes to load the row cache INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125keys) row cache for XXX.f2 Just for comparison, our key is long, the disk usage for row cache is 253K. (it only stores key when row cache is saved to disk, so 253KB/ 8bytes = 31625number of keys). It's about right... So for 15MB, there could be a lot of narrow rows. (if the key is Long, could be more than 1M rows) Thanks. -Wei From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, November 13, 2012 11:13 PM Subject: Re: unable to read saved rowcache from disk http://wiki.apache.org/cassandra/LargeDataSetConsiderations A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks) Assuming a row cache 15MB and the average row is 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries. Unless the source table was very large and you can only do a small number / reads/sec. On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... what do you mean? I think it's only 15MB, which is not big. 
On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verify they saved row cache by re reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provieded by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since the data aren't huge, why Cassandra can't read it back? My Cassandra is 1.2.0-beta2.
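Wei's point is that the saved cache file holds keys only, so its size should be divided by the key size rather than by the row size. A minimal sketch of that arithmetic, assuming 8-byte Long keys, decimal KB/MB, and no per-entry overhead in the file (so the result is an upper bound); the helper name is illustrative.

```python
def keys_in_saved_cache(file_bytes, key_bytes=8):
    """Upper bound on the number of keys in a saved row cache file."""
    return file_bytes // key_bytes


print(keys_in_saved_cache(253_000))      # Wei's 253 KB file
print(keys_in_saved_cache(15_000_000))   # Manu's 15 MB file
```

For comparison, dividing 15 MB by a 300-byte average row gives 50,000 entries, which is Edward's figure; that would only hold if the file stored whole rows. If Manu's keys really are 8 bytes, the file could reference well over a million rows, which would make a multi-hour reload less surprising.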
Re: read request distribution
I am new to Cassandra, and 1.1.6 is the only version I have tested. Not sure about the old behavior; for 1.1.6, my observation is that for a brand new cluster (with no CF created), nodetool ring shows Ownership, and the value is 100/nodes. As soon as one CF is created, the column changes to Effective Ownership and the formula seems to be 100*replication factor/nodes, as Kirk mentioned. Theoretically, different keyspaces can have different replication factors. Not sure how Effective Ownership is calculated in that case. Just curious if anyone knows. Thanks. -Wei From: Kirk True k...@mustardgrain.com To: user@cassandra.apache.org Sent: Monday, November 12, 2012 4:24 PM Subject: Re: read request distribution Somewhat recently the Ownership column was changed to Effective Ownership. Previously the formula was essentially 100/nodes. Now it's 100*replication factor/nodes. So in previous releases of Cassandra it would be 100/12 = 8.33, now it would be closer to 25% (8.33*3 (assuming a replication factor of three)). Kirk On Mon, Nov 12, 2012, at 03:52 PM, Ananth Gundabattula wrote: Hi all, On an unrelated observation of the below readings, it looks like all the 3 nodes own 100% of the data. This confuses me a bit. We have a 12 node cluster with RF=3 but the effective ownership is shown as 8.33%. So here is my question. How is the ownership calculated: is the replication factor considered in the ownership calculation? (If yes, then 8.33% ownership of a cluster seems wrong to me. If not, 100% ownership for a node cluster seems wrong to me.) Am I missing something in the calculation? Regards, Ananth On Fri, Nov 9, 2012 at 4:37 PM, Wei Zhu wz1...@yahoo.com wrote: Hi All, I am doing a benchmark on a Cassandra. I have a three node cluster with RF=3. I generated 6M rows with sequence number from 1 to 6m, so the rows should be evenly distributed among the three nodes disregarding the replicates. 
I am doing a benchmark with read only requests, I generate read request for randomly generated keys from 1 to 6M. Oddly, nodetool cfstats, reports that one node has only half the requests as the other one and the third node sits in the middle. So the ratio is like 2:3:4. The node with the most read requests actually has the smallest latency and the one with the least read requests reports the largest latency. The difference is pretty big, the fastest is almost double the slowest. All three nodes have the exactly the same hardware and the data size on each node are the same since the RF is three and all of them have the complete data. I am using Hector as client and the random read request are in millions. I can't think of a reasonable explanation. Can someone please shed some lights? Thanks. -Wei
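The two ownership formulas Kirk describes can be compared directly. A small sketch, assuming evenly spaced tokens and a single keyspace; the function name is made up and this is not how nodetool computes the column internally.

```python
def ownership_percent(nodes, replication_factor=1):
    """Share of the data each node holds, capped at 100%."""
    return min(100.0, 100.0 * replication_factor / nodes)


print(round(ownership_percent(12), 2))                        # old "Ownership"
print(round(ownership_percent(12, replication_factor=3), 2))  # "Effective Ownership"
print(ownership_percent(3, replication_factor=3))             # 3 nodes, RF=3
```

This matches both observations in the thread: a 12-node cluster shows 8.33% under the old formula and about 25% with RF=3, while a 3-node cluster with RF=3 shows 100% on every node.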
Re: monitor cassandra 1.1.6 with MX4J
In my cassandra-env.sh for 1.1.6, there is no setting regarding mx4j at all. I simply dropped the mx4j jar into the lib folder and enabled JMX from cassandra-env.sh, and I can connect to the default mx4j port 8081 with no problem. I guess without the mx4j setting, it uses the default port. If you need to connect using another port, you might have to add the settings mentioned by Michal; I didn't try it myself though. Thanks. -Wei From: Michal Michalski mich...@opera.com To: user@cassandra.apache.org Sent: Monday, November 12, 2012 3:37 AM Subject: Re: monitor cassandra 1.1.6 with MX4J Hmm... It looks like it wasn't merged at some time (why?), because I can see that appropriate lines were present in a few branches. I didn't check if it works, but looking at git history tells me that you could try modifying cassandra-env.sh like this: Add this somewhere in the configuration section:

    # To use mx4j, an HTML interface for JMX, add mx4j-tools.jar to the lib/ directory.
    # By default mx4j listens on 0.0.0.0:8081. Uncomment the following lines to control
    # its listen address and port.
    MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0"
    MX4J_PORT="-Dmx4jport=8081"

And then add this:

    JVM_OPTS="$JVM_OPTS $MX4J_ADDRESS"
    JVM_OPTS="$JVM_OPTS $MX4J_PORT"

somewhere at the end of the file where JVM_OPTS is appended. Regards, Michał On 12.11.2012 12:03, Francisco Trujillo Hacha wrote: I am trying to monitor a single Cassandra node using: http://wiki.apache.org/cassandra/Operations#Monitoring_with_MX4J But I can't find the variables indicated (url and port) in the cassandra-env.sh. Where could they be?
Re: read request distribution
Thanks Tyler for the information. From the online document: QUORUM Returns the record with the most recent timestamp after a quorum of replicas has responded. It's hard to know that digest query will be sent to *one* other replica. When the node gets the request, does it become the coordinator since all the nodes have all the data in this setting? Or it will send the query to the primary node (the node who is in charge of token) and let that node be the coordinator? I would guess the latter is the case, otherwise it can't explain why the third node is always slower than the other two given the fact it's in charge of the wider columns than the other two. Thanks. -Wei From: Tyler Hobbs ty...@datastax.com To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Saturday, November 10, 2012 3:15 PM Subject: Re: read request distribution When you read at quorum, a normal read query will be sent to one replica (possibly the same node that's coordinating) and a digest query will be sent to *one* other replica, not both. Which replicas get picked for these is determined by the dynamic snitch, which will favor replicas that are responding with the lowest latency. That's why you'll see more queries going to replicas with lower latencies. The Read Count number in nodetool cfstats is for local reads, not coordination of a read request. On Fri, Nov 9, 2012 at 8:16 PM, Wei Zhu wz1...@yahoo.com wrote: I think the row whose row key falls into the token range of the high latency node is likely to have more columns than the other nodes. I have three nodes with RF = 3, so all the nodes have all the data. And CL = Quorum, meaning each request is sent to all three nodes and response is sent back to client when two of them respond. What exactly does Read Count from nodetool cfstats mean then, should it be the same across all the nodes? I checked with Hector, it uses Round Robin LB strategy. And I also tested writes, and the writes are distributed across the cluster evenly. 
Below is the output from nodetool. Any one has a clue what might happened? Node1: Read Count: 318679 Read Latency: 72.47641436367003 ms. Write Count: 158680 Write Latency: 0.07918750315099571 ms. Node 2: Read Count: 251079 Read Latency: 86.91948475579399 ms. Write Count: 158450 Write Latency: 0.1744694540864626 ms. Node 3: Read Count: 149876 Read Latency: 168.14125553123915 ms. Write Count: 157896 Write Latency: 0.06468631250949992 ms. nodetool ring Address DC Rack Status State Load Effective-Ownership Token 113427455640312821154458202477256070485 10.1.3.152 datacenter1 rack1 Up Normal 35.85 GB 100.00% 0 10.1.3.153 datacenter1 rack1 Up Normal 35.86 GB 100.00% 56713727820156410577229101238628035242 10.1.3.155 datacenter1 rack1 Up Normal 35.85 GB 100.00% 113427455640312821154458202477256070485 Keyspace: benchmark: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options: [replication_factor:3] I am really confused by the Read Count number from nodetool cfstats Really appreciate any hints.-Wei From: Wei Zhu wz1...@yahoo.com To: Cassandr usergroup user@cassandra.apache.org Sent: Thursday, November 8, 2012 9:37 PM Subject: read request distribution Hi All, I am doing a benchmark on a Cassandra. I have a three node cluster with RF=3. I generated 6M rows with sequence number from 1 to 6m, so the rows should be evenly distributed among the three nodes disregarding the replicates. I am doing a benchmark with read only requests, I generate read request for randomly generated keys from 1 to 6M. Oddly, nodetool cfstats, reports that one node has only half the requests as the other one and the third node sits in the middle. So the ratio is like 2:3:4. The node with the most read requests actually has the smallest latency and the one with the least read requests reports the largest latency. The difference is pretty big, the fastest is almost double the slowest. 
All three nodes have the exactly the same hardware and the data size on each node are the same since the RF is three and all of them have the complete data. I am using Hector as client and the random read request are in millions. I can't think of a reasonable explanation. Can someone please shed some lights? Thanks. -Wei -- Tyler Hobbs DataStax
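Tyler's explanation of a QUORUM read (one data request, one digest request, both steered towards the fastest replicas) can be illustrated with a deliberately simplified model. This is only a sketch: the real dynamic snitch uses latency scores and other heuristics that are not modelled here, the coordinator's preference for itself is ignored, and the function names are invented. The latencies are the rounded Read Latency values from the cfstats output in this thread.

```python
def quorum(replication_factor):
    """Number of replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1


def pick_replicas(latency_ms, replication_factor=3):
    """Pick the quorum-many fastest replicas: first gets the data read, the rest digests."""
    ranked = sorted(latency_ms, key=latency_ms.get)
    chosen = ranked[:quorum(replication_factor)]
    return chosen[0], chosen[1:]


latencies = {"node1": 72.5, "node2": 86.9, "node3": 168.1}
data_replica, digest_replicas = pick_replicas(latencies)
print(data_replica, digest_replicas)
```

In this model the slowest node would receive no local reads at all. In the real cluster it still received about 150k, so the actual selection is clearly less strict, but the direction of the skew is the same as in the Read Count numbers.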
Re: read request distribution
That is actually my original question. All three nodes have the complete data and all of them have the exactly the same hardware/software configuration and client uses RR to distribute the read request among the nodes, why one of them consistently report the much larger latency than the other two? Thanks for the explanation of QUORUM, it clears a lot of confusion. -Wei From: Tyler Hobbs ty...@datastax.com To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Monday, November 12, 2012 12:43 PM Subject: Re: read request distribution Whichever node gets the initial Thrift request from the client is always the coordinator; there's no concept of making another node the coordinator. As far as QUORUM goes, only two nodes need to give a response to meet the consistency level, so Cassandra only sends out two read requests: one data request, and one digest request (unless the request is selected for read repair, in which case it sends a digest request to two other nodes instead of one). If the coordinator happens to be a replica for the requested data, it will usually pick itself for the data request and send the digest query elsewhere. Since all of your nodes hold all of the data, I'm not sure what you're referring to when you say that it's in charge of the wider columns. Because of dynamic snitch behavior, you should expect to see the node with the highest latencies get the fewest queries, even if the high latency is partially *caused* by it not getting many queries (i.e. cold cache). On Mon, Nov 12, 2012 at 2:29 PM, Wei Zhu wz1...@yahoo.com wrote: Thanks Tyler for the information. From the online document: QUORUM Returns the record with the most recent timestamp after a quorum of replicas has responded. It's hard to know that digest query will be sent to *one* other replica. When the node gets the request, does it become the coordinator since all the nodes have all the data in this setting? 
Or it will send the query to the primary node (the node who is in charge of token) and let that node be the coordinator? I would guess the latter is the case, otherwise it can't explain why the third node is always slower than the other two given the fact it's in charge of the wider columns than the other two. Thanks.-Wei From: Tyler Hobbs ty...@datastax.com To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Saturday, November 10, 2012 3:15 PM Subject: Re: read request distribution When you read at quorum, a normal read query will be sent to one replica (possibly the same node that's coordinating) and a digest query will be sent to *one* other replica, not both. Which replicas get picked for these is determined by the dynamic snitch, which will favor replicas that are responding with the lowest latency. That's why you'll see more queries going to replicas with lower latencies. The Read Count number in nodetool cfstats is for local reads, not coordination of a read request. On Fri, Nov 9, 2012 at 8:16 PM, Wei Zhu wz1...@yahoo.com wrote: I think the row whose row key falls into the token range of the high latency node is likely to have more columns than the other nodes. I have three nodes with RF = 3, so all the nodes have all the data. And CL = Quorum, meaning each request is sent to all three nodes and response is sent back to client when two of them respond. What exactly does Read Count from nodetool cfstats mean then, should it be the same across all the nodes? I checked with Hector, it uses Round Robin LB strategy. And I also tested writes, and the writes are distributed across the cluster evenly. Below is the output from nodetool. Any one has a clue what might happened? Node1: Read Count: 318679 Read Latency: 72.47641436367003 ms. Write Count: 158680 Write Latency: 0.07918750315099571 ms. Node 2: Read Count: 251079 Read Latency: 86.91948475579399 ms. Write Count: 158450 Write Latency: 0.1744694540864626 ms. 
Node 3: Read Count: 149876 Read Latency: 168.14125553123915 ms. Write Count: 157896 Write Latency: 0.06468631250949992 ms. nodetool ring Address DC Rack Status State Load Effective-Ownership Token 113427455640312821154458202477256070485 10.1.3.152 datacenter1 rack1 Up Normal 35.85 GB 100.00% 0 10.1.3.153 datacenter1 rack1 Up Normal 35.86 GB 100.00% 56713727820156410577229101238628035242 10.1.3.155 datacenter1 rack1 Up Normal 35.85 GB 100.00% 113427455640312821154458202477256070485 Keyspace: benchmark: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options
Re: read request distribution
I think the row whose row key falls into the token range of the high latency node is likely to have more columns than the other nodes. I have three nodes with RF = 3, so all the nodes have all the data. And CL = Quorum, meaning each request is sent to all three nodes and response is sent back to client when two of them respond. What exactly does Read Count from nodetool cfstats mean then, should it be the same across all the nodes? I checked with Hector, it uses Round Robin LB strategy. And I also tested writes, and the writes are distributed across the cluster evenly. Below is the output from nodetool. Any one has a clue what might happened? Node1: Read Count: 318679 Read Latency: 72.47641436367003 ms. Write Count: 158680 Write Latency: 0.07918750315099571 ms. Node 2: Read Count: 251079 Read Latency: 86.91948475579399 ms. Write Count: 158450 Write Latency: 0.1744694540864626 ms. Node 3: Read Count: 149876 Read Latency: 168.14125553123915 ms. Write Count: 157896 Write Latency: 0.06468631250949992 ms. nodetool ring Address DC Rack Status State Load Effective-Ownership Token 113427455640312821154458202477256070485 10.1.3.152 datacenter1 rack1 Up Normal 35.85 GB 100.00% 0 10.1.3.153 datacenter1 rack1 Up Normal 35.86 GB 100.00% 56713727820156410577229101238628035242 10.1.3.155 datacenter1 rack1 Up Normal 35.85 GB 100.00% 113427455640312821154458202477256070485 Keyspace: benchmark: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options: [replication_factor:3] I am really confused by the Read Count number from nodetool cfstats Really appreciate any hints. -Wei From: Wei Zhu wz1...@yahoo.com To: Cassandr usergroup user@cassandra.apache.org Sent: Thursday, November 8, 2012 9:37 PM Subject: read request distribution Hi All, I am doing a benchmark on a Cassandra. I have a three node cluster with RF=3. 
I generated 6M rows with sequence number from 1 to 6m, so the rows should be evenly distributed among the three nodes disregarding the replicates. I am doing a benchmark with read only requests, I generate read request for randomly generated keys from 1 to 6M. Oddly, nodetool cfstats, reports that one node has only half the requests as the other one and the third node sits in the middle. So the ratio is like 2:3:4. The node with the most read requests actually has the smallest latency and the one with the least read requests reports the largest latency. The difference is pretty big, the fastest is almost double the slowest. All three nodes have the exactly the same hardware and the data size on each node are the same since the RF is three and all of them have the complete data. I am using Hector as client and the random read request are in millions. I can't think of a reasonable explanation. Can someone please shed some lights? Thanks. -Wei
read request distribution
Hi All, I am doing a benchmark on a Cassandra. I have a three node cluster with RF=3. I generated 6M rows with sequence number from 1 to 6m, so the rows should be evenly distributed among the three nodes disregarding the replicates. I am doing a benchmark with read only requests, I generate read request for randomly generated keys from 1 to 6M. Oddly, nodetool cfstats, reports that one node has only half the requests as the other one and the third node sits in the middle. So the ratio is like 2:3:4. The node with the most read requests actually has the smallest latency and the one with the least read requests reports the largest latency. The difference is pretty big, the fastest is almost double the slowest. All three nodes have the exactly the same hardware and the data size on each node are the same since the RF is three and all of them have the complete data. I am using Hector as client and the random read request are in millions. I can't think of a reasonable explanation. Can someone please shed some lights? Thanks. -Wei
Re: composite column validation_class question
Any thoughts? Thanks. -Wei From: Wei Zhu wz1...@yahoo.com To: Cassandr usergroup user@cassandra.apache.org Sent: Wednesday, November 7, 2012 12:47 PM Subject: composite column validation_class question Hi All, I am trying to design my schema using composite column. One thing I am a bit confused is how to define validation_class for the composite column, or is there a way to define it? for the composite column, I might insert different value based on the column name, for example I will insert date for column created: set user[1]['7:1:100:created'] = 1351728000; and insert String for description set user[1]['7:1:100:desc'] = my description; I don't see a way to define validation_class for composite column. Am I right? Thanks. -Wei
composite column validation_class question
Hi All, I am trying to design my schema using composite column. One thing I am a bit confused is how to define validation_class for the composite column, or is there a way to define it? for the composite column, I might insert different value based on the column name, for example I will insert date for column created: set user[1]['7:1:100:created'] = 1351728000; and insert String for description set user[1]['7:1:100:desc'] = my description; I don't see a way to define validation_class for composite column. Am I right? Thanks. -Wei
Create CF with composite column through CQL 3
I try to use CQL3 to create CF with composite columns:

    CREATE TABLE Friends (
        user_id bigint,
        friend_id bigint,
        status int,
        source int,
        created timestamp,
        lastupdated timestamp,
        PRIMARY KEY (user_id, friend_id, status, source)
    );

When I check it with cli, the composite type is a bit odd. Why is it defined as Long, Int32, Int32, UTF8? Is it supposed to be Long, Long, Int32, Int32? Did I do something wrong?

    describe friends;
    ColumnFamily: friends
      Key Validation Class: org.apache.cassandra.db.marshal.LongType
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(
        org.apache.cassandra.db.marshal.LongType,
        org.apache.cassandra.db.marshal.Int32Type,
        org.apache.cassandra.db.marshal.Int32Type,
        org.apache.cassandra.db.marshal.UTF8Type)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.1
      DC Local Read repair chance: 0.0
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Thanks. -Wei
Re: Create CF with composite column through CQL 3
Thanks Tristan and Sylvain, it all makes sense now. One follow up question regarding the composite column. It seems that for the where clause I can only restrict the query on the first composite column (friend_id, in my case). I understand it's determined by the underlying row storage structure. Is there any plan to improve that, so that I can search on the other composite columns if I don't care about the performance? cqlsh:demo> select * from friends where source = 7; Bad Request: PRIMARY KEY part source cannot be restricted (preceding part status is either not restricted or by a non-EQ relation) Thanks. -Wei From: Tristan Seligmann mithra...@mithrandi.net To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com Sent: Wednesday, October 31, 2012 10:47 AM Subject: Re: Create CF with composite column through CQL 3 On Wed, Oct 31, 2012 at 7:14 PM, Wei Zhu wz1...@yahoo.com wrote: I try to use CQL3 to create CF with composite columns, CREATE TABLE Friends ( ... user_id bigint, ... friend_id bigint, ... status int, ... source int, ... created timestamp, ... lastupdated timestamp, ... PRIMARY KEY (user_id, friend_id, status, source) ... ); When I check it with cli, the composite type is a bit odd, why it's defined as Long, Int32, Int32, UTF8, is it supposed to be Long, Long, Int32, Int32? The first component of the PRIMARY KEY (user_id) is the row key: Key Validation Class: org.apache.cassandra.db.marshal.LongType The rest of the components (friend_id, status, source) are part of the column name: Columns sorted by: org.apache.cassandra.db.marshal.CompositeType( org.apache.cassandra.db.marshal.LongType, org.apache.cassandra.db.marshal.Int32Type, org.apache.cassandra.db.marshal.Int32Type, and the final component of the column name is the CQL-level column name: org.apache.cassandra.db.marshal.UTF8Type) In this case, it will be created or lastupdated as those are the only columns not part of the PRIMARY KEY. -- mithrandi, i Ainil en-Balandor, a faer Ambar
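Tristan's mapping from the CQL3 PRIMARY KEY to the row key and composite column names can be mimicked in a few lines. This is a toy model for illustration only, not the storage engine: the helper name and the sample values are made up, and plain tuples stand in for CompositeType.

```python
PARTITION_KEY = "user_id"
CLUSTERING = ("friend_id", "status", "source")


def to_cells(row):
    """Map one CQL3 row to (row key, {composite column name: value})."""
    prefix = tuple(row[c] for c in CLUSTERING)
    cells = {prefix + (name,): value
             for name, value in row.items()
             if name != PARTITION_KEY and name not in CLUSTERING}
    return row[PARTITION_KEY], cells


row = {"user_id": 1, "friend_id": 7, "status": 1, "source": 100,
       "created": 1351728000, "lastupdated": 1351728000}
key, cells = to_cells(row)
for name in sorted(cells):
    print(key, name, cells[name])
```

Each non-key column becomes one cell whose name is (friend_id, status, source, column name), i.e. Long, Int32, Int32, UTF8, as in the cli output. Because cells within a row are sorted by that composite name, a restriction on source alone does not select a contiguous slice, which is why the query is rejected unless the preceding parts are restricted with equality.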
Re: Benifits by adding nodes to the cluster
I heard about virtual nodes, but they don't come out until 1.2. Is it easy to convert an existing installation to use virtual nodes?

Thanks.
-Wei

From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Wednesday, October 31, 2012 2:23 PM
Subject: Re: Benifits by adding nodes to the cluster

> I have been told that it's much easier to scale the cluster by doubling the number of nodes, since no token changes are needed on the existing nodes.

Yup.

> But if the number of nodes is substantial, it's not realistic to double it every time.

See the keynote from Jonathan Ellis or the talk on Virtual Nodes from Sam here: http://www.datastax.com/events/cassandrasummit2012/presentations
Virtual nodes make this sort of thing faster and easier.

> How easy is it to add, let's say, 3 additional nodes to the existing 10 nodes?

In that scenario you would need to move every node. But if you have 10 nodes you probably don't want to scale up by 3; I would guess 5 or 10. Scaling is not something you want to do every day.

How easy the process is depends on the level of automation in your environment. For example, OpsCenter can automate rebalancing nodes.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 31/10/2012, at 7:14 AM, weiz wz1...@yahoo.com wrote:

> One follow-up question. I have been told that it's much easier to scale the cluster by doubling the number of nodes, since no token changes are needed on the existing nodes. But if the number of nodes is substantial, it's not realistic to double it every time. How easy is it to add, let's say, 3 additional nodes to the existing 10 nodes? I understand the process of moving data around and deleting unused data. I just want to understand, from the operational point of view, how difficult that is.
>
> We are in the process of evaluating NoSQL solutions, and one important consideration is the operational cost. Any real-world experience is very much appreciated.
>
> Thanks.
> -Wei
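
The difference between doubling and adding 3 nodes to 10 can be checked with a little arithmetic. The sketch below is an illustration only and rests on assumptions not stated in the thread: one token per node (no virtual nodes), the RandomPartitioner token range of 0 to 2**127, and a perfectly balanced ring before and after. `balanced_tokens` and `nodes_to_move` are hypothetical helper names, not Cassandra tools.

```python
# Assumption: RandomPartitioner, whose tokens range from 0 to 2**127.
RING = 2 ** 127


def balanced_tokens(n):
    """Evenly spaced tokens for a ring of n single-token nodes."""
    return [i * RING // n for i in range(n)]


def nodes_to_move(old_n, new_n):
    """Old tokens that do not appear in the balanced ring of new_n nodes.

    A node holding such a token must be moved to keep the ring balanced.
    """
    new_tokens = set(balanced_tokens(new_n))
    return [t for t in balanced_tokens(old_n) if t not in new_tokens]


moved_when_doubling = nodes_to_move(10, 20)
moved_when_adding_3 = nodes_to_move(10, 13)

print(len(moved_when_doubling), len(moved_when_adding_3))
```

Under these assumptions, going from 10 to 20 nodes leaves every existing token in place, because each old token is also a token of the 20-node ring. Going from 10 to 13 keeps only the node at token 0; the other nine existing nodes need a new token, which is in line with the advice above to grow by 5 or 10 rather than by 3.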