Re: Read-repair working, repair not working?
Hi Aaron,

Many thanks for your reply - answers below.

Cheers,
Brian

> What CL are you using for reads and writes? I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.

CL.ONE: this is primarily for performance reasons, but also because there are only three local nodes as you suggest and we need at least some resiliency. In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM, but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests: all 100 initially fail and subsequently all 100 succeed), so I'm not sure it'll help?

> Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.

This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause? However I see some pretty aggressive data transfer speeds during the initial repairs, and the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up?

Here are the tpstats for one of these nodes:

Pool Name              Active  Pending  Completed  Blocked  All time blocked
ReadStage                   0        0    4919185        0                 0
RequestResponseStage        0        0   16869994        0                 0
MutationStage               0        0   16764910        0                 0
ReadRepairStage             0        0       3703        0                 0
ReplicateOnWriteStage       0        0          0        0                 0
GossipStage                 0        0     845225        0                 0
AntiEntropyStage            0        0      52441        0                 0
MigrationStage              0        0       4362        0                 0
MemtablePostFlusher         0        0        952        0                 0
StreamStage                 0        0         24        0                 0
FlushWriter                 0        0        960        0                 5
MiscStage                   0        0       3592        0                 0
AntiEntropySessions         4        4        121        0                 0
InternalResponseStage       0        0          0        0                 0
HintedHandoff               1        2         55        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR        150597
BINARY                  0
READ               781490
MUTATION           853846
REQUEST_RESPONSE        0

The numbers of dropped READ_REPAIR, READ and MUTATION operations are non-negligible. The nodes in Europe/North America have effectively zero dropped messages. This suggests network latency is probably a significant factor? [the network ping from Europe to a HK node is ~250ms, so I wouldn't have expected it to be such a problem?]

> It would, but the INFO logging for the AES is pretty good. I would hold off for now.

Ok.

[AES session logging] Yes, I see the expected start/end logs, so that's another thing off the list.

On 10 Feb 2013, at 20:12, aaron morton aa...@thelastpickle.com wrote:

> I'd request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
What CL are you using for reads and writes?

I see a number of dropped 'MUTATION' operations: just under 5% of the total 'MutationStage' count. Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.

> - Could anybody suggest anything specific to look at to see why the repair operations aren't having the desired effect?
I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.

> - Would increasing logging level to 'DEBUG' show read-repair activity (to confirm that this is happening, when and for what proportion of total requests)?
It would, but the INFO logging for the AES is pretty good. I would hold off for now.

> - Is there something obvious that I could be missing here?
When a new AES session starts it logs this:

logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));

When it completes it logs this
Re: Cassandra 1.1.2 - 1.1.8 upgrade
Not sure this will be useful for you, but nodetool drain hasn't worked properly for a while. If you are using counters I recommend removing the commit logs after you have drained and stopped the node, and before restarting it, to avoid replaying counter increments. https://issues.apache.org/jira/browse/CASSANDRA-4446

Hope this will help,

Alain

2013/2/11 Michal Michalski mich...@opera.com

> 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2-1.1.9 configuration for a number of days.

I'm about to upgrade my 1.1.0 cluster and http://www.datastax.com/docs/1.1/install/upgrading#info says:

If you are upgrading to Cassandra 1.1.9 from a version earlier than 1.1.7, all nodes must be upgraded before any streaming can take place. Until you upgrade all nodes, you cannot add version 1.1.7 nodes or later to a 1.1.7 or earlier cluster.

Which one is correct then? Can I run a mixed 1.1.2 (in my case 1.1.0) / 1.1.9 cluster or not?

M.
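[For reference, a minimal sketch of the per-node sequence Alain describes. The data path and service name are assumptions - adjust for your install:

  # flush memtables and stop the node accepting writes
  nodetool -h localhost drain
  sudo service cassandra stop
  # only if you use counters (see CASSANDRA-4446): clear the commit log
  # so counter mutations are not replayed on restart
  rm -f /var/lib/cassandra/commitlog/*
  # ... install the new Cassandra version here ...
  sudo service cassandra start
]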
Re: High CPU usage during repair
> What machine size?
> m1.large
If you are seeing high CPU move to an m1.xlarge, that's the sweet spot.

> That's normally ok. How many are waiting?
> I have seen 4 this morning
That's not really abnormal. The pending task count goes up when a file *may* be eligible for compaction, not when there is a compaction task waiting. If you suddenly create a number of new SSTables for a CF the pending count will rise, however one of the tasks may compact all the sstables waiting for compaction. So the count will suddenly drop as well.

> Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?
Yes. If you are seeing performance degrade during compaction or repair try reducing the throughput.

I would attribute most of the problems you have described to using m1.large.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 9:16 AM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
Thanks for the response. See my answers and questions below.
Thanks!
Tamar

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Sun, Feb 10, 2013 at 10:04 PM, aaron morton aa...@thelastpickle.com wrote:

> During repair I see high CPU consumption,
> Repair reads the data and computes a hash, this is a CPU intensive operation. Is the CPU over loaded or is just under load?
Usually just load, but in the past two weeks I have seen CPU of over 90%!

> I run Cassandra version 1.0.11, on 3 node setup on EC2 instances.
> What machine size?
m1.large

> there are compactions waiting.
> That's normally ok. How many are waiting?
I have seen 4 this morning

> I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16
> That will remove throttling on compaction and the validation compaction used for the repair. Which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect.
Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
I run repair weekly, using a scheduled cron job. During repair I see high CPU consumption, and messages in the log file:

INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264

From time to time, there are also messages of the form:

INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms

Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on a 3 node setup on EC2 instances.
I have the default settings:

compaction_throughput_mb_per_sec: 16
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_preheat_key_cache: true

I am thinking of the following solution, and wanted to ask if I am on the right track: I thought of adding a call to my repair script, before repair starts, to do: nodetool setcompactionthroughput 0, and then when repair finishes call nodetool setcompactionthroughput 16. Is this a right solution?

Thanks,
Tamar

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956
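[To make Aaron's suggestion concrete, this is the change being discussed - apply it live with nodetool and persist it in cassandra.yaml so it survives a restart; the host name is illustrative:

  nodetool -h localhost setcompactionthroughput 12
  # and in cassandra.yaml, so the value survives restarts:
  #   compaction_throughput_mb_per_sec: 12
]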
Re: CQL 3 compound row key error
That sounds like a bug, or something that is still under work. Sylvain has his finger on all things CQL. Can you raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 4:01 PM, Shahryar Sedghi shsed...@gmail.com wrote:

I am moving my application from 1.1 to 1.2.1 to utilize secondary indexes and simplify the data model. In 1.1 I was concatenating some fields into one, separated by ':', for the row key and it was a big string. In 1.2 I use a compound row key, shown in the following test case (interval and seq):

CREATE TABLE test(
  interval text,
  seq int,
  id int,
  severity int,
  PRIMARY KEY ((interval, seq), id)) WITH CLUSTERING ORDER BY (id DESC);

CREATE INDEX ON test(severity);

select * from test where severity = 3 and interval = 't' and seq = 1;

results:

Bad Request: Start key sorts after end key. This is not allowed; you probably should not specify end key at all under random partitioner

If I define the table as this:

CREATE TABLE test(
  interval text,
  id int,
  severity int,
  PRIMARY KEY (interval, id)) WITH CLUSTERING ORDER BY (id DESC);

select * from test where severity = 3 and interval = 't1';

Works fine. Is it a bug?

Thanks in Advance

Shahryar

--
Life is what happens while you are making other plans. ~ John Lennon
Re: High CPU usage during repair
Thank you very much! Due to monetary limitations I will keep the m1.large for now, but try the throughput modification.

Tamar

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Mon, Feb 11, 2013 at 11:30 AM, aaron morton aa...@thelastpickle.com wrote:

> What machine size?
> m1.large
If you are seeing high CPU move to an m1.xlarge, that's the sweet spot.

> That's normally ok. How many are waiting?
> I have seen 4 this morning
That's not really abnormal. The pending task count goes up when a file *may* be eligible for compaction, not when there is a compaction task waiting. If you suddenly create a number of new SSTables for a CF the pending count will rise, however one of the tasks may compact all the sstables waiting for compaction. So the count will suddenly drop as well.

> Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?
Yes. If you are seeing performance degrade during compaction or repair try reducing the throughput.

I would attribute most of the problems you have described to using m1.large.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 9:16 AM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
Thanks for the response. See my answers and questions below.
Thanks!
Tamar

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Sun, Feb 10, 2013 at 10:04 PM, aaron morton aa...@thelastpickle.com wrote:

> During repair I see high CPU consumption,
> Repair reads the data and computes a hash, this is a CPU intensive operation. Is the CPU over loaded or is just under load?
Usually just load, but in the past two weeks I have seen CPU of over 90%!

> I run Cassandra version 1.0.11, on 3 node setup on EC2 instances.
> What machine size?
m1.large

> there are compactions waiting.
> That's normally ok. How many are waiting?
I have seen 4 this morning

> I thought of adding a call to my repair script, before repair starts to do: nodetool setcompactionthroughput 0 and then when repair finishes call nodetool setcompactionthroughput 16
> That will remove throttling on compaction and the validation compaction used for the repair. Which may in turn add additional IO load, CPU load and GC pressure. You probably do not want to do this. Try reducing the compaction throughput to say 12 normally and see the effect.
Just to make sure I understand you correctly, you suggest that I change throughput to 12 regardless of whether repair is ongoing or not. I will do it using nodetool and change the yaml file in case a restart will occur in the future?

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 1:01 AM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
I run repair weekly, using a scheduled cron job.
During repair I see high CPU consumption, and messages in the log file:

INFO [ScheduledTasks:1] 2013-02-10 11:48:06,396 GCInspector.java (line 122) GC for ParNew: 208 ms for 1 collections, 1704786200 used; max is 3894411264

From time to time, there are also messages of the form:

INFO [ScheduledTasks:1] 2012-12-04 13:34:52,406 MessagingService.java (line 607) 1 READ messages dropped in last 5000ms

Using opscenter, jmx and nodetool compactionstats I can see that during the time the CPU consumption is high, there are compactions waiting. I run Cassandra version 1.0.11, on a 3 node setup on EC2 instances. I have the default settings:

compaction_throughput_mb_per_sec: 16
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_preheat_key_cache: true

I am thinking of the following solution, and wanted to ask if I am on the right track: I thought of adding a call to my repair script, before repair starts, to do: nodetool setcompactionthroughput 0, and then when repair finishes call nodetool setcompactionthroughput 16. Is this a right solution?

Thanks,
Tamar

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956
Re: Read-repair working, repair not working?
> CL.ONE : this is primarily for performance reasons …
This makes reasoning about correct behaviour a little harder. If there is any way you can run some tests with R + W > N strong consistency I would encourage you to do so. You will then have a baseline of what works.

> (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
The high number of inconsistencies seems to match with the massive number of dropped Mutation messages. Even if Anti Entropy is running, if the node in HK is dropping so many messages there will be inconsistencies.

It looks like the HK node is overloaded. I would check the logs for GC messages, check for VM steal in a virtualised env, check for sufficient CPU + memory resources, check for IO stress.

> 20 node cluster running v1.0.7 split between 5 data centres, I've had some intermittent issues with a new data centre (3 nodes, RF=2) I
Do all DC's have the same number of nodes?

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 9:13 PM, Brian Fleming bigbrianflem...@gmail.com wrote:

Hi Aaron,

Many thanks for your reply - answers below.

Cheers,
Brian

> What CL are you using for reads and writes? I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.

CL.ONE: this is primarily for performance reasons, but also because there are only three local nodes as you suggest and we need at least some resiliency. In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM, but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests: all 100 initially fail and subsequently all 100 succeed), so I'm not sure it'll help?

> Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.

This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause? However I see some pretty aggressive data transfer speeds during the initial repairs, and the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up?

Here are the tpstats for one of these nodes:

Pool Name              Active  Pending  Completed  Blocked  All time blocked
ReadStage                   0        0    4919185        0                 0
RequestResponseStage        0        0   16869994        0                 0
MutationStage               0        0   16764910        0                 0
ReadRepairStage             0        0       3703        0                 0
ReplicateOnWriteStage       0        0          0        0                 0
GossipStage                 0        0     845225        0                 0
AntiEntropyStage            0        0      52441        0                 0
MigrationStage              0        0       4362        0                 0
MemtablePostFlusher         0        0        952        0                 0
StreamStage                 0        0         24        0                 0
FlushWriter                 0        0        960        0                 5
MiscStage                   0        0       3592        0                 0
AntiEntropySessions         4        4        121        0                 0
InternalResponseStage       0        0          0        0                 0
HintedHandoff               1        2         55        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR        150597
BINARY                  0
READ               781490
MUTATION           853846
REQUEST_RESPONSE        0

The numbers of dropped READ_REPAIR, READ and MUTATION operations are non-negligible. The nodes in Europe/North America have effectively zero dropped messages. This suggests network latency is probably a significant factor? [the network ping from Europe to a HK node is ~250ms, so I wouldn't have expected it to be such a problem?]

> It would, but the INFO logging for the AES is pretty good. I would hold off for now.

Ok.

[AES session logging] Yes, I see the expected start/end logs, so that's another thing off the list.
On 10 Feb 2013, at 20:12, aaron morton aa...@thelastpickle.com wrote: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
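[As an aside on the R + W > N rule Aaron mentions above: with N replicas of a datum, a read that waits for R replicas and a write acknowledged by W replicas are guaranteed to overlap in at least one replica whenever

\[ R + W > N. \]

QUORUM reads and writes use \( R = W = \lfloor N/2 \rfloor + 1 \), which satisfies this for any replication factor - that is the "baseline of what works" being suggested.]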
Re: Cassandra 1.1.2 - 1.1.8 upgrade
You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The NEWS.txt file is your friend there.

As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so. Just to catch any crazy. I then upgrade all the nodes and run upgradesstables. You can stagger upgradesstables to be every RF'th node in the cluster to reduce the impact.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 8:05 PM, Michal Michalski mich...@opera.com wrote:

> 2) Upgrade one node at a time, running the cluster in a mixed 1.1.2-1.1.9 configuration for a number of days.

I'm about to upgrade my 1.1.0 cluster and http://www.datastax.com/docs/1.1/install/upgrading#info says:

If you are upgrading to Cassandra 1.1.9 from a version earlier than 1.1.7, all nodes must be upgraded before any streaming can take place. Until you upgrade all nodes, you cannot add version 1.1.7 nodes or later to a 1.1.7 or earlier cluster.

Which one is correct then? Can I run a mixed 1.1.2 (in my case 1.1.0) / 1.1.9 cluster or not?

M.
Re: Cassandra 1.1.2 - 1.1.8 upgrade
OK, thanks Aaron. I ask because NEWS.txt is not a big help in the case of the 1.1.5+ versions, because there's no info on them in it (especially on 1.1.7, which seems to be the most important one in this case, according to the DataStax upgrade instructions) ;-)

https://github.com/apache/cassandra/blob/trunk/NEWS.txt

M.

On 11.02.2013 11:05, aaron morton wrote:

You can always run them. But in some situations repair cannot be used, and in this case new nodes cannot be added. The NEWS.txt file is your friend there.

As a general rule when upgrading a cluster I move one node to the new version and let it soak in for an hour or so. Just to catch any crazy. I then upgrade all the nodes and run upgradesstables. You can stagger upgradesstables to be every RF'th node in the cluster to reduce the impact.
Re: CQL 3 compound row key error
Thanks Aaron. Opened CASSANDRA-5240: https://issues.apache.org/jira/browse/CASSANDRA-5240

On Mon, Feb 11, 2013 at 4:34 AM, aaron morton aa...@thelastpickle.com wrote:

That sounds like a bug, or something that is still under work. Sylvain has his finger on all things CQL. Can you raise a ticket on https://issues.apache.org/jira/browse/CASSANDRA

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 4:01 PM, Shahryar Sedghi shsed...@gmail.com wrote:

I am moving my application from 1.1 to 1.2.1 to utilize secondary indexes and simplify the data model. In 1.1 I was concatenating some fields into one, separated by ':', for the row key and it was a big string. In 1.2 I use a compound row key, shown in the following test case (interval and seq):

CREATE TABLE test(
  interval text,
  seq int,
  id int,
  severity int,
  PRIMARY KEY ((interval, seq), id)) WITH CLUSTERING ORDER BY (id DESC);

CREATE INDEX ON test(severity);

select * from test where severity = 3 and interval = 't' and seq = 1;

results:

Bad Request: Start key sorts after end key. This is not allowed; you probably should not specify end key at all under random partitioner

If I define the table as this:

CREATE TABLE test(
  interval text,
  id int,
  severity int,
  PRIMARY KEY (interval, id)) WITH CLUSTERING ORDER BY (id DESC);

select * from test where severity = 3 and interval = 't1';

Works fine. Is it a bug?

Thanks in Advance

Shahryar

--
Life is what happens while you are making other plans. ~ John Lennon

--
Life is what happens while you are making other plans. ~ John Lennon
RuntimeException during leveled compaction
Hi,

I'm running a 6 node Cassandra 1.1.5 cluster on EC2. We switched to leveled compaction a couple of weeks ago; this has been successful. Some days ago 3 of the nodes started to log the following exception during compaction of a particular column family:

ERROR [CompactionExecutor:726] 2013-02-11 13:02:26,582 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[CompactionExecutor:726,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(84590743047470232854915142878708713938, 3133353533383530323237303130313030303232313537303030303132393832) >= current key DecoratedKey(28357704665244162161305918843747894551, 31333430313336313830333831303130313030303230313632303030303036363338) writing into /var/cassandra/data/AdServer/EventHistory/Adserver-EventHistory-tmp-he-68638-Data.db
    at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
    at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
    at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
    at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Compaction does not happen any more for the column family, and read performance gets worse because of the growing number of data files accessed during reads. It looks like one or more of the data files are corrupt and have keys that are stored out of order.

Any help to resolve this situation would be greatly appreciated.

Thanks
Andre
Re: Cassandra jmx stats ReadCount
Are you using counters? They require a read before write. Also secondary index CF's require a read before write.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 8/02/2013, at 1:26 PM, Daning Wang dan...@netseer.com wrote:

We have an 8 node cluster on Cassandra 1.1.0, with replication factor 3. We found that when you just insert data, not only WriteCount increases, the ReadCount also increases. How could this happen? I am under the impression that ReadCount only counts the reads from clients.

Thanks,

Daning
RE: unbalanced ring
Aaron, thanks for your feedback.

.125
num_tokens: 256
# initial_token:

.126
num_tokens: 256
# initial_token:

.127
num_tokens: 256
# initial_token:

This all looks correct. So when you say to do this with a clean setup, what are you asking me to do? Is it enough to blow away /var/lib/cassandra and reload the data? Also destroy my Cassandra install (which is just un-tar) and reinstall from nothing?

Stephen Thompson
Wells Fargo Corporation
Internet Authentication & Fraud Prevention
704.427.3137 (W) | 704.807.3431 (C)

This message may contain confidential and/or privileged information, and is intended for the use of the addressee only. If you are not the addressee or authorized to receive this for the addressee, you must not use, copy, disclose, or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by reply e-mail and delete this message. Thank you for your cooperation.

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Monday, February 11, 2013 12:51 PM
To: user@cassandra.apache.org
Subject: Re: unbalanced ring

The tokens are not right, not right at all. Some are too short and some are too tall. More technically they do not appear to be randomly arranged. The tokens for the .125 node all start with -3, the .126 node only has negative tokens and the .127 node mostly has positive tokens.

Check that on each node the initial_token yaml setting is commented out, and that num_tokens is set to 256. If you can reproduce this fault with a clean setup please raise a ticket at https://issues.apache.org/jira/browse/CASSANDRA

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 8/02/2013, at 10:36 AM, stephen.m.thomp...@wellsfargo.com wrote:

I found when I tried to do queries after sending this that although it shows a ton of data, it would no longer return ANYTHING for any query ... always 0 rows. So something was severely hosed. I blew away the data and reloaded from the database ... the data set is a little smaller than before. It shows up somewhat more balanced, although I'm still curious why the third node is so much smaller than the first two.

[root@Config3482VM1 apache-cassandra-1.2.1]# bin/nodetool status
Datacenter: 28
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.28.205.125  994.89 MB  255     33.7%             3daab184-61f0-49a0-b076-863f10bc8c6c  205
UN  10.28.205.126  966.17 MB  256     99.9%             55bbd4b1-8036-4e32-b975-c073a7f0f47f  205
UN  10.28.205.127  699.79 MB  257     66.4%             d240c91f-4901-40ad-bd66-d374a0ccf0b9  205
[root@Config3482VM1 apache-cassandra-1.2.1]#

And yes, that is the entire content of the output from the status call, unedited. I have attached the output from nodetool ring.

To answer a couple of the questions from below from Eric:

* One data center (28)? One rack (205)? Three nodes?
Yes, that's right. We're just doing a proof of concept at the moment so this is three VMWare servers.

* How many keyspaces, and what are the replication strategies?
There is one keyspace, and it has only one CF at this point.

[default@KEYSPACE_NAME] describe;
Keyspace: KEYSPACE_NAME:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [28:2]

* TL;DR What Aaron Said(tm) In the absence of rack/dc aware replication, your allocation is suspicious.
I'm not sure what you mean by this.
Steve

-----Original Message-----
From: Eric Evans [mailto:eev...@acunu.com]
Sent: Thursday, February 07, 2013 9:56 AM
To: user@cassandra.apache.org
Subject: Re: unbalanced ring

On Wed, Feb 6, 2013 at 2:02 PM, stephen.m.thomp...@wellsfargo.com wrote:

> Thanks Aaron. I ran the cassandra-shuffle job and did a rebuild and compact on each of the nodes.
>
> [root@Config3482VM1 apache-cassandra-1.2.1]# bin/nodetool status
> Datacenter: 28
> ==============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  10.28.205.125  1.7 GB     255     33.7%             3daab184-61f0-49a0-b076-863f10bc8c6c  205
> UN  10.28.205.126  591.44 MB  256     99.9%             55bbd4b1-8036-4e32-b975-c073a7f0f47f  205
> UN  10.28.205.127  112.28 MB  257     66.4%             d240c91f-4901-40ad-bd66-d374a0ccf0b9  205

Sorry, I have to ask, Is this the complete output? Have you perhaps sanitized it in some way? It seems like there is some piece of missing context here. Can you tell us:

* Is this a cluster that was upgraded to virtual nodes (that would include a 1.2.x cluster initialized with
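[For anyone following along, a quick way to check the two yaml settings Aaron calls out, on every node. The config path is an assumption - package installs often use /etc/cassandra/cassandra.yaml:

  grep -E '^[#[:space:]]*(num_tokens|initial_token)' /etc/cassandra/cassandra.yaml
  # expect num_tokens: 256 and initial_token commented out on all three nodes
]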
Time complexity of cassandra operations
Hi, I've tried searching for this all over the place, but I can't find an answer anywhere... What is the (theoretical) time complexity of basic C* operations? I assume that single lookups are O(log(R/N)) for R rows across N nodes (as SSTable lookups should be O(log(n)) and there are R/N rows per node). Writes with consistency 1 should be O(1) as they're just appended. But what about iteration through rows? I don't think I know enough about how iteration is implemented to guess what the complexity is here.

Tim
what addresses to use in EC2 cluster (whenever an instance restarts it gets a new private ip)?
How do I configure my cluster to run in EC2? In my cassandra.yaml I have IP addresses under seed_provider, listen_address and rpc_address. I tried setting up my cluster using just the EC2 private addresses, but when one of my instances failed and I restarted it, there was a new private address. Suddenly my cluster thought it had five nodes rather than four. Then I tried using Elastic IP addresses (permanent addresses), but it turns out you get charged for network traffic between elastic addresses even if they are within the cluster. So... how do you configure the cluster when the IP addresses can change out from under you?

Thanks.

Brian Tarbox
Re: Directory structure after upgrading 1.0.8 to 1.2.1
I think it's a little more subtle than that: https://issues.apache.org/jira/browse/CASSANDRA-5242

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 8/02/2013, at 10:21 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote:

Yes, they are new directories. I did some debugging … The Cassandra code is org.apache.cassandra.db.Directories::migrateFile. It detects that the file is a manifest (based on the .json extension). But then it does not take into account that something like MyColumnFamily-old.json can exist. It then uses MyColumnFamily-old as a directory name in a call to a function destDir = getOrCreate(ksDir, dirname, additionalPath), while it should be MyColumnFamily. So I guess the cfname computation should be adapted to include the "-old.json" manifest files.

Ignace

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Friday, 8 February 2013 03:09
To: user@cassandra.apache.org
Subject: Re: Directory structure after upgrading 1.0.8 to 1.2.1

the -old.json is an artefact of Levelled Compaction. You should see a non -old file in the current CF folder. I'm not sure what would have created the -old CF dir. Does the timestamp indicate it was created at the time the server first started as a 1.2 node?

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 7/02/2013, at 10:39 PM, Desimpel, Ignace ignace.desim...@nuance.com wrote:

After upgrading from 1.0.8 I see that the directory structure has changed to something like keyspace/columnfamily (part of the 1.1.x migration). But I also see directories appearing like keyspace/columnfamily-old, and the content of that 'old' directory is only one file, columnfamily-old.json.

Questions: Should this xxx-old.json file be in the other directory? Should the extra directory xxx-old not be created? Or was that intentionally done, and is it allowed to remove these directories (manually …)?

Thanks
Cassandra benchmark
Hi - I am trying to do a benchmark using the cassandra-stress tool. They have given an example to insert data across 2 nodes:

/tools/stress/bin/stress -d 192.168.1.101,192.168.1.102 -n 1000

But when I run this across my 2 node cluster, I see the same keys in both nodes. Replication is not enabled. Should it not have unique keys in both nodes?

Thanks,

Kanwar
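[One way to sanity-check this, rather than eyeballing data files, is to ask the cluster which node owns a given stress key. A hedged sketch, assuming the old stress tool's default keyspace and column family names ("Keyspace1" / "Standard1" - check your version) and a placeholder key:

  nodetool -h 192.168.1.101 getendpoints Keyspace1 Standard1 <some-key>
  # lists the replica endpoints for that key; with RF=1 each key maps to one node
]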
Re: what addresses to use in EC2 cluster (whenever an instance restarts it gets a new private ip)?
You have to use private IPs, but if an instance dies you have to bootstrap its replacement with the replace token flag. If you use EC2 I'd recommend Netflix's Priam tool. It manages all that stuff, plus you have S3 backup.

Andrey

On Mon, Feb 11, 2013 at 11:35 AM, Brian Tarbox tar...@cabotresearch.com wrote:

How do I configure my cluster to run in EC2? In my cassandra.yaml I have IP addresses under seed_provider, listen_address and rpc_address. I tried setting up my cluster using just the EC2 private addresses, but when one of my instances failed and I restarted it, there was a new private address. Suddenly my cluster thought it had five nodes rather than four. Then I tried using Elastic IP addresses (permanent addresses), but it turns out you get charged for network traffic between elastic addresses even if they are within the cluster. So... how do you configure the cluster when the IP addresses can change out from under you?

Thanks.

Brian Tarbox
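[For reference, a minimal sketch of the replace-token start Andrey mentions, in the pre-1.2 system-property style; the token value is a placeholder:

  # start the replacement node telling it to take over the dead node's token
  cassandra -Dcassandra.replace_token=<token-of-dead-node>
]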
Re: Cassandra libraries for Golang
Hi Boris,

I use this one with Cassandra 1.2+ (you'll need to turn the native port on): https://github.com/titanous/gocql

HTH,

Ben

On Friday, 8 February 2013 at 16:40, Boris Solovyov wrote:

Hi, I'm developing a Go application. I see there is gossie, which doesn't support the native binary protocol, and there is gocql, which I tried. I wasn't able to connect to my Cassandra server. I got EOF after a timeout. I didn't investigate it much further. I wanted to ask the list, what is the status of Cassandra libraries for Go? Is anyone using one of them successfully in production? Does it really matter whether I use the new native protocol?

- Boris
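[To "turn the native port on" in the 1.2 era, these are the relevant cassandra.yaml settings - in early 1.2 releases the native transport shipped disabled, and 9042 is the default port:

  start_native_transport: true
  native_transport_port: 9042
]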
Re: Cassandra 1.1.2 - 1.1.8 upgrade
So upgradesstables is recommended as part of the upgrade to 1.1.3 if you are using counter columns. Also, there was a general recommendation (in another response to my question) to run upgradesstables because of:

"upgradesstables always needs to be done between majors. While 1.1.2 - 1.1.8 is not a major, due to an unforeseen bug in the conversion to microseconds you'll need to run upgradesstables."

Is this referring to: https://issues.apache.org/jira/browse/CASSANDRA-4432 ?

Does anyone know the impact of not running upgradesstables? Or possibly of not running it for several days?

Thanks,
-Mike

On 2/10/2013 3:27 PM, aaron morton wrote:

I would do #1. You can play with nodetool setcompactionthroughput to speed things up, but beware nothing comes for free.

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 6:40 AM, Mike mthero...@yahoo.com wrote:

Thank you,

Another question on this topic. Upgrading from 1.1.2-1.1.9 requires running upgradesstables, which will take many hours on our dataset (about 12). For this upgrade, is it recommended that I:

1) Upgrade all the DB nodes to 1.1.9 first, then go around the ring and run a staggered upgrade of the sstables over a number of days.
2) Upgrade one node at a time, running the cluster in a mixed 1.1.2-1.1.9 configuration for a number of days.

I would prefer #1, as with #2, streaming will not work until all the nodes are upgraded.

I appreciate your thoughts,
-Mike

On 1/16/2013 11:08 AM, Jason Wee wrote:

Always check NEWS.txt. For instance, for cassandra 1.1.3 you need to run nodetool upgradesstables if your cf has counters.

On Wed, Jan 16, 2013 at 11:58 PM, Mike mthero...@yahoo.com wrote:

Hello,

We are looking to upgrade our Cassandra cluster from 1.1.2 - 1.1.8 (or possibly 1.1.9 depending on timing). It is my understanding that rolling upgrades of Cassandra are supported, so as we upgrade our cluster, we can do so one node at a time without experiencing downtime. Has anyone had any gotchas recently that I should be aware of before performing this upgrade? In order to upgrade, is the only thing that needs to change the JAR files? Can everything remain as-is?

Thanks,
-Mike
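[As a concrete sketch of option #1 from the thread - upgrade the binaries everywhere first, then rewrite sstables in a staggered fashion. Host names are illustrative; run roughly one node at a time, spread over days:

  # after all nodes are running 1.1.9
  nodetool -h node1 upgradesstables
  # wait, watch load and compaction, then repeat on node2, node3, ...
]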
Re: Upgrade to Cassandra 1.2
Thanks Aaron. I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed.

- I followed http://www.datastax.com/docs/1.2/install/upgrading, and merged cassandra.yaml with the following parameters:

num_tokens: 256
#initial_token: 0

The initial_token is commented out; the current token should be obtained from the system schema.

- I did a rolling upgrade. During the upgrade, I got Broken Pipe errors from the nodes with the old version, is that normal?

- After I upgraded 3 nodes (still have 5 to go), I found it is totally wrong: the first node upgraded owns 99.2% of the ring.

[cassy@d5:/usr/local/cassy conf]$ ~/bin/nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns   Host ID                               Rack
DN  10.210.101.117  45.01 GB  254     99.2%  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN  10.210.101.120  45.43 GB  256     0.4%   0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN  10.210.101.111  27.08 GB  256     0.4%   bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1

What was wrong? Please help. I can provide more information if you need.

Thanks,

Daning

On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote:

There is a command line utility in 1.2 to shuffle the tokens…

http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

$ ./cassandra-shuffle --help
Missing sub-command argument.
Usage: shuffle [options] <sub-command>

Sub-commands:
 create       Initialize a new shuffle operation
 ls           List pending relocations
 clear        Clear pending relocations
 en[able]     Enable shuffling
 dis[able]    Disable shuffling

Options:
 -dc,  --only-dc        Apply only to named DC (create only)
 -tp,  --thrift-port    Thrift port number (Default: 9160)
 -p,   --port           JMX port number (Default: 7199)
 -tf,  --thrift-framed  Enable framed transport for Thrift (Default: false)
 -en,  --and-enable     Immediately enable shuffling (create only)
 -H,   --help           Print help information
 -h,   --host           JMX hostname or IP address (Default: localhost)
 -th,  --thrift-host    Thrift hostname or IP address (Default: JMX host)

Cheers
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote:

On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote:

I'd like to upgrade from 1.1.6 to 1.2.1. One big feature in 1.2 is that it can have multiple tokens in one node, but there is only one token in 1.1.6. How can I upgrade to 1.2.1 and then break up the token to take advantage of this feature?

I went through this doc but it does not say how to change the num_tokens: http://www.datastax.com/docs/1.2/install/upgrading

Is there another doc about this upgrade path?

Thanks,

Daning

I think for each node you need to change the num_tokens option in conf/cassandra.yaml (this only splits the current range into num_tokens parts) and run the bin/cassandra-shuffle command (this spreads it all over the ring).
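[For anyone reading later, a hedged sketch of the shuffle workflow using the sub-commands from the help text above. The host is illustrative, and shuffle moves a lot of data - read the linked blog post first:

  ./cassandra-shuffle -h localhost create    # schedule the token relocations
  ./cassandra-shuffle -h localhost enable    # start shuffling
  ./cassandra-shuffle -h localhost ls        # watch pending relocations drain
]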
Cassandra 1.2 Atomic Batches and Thrift API
Hey Guys, Is the new atomic batch feature in Cassandra 1.2 available via the thrift API? If so, how can I use it? -- Drew
Spike in latency, one node keeps firing "Interval min > max" errors
Hi there,

I have a cluster of three nodes running Cassandra 1.2.0. I received alerts from my monitoring, and then discovered this huge spike in cluster latency:

https://dl.dropbox.com/u/3444322/Screen%20Shot%202013-02-12%20at%205.07.49%20PM.png

Investigating what is going on: there is no load on any node, iostat shows nothing more than idle operations, and I've restarted all nodes. In the system.log I keep noticing this on ONE node only:

==> /var/log/cassandra/system.log <==
ERROR [ReadStage:563] 2013-02-12 04:31:30,013 CassandraDaemon.java (line 133) Exception in thread Thread[ReadStage:563,5,main]
java.lang.AssertionError: Interval min > max
    at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:250)
    at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
    at org.apache.cassandra.utils.IntervalTree.build(IntervalTree.java:81)
    at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:175)
    at org.apache.cassandra.db.AbstractThreadUnsafeSortedColumns.delete(AbstractThreadUnsafeSortedColumns.java:40)
    at org.apache.cassandra.db.AbstractColumnContainer.delete(AbstractColumnContainer.java:51)
    at org.apache.cassandra.db.ColumnFamily.addAtom(ColumnFamily.java:224)
    at org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:182)
    at org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:154)
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:143)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.<init>(MergeIterator.java:86)
    at org.apache.cassandra.utils.MergeIterator.get(MergeIterator.java:45)
    at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:134)
    at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:286)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:61)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1222)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1134)
    at org.apache.cassandra.db.Table.getRow(Table.java:348)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1048)
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1506)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

--
Cheers,

Drew Broadley

Broadley Speaking :)
e: d...@broadley.org.nz
p: +64 (0)21 519 711
m: P O Box 488, Wellington, New Zealand
w: http://blog.drew.broadley.org.nz/
ln: http://nz.linkedin.com/in/drewbroadley
Re: Estimating write throughput with LeveledCompactionStrategy
Yup, we set it to 100M. Currently we have around 1TB of data per node (getting to level 5 now) + data pieces are rather large (small tables would flush more often). Yes, you're right, it's slower, thus building mental models is more time effective than experimenting :)

Ivan

2013/2/6 Wei Zhu wz1...@yahoo.com:

I have been struggling with the LCS myself. I observed that for the higher level compactions (from level 4 to 5) it involves many more SSTables than compacting from a lower level. One compaction could take an hour or more. By the way, you set your SSTable size to be 100M?

Thanks.
-Wei

From: Ивaн Cобoлeв sobol...@gmail.com
To: user@cassandra.apache.org
Sent: Wednesday, February 6, 2013 2:42 AM
Subject: Estimating write throughput with LeveledCompactionStrategy

Dear Community,

Could anyone please give me a hand with understanding what I am missing while trying to model how LeveledCompactionStrategy works:
https://docs.google.com/spreadsheet/ccc?key=0AvNacZ0w52BydDQ3N2ZPSks2OHR1dlFmMVV4d1E2eEE#gid=0

Logs mostly contain something like this:

INFO [CompactionExecutor:2235] 2013-02-06 02:32:29,758 CompactionTask.java (line 221) Compacted to [chunks-hf-285962-Data.db,chunks-hf-285963-Data.db,chunks-hf-285964-Data.db,chunks-hf-285965-Data.db,chunks-hf-285966-Data.db,chunks-hf-285967-Data.db,chunks-hf-285968-Data.db,chunks-hf-285969-Data.db,chunks-hf-285970-Data.db,chunks-hf-285971-Data.db,chunks-hf-285972-Data.db,chunks-hf-285973-Data.db,chunks-hf-285974-Data.db,chunks-hf-285975-Data.db,chunks-hf-285976-Data.db,chunks-hf-285977-Data.db,chunks-hf-285978-Data.db,chunks-hf-285979-Data.db,chunks-hf-285980-Data.db,]. 2,255,863,073 to 1,908,460,931 (~84% of original) bytes for 36,868 keys at 14.965795MB/s. Time: 121,614ms.

Thus the spreadsheet is parameterized with a throughput of 15MB/s and a survivor ratio of 0.9.

1) The projected result actually differs from what I observe - what am I missing?
2) Are there any metrics on write throughput with LCS per node anyone could possibly share?

Thank you very much in advance,
Ivan
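[For what it's worth, the throughput figure in that log line can be reproduced as the compacted (output) size divided by elapsed time:

\[
\frac{1{,}908{,}460{,}931\ \text{bytes}}{1{,}048{,}576\ \tfrac{\text{bytes}}{\text{MiB}} \times 121.614\ \text{s}} \approx 14.97\ \text{MB/s}
\]

so the ~15MB/s spreadsheet parameter measures the rate at which compacted data is written, not the raw read rate over the 2,255,863,073 input bytes.]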