Re: Cassandra process exiting mysteriously
Clint, did you find anything? I just noticed it happens to us too, on only one node in our CI cluster. I don't think there is any special usage before it happens... The last line in the log before the shutdown lines is at least an hour earlier. We're using C* 2.0.9.

On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Rob, Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine. Best regards, Clint

On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

this doesn't look like an OOM to me. If the kernel OOM kills Cassandra then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown; this is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra. =Rob

-- Or Sher
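The kernel-OOM vs JVM-OOM distinction above can be checked against the system logs. A minimal sketch of scanning dmesg-style output for OOM-killer evidence (the message substrings below are typical of Linux kernels but vary by version, so treat them as assumptions):

```python
# Sketch: decide whether the kernel OOM killer hit a process, based on
# dmesg-style log text. The substrings are typical Linux OOM-killer
# messages, but exact formats differ across kernel versions.
OOM_PATTERNS = ("invoked oom-killer", "Out of memory: Kill process")

def kernel_oom_killed(dmesg_text, process_name="java"):
    """Return True if the kernel log suggests the OOM killer hit process_name."""
    for line in dmesg_text.splitlines():
        if any(p in line for p in OOM_PATTERNS) and process_name in line:
            return True
    return False

# Hypothetical sample line for illustration:
sample = "Out of memory: Kill process 12345 (java) score 987 or sacrifice child"
```

If this returns False but the Cassandra log shows an orderly shutdown, the JVM-level OOM path described in CASSANDRA-7507 is the more likely explanation.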
Cassandra schema disagreement
Hello, I have a cluster running and I'm trying to change the schema on it. Although it succeeds on one cluster (a test one), on another it keeps creating two separate schema versions (both are 2-DC configurations; the cluster where it goes wrong ends up with one schema version per DC). I use apache-cassandra11-1.1.12 on CentOS 6.4. I'm starting from a fresh cassandra config (doing rm -rf /var/lib/cassandra/{commitlog,data}/* while cassandra is stopped). Each DC is on a separate IP segment but there is no firewall between them. Here is the output of the command when the desynchronisation occurs:

---
[root@cassandranode00 CDN]# cassandra-cli -f reCreateCassandraStruct.sh
Connected to: TTF Cluster v2013_1257 on 127.0.0.1/9160
7ef8c681-189a-3088-8598-560437f705d9
Waiting for schema agreement...
... schemas agree across the cluster
Authenticated to keyspace: ks1
f179fd8e-f8ca-36cf-bf53-d8341fd6006e
Waiting for schema agreement...
The schema has not settled in 10 seconds; further migrations are ill-advised until it does.
Versions are f179fd8e-f8ca-36cf-bf53-d8341fd6006e:[10.69.221.20, 10.69.221.21, 10.69.221.22], e9656b30-b671-3fce-9fb4-bdd3e6da36d1:[10.69.10.14, 10.69.10.13, 10.69.10.11]
---

I also tried creating a keyspace with a column family using OpsCenter (with no better result). I'm out of ideas about where to look. Do you have any suggestions? Are there any improvements in this area with cassandra 1.1.12?

Thanks, Jonathan DEMEYER

Here is the start of reCreateCassandraStruct.sh:

CREATE KEYSPACE ks1 WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options={DC1:3,DC2:3};
use ks1;
create column family id
  with comparator = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and column_metadata = [ { column_name : 'user', validation_class : UTF8Type } ];
CREATE KEYSPACE ks2 WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options={DC1:3,DC2:3};
use ks2;
create column family id;
Cassandra corrupt column family
Hello all, I have altered a table in Cassandra and on one node it somehow got corrupted. The changes did not propagate correctly. I ran repair keyspace columnfamily... nothing changed... Is there a way to repair this?
Replacing a dead node in Cassandra 2.0.8
In the datastax documentation there is a description of how to replace a dead node (http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html). Is the replace_address option required even if the IP address of the new node is the same as the original one (I read a note about the auto bootstrapping being stored somewhere in the system tables)?
RE: Cassandra schema disagreement
After a lot of investigation, it seems that the clocks were desynchronized across the cluster (although we did not check that resyncing them resolves the problem; we modified the schema with only one node up and restarted all the other nodes afterwards).

From: Demeyer Jonathan [mailto:jonathan.deme...@macq.eu]
Sent: Tuesday, August 12, 2014 11:03
To: user@cassandra.apache.org
Subject: Cassandra schema disagreement

Hello, I have a cluster running and I'm trying to change the schema on it. Although it succeeds on one cluster (a test one), on another it keeps creating two separate schema versions (both are 2-DC configurations; the cluster where it goes wrong ends up with one schema version per DC). I use apache-cassandra11-1.1.12 on CentOS 6.4. I'm starting from a fresh cassandra config (doing rm -rf /var/lib/cassandra/{commitlog,data}/* while cassandra is stopped). Each DC is on a separate IP segment but there is no firewall between them. Here is the output of the command when the desynchronisation occurs:

---
[root@cassandranode00 CDN]# cassandra-cli -f reCreateCassandraStruct.sh
Connected to: TTF Cluster v2013_1257 on 127.0.0.1/9160
7ef8c681-189a-3088-8598-560437f705d9
Waiting for schema agreement...
... schemas agree across the cluster
Authenticated to keyspace: ks1
f179fd8e-f8ca-36cf-bf53-d8341fd6006e
Waiting for schema agreement...
The schema has not settled in 10 seconds; further migrations are ill-advised until it does.
Versions are f179fd8e-f8ca-36cf-bf53-d8341fd6006e:[10.69.221.20, 10.69.221.21, 10.69.221.22], e9656b30-b671-3fce-9fb4-bdd3e6da36d1:[10.69.10.14, 10.69.10.13, 10.69.10.11]
---

I also tried creating a keyspace with a column family using OpsCenter (with no better result). I'm out of ideas about where to look. Do you have any suggestions? Are there any improvements in this area with cassandra 1.1.12?
Thanks, Jonathan DEMEYER

Here is the start of reCreateCassandraStruct.sh:

CREATE KEYSPACE ks1 WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options={DC1:3,DC2:3};
use ks1;
create column family id
  with comparator = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and column_metadata = [ { column_name : 'user', validation_class : UTF8Type } ];
CREATE KEYSPACE ks2 WITH placement_strategy = 'NetworkTopologyStrategy' AND strategy_options={DC1:3,DC2:3};
use ks2;
create column family id;
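Since clock desynchronization turned out to be the culprit here, a quick sanity check before running migrations is to sample "now" on every node (via SSH, NTP tooling, or a driver call) and compare. A minimal sketch of the comparison step, assuming you have already collected one timestamp per node (node addresses and values below are illustrative):

```python
from datetime import datetime

def max_clock_skew(node_times):
    """Given {node: datetime sampled at (roughly) the same moment},
    return the worst-case skew in seconds between any two nodes."""
    times = sorted(node_times.values())
    return (times[-1] - times[0]).total_seconds()

# Hypothetical sample: one DC's clocks run 90 seconds ahead of the other's.
sample = {
    "10.69.221.20": datetime(2014, 8, 12, 11, 3, 0),
    "10.69.10.14": datetime(2014, 8, 12, 11, 4, 30),
}
```

A skew of more than a second or two across nodes is worth fixing (e.g. with NTP) before attempting further schema changes.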
Re: Cassandra corrupt column family
Hi, Without more information (Cassandra version, setup, topology, schema, queries performed) this list won't be able to assist you. If you can provide a more detailed explanation of the steps you took to reach your current state, that would be great. Mark

On Tue, Aug 12, 2014 at 12:21 PM, Batranut Bogdan batra...@yahoo.com wrote:

Hello all, I have altered a table in Cassandra and on one node it somehow got corrupted. The changes did not propagate correctly. I ran repair keyspace columnfamily... nothing changed... Is there a way to repair this?
Re: clarification on 100k tombstone limit in indexes
Hello Ian

So that way each index entry *will* have quite a few entries and the index as a whole won't grow too big. Is my thinking correct here? -- In this case, yes. Do not forget that for each date value, there will be 1 corresponding index value + 10 updates. If you have an approximate count for a few entries, some quick math should give you an idea of how large the index partition is.

I had considered an approach like this but my concern is that for any given minute *all* of the updates will be handled by a single node, right? -- If your time resolution is a minute, yes, it will be a problem. And depending on the insert rate, it can quickly become a bottleneck during that minute. The manual index approach suffers a lot from this bottleneck issue under heavy workload; that's the main reason they implemented a distributed secondary index. There is no free lunch though. What you gain in terms of control and tuning with the manual index, you lose on the load-distribution side.

On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

Hi DuyHai, Thanks for the detailed response! A few responses below:

On a side note, your usage of a secondary index is not the best one. Indeed, indexing the update date will lead to a situation where for one date, you'll mostly have one or a few matching items (assuming that the update date resolution is small enough and the update rate is not intense). -- I should have mentioned this originally (slipped my mind) but to deal specifically with this problem I had planned to use a timestamp with a resolution of 1 minute (like your minute_bucket). So that way each index entry *will* have quite a few entries and the index as a whole won't grow too big. Is my thinking correct here?

You'd be better off creating a manual reverse-index to track modification date, something like this -- I had considered an approach like this but my concern is that for any given minute *all* of the updates will be handled by a single node, right?
For example, if the minute_bucket is 2739 then for that one minute, every single item update will flow to the node at HASH(2739). Assuming I am thinking about that right, that seemed like a potential scaling bottleneck, which scared me off that approach. Cheers, Ian

On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

Hello Ian

It sounds like this 100k limit is, indeed, a global limit as opposed to a per-row limit -- The threshold applies to each REQUEST, not to a partition or globally. The threshold does not apply to a partition (physical row) simply because in one request you can fetch data from many partitions (multiget slice). There was a JIRA about this here: https://issues.apache.org/jira/browse/CASSANDRA-6865

Are these tombstones ever GCed out of the index? -- Yes they are, during compactions of the index column family.

How frequently? -- That's the real pain. Indeed you do not have any control over the tuning of secondary index CF compaction. As far as I know, the compaction settings (strategy, min/max thresholds...) are inherited from those of the base table.

Now, looking very quickly at your data model, it seems that you have a skinny partition pattern. Since you mentioned that the date is updated only 10 times max, you should not run into the tombstone threshold issue.

On a side note, your usage of a secondary index is not the best one. Indeed, indexing the update date will lead to a situation where for one date, you'll mostly have one or a few matching items (assuming that the update date resolution is small enough and the update rate is not intense). It is the high-cardinality scenario to be avoided (http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html). Plus, the query on the index (find all items where last_updated [now - 30 minutes]) makes things worse since it is not an exact match but an inequality.
You'd be better off creating a manual reverse-index to track modification date, something like this:

CREATE TABLE last_updated_item (
    minute_bucket int, // format MMDDHHmm
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY(minute_bucket, last_update_date)
);

The last_update_date column is quite self-explanatory. The minute_bucket is trickier. The idea is to split time into 30-minute buckets. 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole day, you'd have 48 buckets. We need to put data into buckets to avoid ultra-wide rows, since you mentioned that there are 10 items (so 10 updates) / sec. Of course, 30 mins is just an example; you can tune it down to a window of 5 minutes or 1 minute, depending on the insertion rate.

On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

Hi Mark - Thanks for the clarification but as I'm not too familiar with the nuts and bolts of Cassandra I'm not sure how to apply that info to my current situation. It sounds like this
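A sketch of how such bucket ids could be computed on the client side, following the "00:00 to 00:30 is bucket 1" convention from the message above (the exact encoding is a design choice, not something the thread prescribes):

```python
from datetime import datetime

def minute_bucket(ts, window_minutes=30):
    """Map a timestamp to its bucket within the day: with a 30-minute
    window, 00:00-00:30 -> 1, 00:30-01:00 -> 2, ..., 23:30-24:00 -> 48."""
    minutes_since_midnight = ts.hour * 60 + ts.minute
    return minutes_since_midnight // window_minutes + 1
```

Shrinking `window_minutes` trades wider row spread for more partitions to scan at read time, which is exactly the tuning knob the reply describes.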
Re: Node bootstrap
Still having issues with node bootstrapping. The new node just died because it full-GCed, and the nodes it had active streams with noticed it was down. After the full GC finished, the new node printed this log:

ERROR 02:52:36,259 Stream failed because /10.10.20.35 died or was restarted/removed (streams may still be active in background, but further streams won't be started)

Here 10.10.20.35 is an existing node the new guy was streaming from. A similar log was printed for every other node in the cluster. Why did the new node just exit after the full GC pause? We have heap dumps enabled on full GCs and these are the top offenders on the new node. A new entry that I noticed is the CompressionMetadata chunks. Anything I can do to optimize that?

num   #instances   #bytes       class name
--
1:    42508421     4818885752   [B
2:    65860543     3161306064   java.nio.HeapByteBuffer
3:    124361093    2984666232   org.apache.cassandra.io.compress.CompressionMetadata$Chunk
4:    29745665     1427791920   edu.stanford.ppl.concurrent.SnapTreeMap$Node
5:    29810362     953931584    org.apache.cassandra.db.Column
6:    31623        498012768    [Lorg.apache.cassandra.io.compress.CompressionMetadata$Chunk;

On Tue, Aug 5, 2014 at 2:59 PM, Ruchir Jha ruchir@gmail.com wrote:

Also, right now the top command shows that we are at 500-700% CPU, and we have 23 total processors, which means we have a lot of idle CPU left over, so throwing more threads at compaction and flush should alleviate the problem?

On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote:

Right now, we have 6 flush writers and compaction_throughput_mb_per_sec is set to 0, which I believe disables throttling.
Also, here is the iostat -x 5 5 output:

Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
sda      10.00   1450.35   50.79   55.92     9775.97   12030.14   204.34    1.56      14.62   1.05   11.21
dm-0     0.00    0.00      3.59    18.82     166.52    150.35     14.14     0.44      19.49   0.54   1.22
dm-1     0.00    0.00      2.32    5.37      18.56     42.98      8.00      0.76      98.82   0.43   0.33
dm-2     0.00    0.00      162.17  5836.66   32714.46  47040.87   13.30     5.57      0.90    0.06   36.00
sdb      0.40    4251.90   106.72  107.35    23123.61  35204.09   272.46    4.43      20.68   1.29   27.64

avg-cpu: %user  %nice  %system  %iowait  %steal  %idle
         14.64  10.75  1.81     13.50    0.00    59.29

Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await   svctm  %util
sda      15.40   1344.60   68.80   145.60    4964.80   11790.40   78.15     0.38      1.80    0.80   17.10
dm-0     0.00    0.00      43.00   1186.20   2292.80   9489.60    9.59      4.88      3.90    0.09   11.58
dm-1     0.00    0.00      1.60    0.00      12.80     0.00       8.00      0.03      16.00   2.00   0.32
dm-2     0.00    0.00      197.20  17583.80  35152.00  140664.00  9.89      2847.50   109.52  0.05   93.50
sdb      13.20   16552.20  159.00  742.20    32745.60  129129.60  179.62    72.88     66.01   1.04   93.42

avg-cpu: %user  %nice  %system  %iowait  %steal  %idle
         15.51  19.77  1.97     5.02     0.00    57.73

Device:  rrqm/s  wrqm/s   r/s     w/s      rsec/s    wsec/s    avgrq-sz  avgqu-sz  await   svctm  %util
sda      16.20   523.40   60.00   285.00   5220.80   5913.60   32.27     0.25      0.72    0.60   20.86
dm-0     0.00    0.00     0.80    1.40     32.00     11.20     19.64     0.01      3.18    1.55   0.34
dm-1     0.00    0.00     1.60    0.00     12.80     0.00      8.00      0.03      21.00   2.62   0.42
dm-2     0.00    0.00     339.40  5886.80  66219.20  47092.80  18.20     251.66    184.72  0.10   63.48
sdb      1.00    5025.40  264.20  209.20   60992.00  50422.40  235.35    5.98      40.92   1.23   58.28

avg-cpu: %user  %nice  %system  %iowait  %steal  %idle
         16.59  16.34  2.03     9.01     0.00    56.04

Device:  rrqm/s  wrqm/s    r/s     w/s       rsec/s    wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
sda      5.40    320.00    37.40   159.80    2483.20   3529.60    30.49     0.10      0.52   0.39   7.76
dm-0     0.00    0.00      0.20    3.60      1.60      28.80      8.00      0.00      0.68   0.68   0.26
dm-1     0.00    0.00      0.00    0.00      0.00      0.00       0.00      0.00      0.00   0.00   0.00
dm-2     0.00    0.00      287.20  13108.20  53985.60  104864.00  11.86     869.18    48.82  0.06   76.96
sdb      5.20    12163.40  238.20  532.00    51235.20  93753.60   188.25    21.46     23.75  0.97   75.08

On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy mark.re...@boxever.com wrote:

Hi Ruchir, With the large number of blocked flushes and the number of pending compactions, this would still indicate IO contention. Can you post the output of 'iostat -x 5 5'
Re: clarification on 100k tombstone limit in indexes
Makes sense - thanks again!

On Tue, Aug 12, 2014 at 9:45 AM, DuyHai Doan doanduy...@gmail.com wrote:

Hello Ian

So that way each index entry *will* have quite a few entries and the index as a whole won't grow too big. Is my thinking correct here? -- In this case, yes. Do not forget that for each date value, there will be 1 corresponding index value + 10 updates. If you have an approximate count for a few entries, some quick math should give you an idea of how large the index partition is.

I had considered an approach like this but my concern is that for any given minute *all* of the updates will be handled by a single node, right? -- If your time resolution is a minute, yes, it will be a problem. And depending on the insert rate, it can quickly become a bottleneck during that minute. The manual index approach suffers a lot from this bottleneck issue under heavy workload; that's the main reason they implemented a distributed secondary index. There is no free lunch though. What you gain in terms of control and tuning with the manual index, you lose on the load-distribution side.

On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

Hi DuyHai, Thanks for the detailed response! A few responses below:

On a side note, your usage of a secondary index is not the best one. Indeed, indexing the update date will lead to a situation where for one date, you'll mostly have one or a few matching items (assuming that the update date resolution is small enough and the update rate is not intense). -- I should have mentioned this originally (slipped my mind) but to deal specifically with this problem I had planned to use a timestamp with a resolution of 1 minute (like your minute_bucket). So that way each index entry *will* have quite a few entries and the index as a whole won't grow too big. Is my thinking correct here?
You'd be better off creating a manual reverse-index to track modification date, something like this -- I had considered an approach like this but my concern is that for any given minute *all* of the updates will be handled by a single node, right? For example, if the minute_bucket is 2739 then for that one minute, every single item update will flow to the node at HASH(2739). Assuming I am thinking about that right, that seemed like a potential scaling bottleneck, which scared me off that approach. Cheers, Ian

On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

Hello Ian

It sounds like this 100k limit is, indeed, a global limit as opposed to a per-row limit -- The threshold applies to each REQUEST, not to a partition or globally. The threshold does not apply to a partition (physical row) simply because in one request you can fetch data from many partitions (multiget slice). There was a JIRA about this here: https://issues.apache.org/jira/browse/CASSANDRA-6865

Are these tombstones ever GCed out of the index? -- Yes they are, during compactions of the index column family.

How frequently? -- That's the real pain. Indeed you do not have any control over the tuning of secondary index CF compaction. As far as I know, the compaction settings (strategy, min/max thresholds...) are inherited from those of the base table.

Now, looking very quickly at your data model, it seems that you have a skinny partition pattern. Since you mentioned that the date is updated only 10 times max, you should not run into the tombstone threshold issue.

On a side note, your usage of a secondary index is not the best one. Indeed, indexing the update date will lead to a situation where for one date, you'll mostly have one or a few matching items (assuming that the update date resolution is small enough and the update rate is not intense). It is the high-cardinality scenario to be avoided (http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
Plus, the query on the index (find all items where last_updated [now - 30 minutes]) makes things worse since it is not an exact match but an inequality.

You'd be better off creating a manual reverse-index to track modification date, something like this:

CREATE TABLE last_updated_item (
    minute_bucket int, // format MMDDHHmm
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY(minute_bucket, last_update_date)
);

The last_update_date column is quite self-explanatory. The minute_bucket is trickier. The idea is to split time into 30-minute buckets. 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2, and so on. For a whole day, you'd have 48 buckets. We need to put data into buckets to avoid ultra-wide rows, since you mentioned that there are 10 items (so 10 updates) / sec. Of course, 30 mins is just an example; you can tune it down to a window of 5 minutes or 1 minute, depending on the insertion rate.

On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

Hi Mark - Thanks for the clarification but as I'm not too
Re: Cassandra process exiting mysteriously
Hi Or, For now I removed the test that was failing like this from our suite and made a note to revisit it in a couple of weeks. Unfortunately I still don't know what the issue is. I'll post here if I figure it out (please do the same!). My working hypothesis now is that we had some kind of OOM problem. Best regards, Clint

On Tue, Aug 12, 2014 at 12:23 AM, Or Sher or.sh...@gmail.com wrote:

Clint, did you find anything? I just noticed it happens to us too, on only one node in our CI cluster. I don't think there is any special usage before it happens... The last line in the log before the shutdown lines is at least an hour earlier. We're using C* 2.0.9.

On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

Hi Rob, Thanks for the clarification; this is really useful. I'll run some experiments to see if the problem is a JVM OOM on our build machine. Best regards, Clint

On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com wrote:

this doesn't look like an OOM to me. If the kernel OOM kills Cassandra then Cassandra instantly vaporizes, and there will be nothing in the Cassandra logs (you will find information about the OOM in the system logs though, e.g. in dmesg). In the log snippet above you see an orderly shutdown; this is completely different from the instant OOM kill.

Not really. https://issues.apache.org/jira/browse/CASSANDRA-7507

To be clear, there are two different OOMs here; I am talking about the JVM OOM, not the system-level one. As CASSANDRA-7507 indicates, a JVM OOM does not necessarily result in the cassandra process dying, and can in fact trigger a clean shutdown. A system-level OOM will in fact send the equivalent of KILL, which will not trigger the clean shutdown hook in Cassandra. =Rob

-- Or Sher
OOM(Java heap space) on start-up during commit log replaying
Hi all,

We have a node with a commit log directory of ~4G. During start-up of the node, while the commit log is being replayed, the used heap space grows constantly, ending with an OOM error. The heap size and new heap size properties are 1G and 256M. We are using the default settings for commitlog_sync, commitlog_sync_period_in_ms and commitlog_segment_size_in_mb.

The log shows that cassandra is stuck on MutationStage:

Active  Pending  Completed  Blocked
16      385      196        0

The stack trace is:

ERROR [metrics-meter-tick-thread-1] 2014-08-12 19:15:10,181 CassandraDaemon.java (line 198) Exception in thread Thread[metrics-meter-tick-thread-1,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.addWaiter(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.offer(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.add(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.add(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor.reExecutePeriodic(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
ERROR [MutationStage:8] 2014-08-12 19:15:10,181 CassandraDaemon.java (line 198) Exception in thread Thread[MutationStage:8,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.duplicate(Unknown Source)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:62)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:99)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:35)
    at org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:188)
    at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:219)
    at org.apache.cassandra.db.AtomicSortedColumns.addAllWithSizeDelta(AtomicSortedColumns.java:184)
    at org.apache.cassandra.db.Memtable.resolve(Memtable.java:226)
    at org.apache.cassandra.db.Memtable.put(Memtable.java:173)
    at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:893)
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:368)
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:333)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:352)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
ERROR [MutationStage:8] 2014-08-12 19:15:12,080 CassandraDaemon.java (line 198) Exception in thread Thread[MutationStage:8,5,main]
java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Unknown Source)
    at org.apache.cassandra.service.CassandraDaemon$2.uncaughtException(CassandraDaemon.java:204)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.handleOrLog(DebuggableThreadPoolExecutor.java:220)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.logExceptionsAfterExecute(DebuggableThreadPoolExecutor.java:203)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Increasing the heap space to 2G solves the problem, but we want to know if it can be solved without increasing the heap space. Has anyone experienced a similar problem? If so, are there any tuning options in cassandra.yaml? Any help will be much appreciated. If you need more information feel free to ask.

Thanks, Jivko Donev
Number of columns per row for composite columns?
Hi everyone, I'm confused about the number of columns in a row of Cassandra; as far as I know there is a limit of 2 billion columns per row. Given that, if I have a composite column name in each row, for example (timestamp, userid), then is the number of columns per row the number of distinct 'timestamp' values, or is each distinct (timestamp, userid) a column?
Re: OOM(Java heap space) on start-up during commit log replaying
On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

We have a node with a commit log directory of ~4G. During start-up of the node, while the commit log is being replayed, the used heap space grows constantly, ending with an OOM error. The heap size and new heap size properties are 1G and 256M. We are using the default settings for commitlog_sync, commitlog_sync_period_in_ms and commitlog_segment_size_in_mb.

What version of Cassandra? 1G is tiny for a Cassandra heap. There is a direct relationship between the data in the commitlog and memtables and in the heap. You almost certainly need more heap or less commitlog. =Rob
Re: Replacing a dead node in Cassandra 2.0.8
On Tue, Aug 12, 2014 at 4:33 AM, tsi thorsten.s...@t-systems.com wrote:

In the datastax documentation there is a description of how to replace a dead node (http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html). Is the replace_address option required even if the IP address of the new node is the same as the original one (I read a note about the auto bootstrapping being stored somewhere in the system tables)?

In order for the node to bootstrap into ranges the rest of the cluster thinks it already owns, you will need to provide the IP in replace_address. This allows it to start up in a special way that is effectively "bootstrap to the same tokens it previously had". =Rob
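For reference, a minimal sketch of how the option is typically passed as a JVM system property (the IP address below is a placeholder; whether you put this in cassandra-env.sh or on the startup command line depends on your install):

```shell
# Sketch: pass the dead node's address via the replace_address system property.
# 10.69.10.14 is a placeholder -- use the address of the node being replaced,
# even if the replacement node reuses that same address.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.69.10.14"
echo "$JVM_OPTS"
```

Remove the flag again after the replacement node has finished bootstrapping, so it is not applied on subsequent restarts.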
Re: clarification on 100k tombstone limit in indexes
On Mon, Aug 11, 2014 at 4:17 PM, Ian Rose ianr...@fullstory.com wrote:

You'd be better off creating a manual reverse-index to track modification date, something like this -- I had considered an approach like this but my concern is that for any given minute *all* of the updates will be handled by a single node, right? For example, if the minute_bucket is 2739 then for that one minute, every single item update will flow to the node at HASH(2739). Assuming I am thinking about that right, that seemed like a potential scaling bottleneck, which scared me off that approach.

If you're concerned about bottlenecking on one node (or set of replicas) during the minute, add an additional integer column to the partition key (making it a composite partition key if it isn't already). When inserting, randomly pick a value between, say, 0 and 9 to use for this column. When reading, read all 10 partitions and merge them. (Alternatively, instead of using a random number, you could hash the other key components and use the lowest bits for the value. This has the advantage of being deterministic.)

-- Tyler Hobbs DataStax http://datastax.com/
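A sketch of the sharding idea above, showing both the random and the deterministic-hash variants (all names are illustrative; SHARDS=10 mirrors the "read all 10 partitions" suggestion):

```python
import hashlib
import random

SHARDS = 10  # number of sub-partitions per minute bucket

def write_shard_random():
    """Variant 1: random shard -- spreads writes across SHARDS partitions,
    at the cost of readers having to scan every shard."""
    return random.randrange(SHARDS)

def write_shard_hashed(item_id):
    """Variant 2: deterministic shard derived from the other key components
    (here: the item id), using the lowest bits of a hash."""
    digest = hashlib.md5(item_id.encode()).digest()
    return digest[-1] % SHARDS

def partitions_to_read(minute_bucket):
    """Readers merge all shards of the bucket: (minute_bucket, shard) pairs."""
    return [(minute_bucket, shard) for shard in range(SHARDS)]
```

The deterministic variant keeps all updates for one item in one shard, which can simplify deduplication at read time; the random variant gives a slightly more even spread under skewed item activity.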
Re: OOM(Java heap space) on start-up during commit log replaying
Hi Robert, Thanks for your reply. The Cassandra version is 2.0.7. Is there a commonly used rule for determining the commitlog and memtable sizes depending on the heap size? What would be the main disadvantage of having a smaller commitlog?

On Tuesday, August 12, 2014 8:32 PM, Robert Coli rc...@eventbrite.com wrote:

On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

We have a node with a commit log directory of ~4G. During start-up of the node, while the commit log is being replayed, the used heap space grows constantly, ending with an OOM error. The heap size and new heap size properties are 1G and 256M. We are using the default settings for commitlog_sync, commitlog_sync_period_in_ms and commitlog_segment_size_in_mb.

What version of Cassandra? 1G is tiny for a Cassandra heap. There is a direct relationship between the data in the commitlog and memtables and in the heap. You almost certainly need more heap or less commitlog. =Rob
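For the "less commitlog" side of Rob's advice, a hedged sketch of the relevant cassandra.yaml knobs (the values below are illustrative only, not recommendations; check the yaml shipped with your 2.0.x install for the exact option names and defaults):

```yaml
# cassandra.yaml -- illustrative values, not recommendations.

# Caps how much commit log is kept on disk. A smaller cap bounds replay
# time and replay-time heap pressure after a restart, at the cost of
# forcing memtable flushes more often.
commitlog_total_space_in_mb: 1024

# Caps total heap used by memtables; if left unset, Cassandra derives a
# fraction of the heap automatically.
memtable_total_space_in_mb: 256
```

The main disadvantage of a smaller commit log is the extra flush activity (more, smaller SSTables and more compaction work), which is usually still preferable to OOMing during replay on a 1G heap.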
Re: Number of columns per row for composite columns?
Your question is a little too tangled for me... Are you asking about rows in a partition (some people call that a “storage row”) or columns per row? The latter is simply the number of columns that you have declared in your table. The total number of columns – or more properly, “cells” – in a partition would be the number of rows you have inserted in that partition times the number of columns you have declared in the table. If you need to review the terminology: http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows -- Jack Krupansky From: hlqv Sent: Tuesday, August 12, 2014 1:13 PM To: user@cassandra.apache.org Subject: Number of columns per row for composite columns? Hi everyone, I'm confused with number of columns in a row of Cassandra, as far as I know there is 2 billions columns per row. Like that if I have a composite column name in each row, for ex: (timestamp, userid), then number of columns per row is the number of distinct 'timestamp' or each distinct 'timestamp, userid' is a column?
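The cell arithmetic described above can be stated as a one-liner (the numbers in the example are made up, purely for illustration):

```python
def cells_in_partition(rows_inserted, declared_columns):
    """Total cells in a partition = CQL rows inserted into that partition
    times the number of columns declared in the table."""
    return rows_inserted * declared_columns
```

So, for example, 1,000,000 rows in one partition with 2 declared columns gives 2,000,000 cells, comfortably under the ~2 billion cells-per-partition limit the question refers to.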
Nodetool Repair questions
Some questions on nodetool repair. 1. This tool repairs inconsistencies across replicas of a row. Since the latest update always wins, I don't see inconsistencies arising other than from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from Cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? 2. I want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the cost of nodetool repair on each node grow linearly, or quadratically? The essence of this question is: if I have a keyspace with x replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Thanks Vish
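One concrete mechanism behind the deletes/tombstones/crashed-nodes combination mentioned above, sketched as a toy Python model (illustrative only, not Cassandra internals): a delete is recorded as a tombstone, and once the tombstone is purged after gc_grace_seconds, a replica that was down during the delete can resurrect the row unless repair ran in between.

```python
# Toy model of tombstone resurrection (not Cassandra internals).
# Each replica entry is (value, timestamp); a tombstone is value None.

def reconcile(a, b):
    """Last-write-wins merge of two replica entries."""
    return a if a[1] >= b[1] else b

# Replica A saw the delete; replica B was down at the time.
replica_a = {"row1": (None, 200)}   # tombstone written at t=200
replica_b = {"row1": ("old", 100)}  # stale live value

# Before gc_grace expires, merging still repairs B correctly:
merged = reconcile(replica_a["row1"], replica_b["row1"])
assert merged[0] is None  # the delete wins

# After gc_grace, A purges the tombstone entirely:
del replica_a["row1"]
# Now B's stale value is the only copy -> the deleted row "resurrects".
survivor = replica_b.get("row1")
print(survivor)
```

This is why running repair within gc_grace_seconds matters when deletes are in play; without deletes, last-write-wins alone converges.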
Re: Nodetool Repair questions
Hi Vish, 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? Even if you never delete data you should run repairs occasionally to ensure overall consistency. While hinted handoffs and read repairs do lead to better consistency, they are only helpers/optimizations and are not regarded as operations that ensure consistency. 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? It's difficult to calculate the performance of a repair; I've seen the time to completion fluctuate between 4hrs and 10hrs+ on the same node. However, in theory adding more nodes would spread the data and free up machine resources, resulting in more performant repairs. The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Could you expand on why you believe there would be an upper limit of dc/nodes due to running repairs? Mark On Tue, Aug 12, 2014 at 10:06 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Some questions on nodetool repair. 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? 
If wrong, can anyone provide other ways inconsistencies could occur? 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Thanks Vish
Re: OOM(Java heap space) on start-up during commit log replaying
Agreed, we need more details; and just start by increasing the heap, because that may well solve the problem. I have just observed (which makes sense when you think about it), while testing the fix for https://issues.apache.org/jira/browse/CASSANDRA-7546, that if you are replaying a commit log which has a high level of updates for the same partition key, you can hit that issue - excess memory allocation under high contention for the same partition key - (this might not cause OOM but will certainly massively tax GC, and it sounds like you don't have a lot of/any headroom). On Aug 12, 2014, at 12:31 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote: We have a node with a commit log directory of ~4G. During start-up of the node on commit log replaying the used heap space is constantly growing, ending with an OOM error. The heap size and new heap size properties are 1G and 256M. We are using the default settings for commitlog_sync, commitlog_sync_period_in_ms and commitlog_segment_size_in_mb. What version of Cassandra? 1G is tiny for a cassandra heap. There is a direct relationship between the data in the commitlog and memtables and in the heap. You almost certainly need more heap or less commitlog. =Rob
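Rob's "more heap or less commitlog" point can be seen with back-of-envelope arithmetic (a rough sketch under stated assumptions, not an exact Cassandra model): replayed commitlog data is rematerialized as memtables on the heap, so if flushes cannot keep pace during replay, the live memtable data must fit inside the heap's memtable budget.

```python
# Back-of-envelope sketch (assumptions, not an exact Cassandra model):
# commit log data replayed at startup is rebuilt as memtables in heap.

def replay_fits(commitlog_mb: int, heap_mb: int,
                memtable_fraction: float = 0.25,
                flush_keeps_up: bool = False) -> bool:
    """Can replay complete without exhausting the memtable budget?

    memtable_fraction: assumed share of heap budgeted for memtables.
    flush_keeps_up: whether flushes drain memtables during replay.
    """
    budget_mb = heap_mb * memtable_fraction
    if flush_keeps_up:
        return True  # data is flushed to SSTables as replay proceeds
    return commitlog_mb <= budget_mb

print(replay_fits(4096, 1024))   # ~4G of log vs a ~256M budget: no
print(replay_fits(4096, 8192))   # even a big heap needs flushing to drain
print(replay_fits(4096, 1024, flush_keeps_up=True))
```

The numbers mirror the thread (4G commitlog directory, 1G heap); the 25% memtable budget is an assumption for illustration, not the poster's actual setting.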
Re: Nodetool Repair questions
1. You don't have to repair if you use QUORUM consistency and you don't delete data. 2. Performance depends on the size of data each node has. It's very difficult to predict. It may take days. Thank you, Andrey On Tue, Aug 12, 2014 at 2:06 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Some questions on nodetool repair. 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Thanks Vish
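Andrey's point 1 rests on quorum intersection: with W + R > N, every read quorum overlaps every write quorum, so a read always touches at least one replica holding the latest write. A small sketch of that arithmetic (generic quorum math, not Cassandra code):

```python
from itertools import combinations

def quorum(n: int) -> int:
    """Quorum size for n replicas."""
    return n // 2 + 1

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

n = 3
w = r = quorum(n)  # QUORUM on both writes and reads
print(quorums_always_overlap(n, w, r))  # True: every read sees the latest write
print(quorums_always_overlap(n, 1, 1))  # False: ONE/ONE reads can miss it
```

With deletes in the picture this guarantee is no longer enough on its own, which is why the "no deletes" condition matters.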
Re: Nodetool Repair questions
Thanks Mark, Since we have replicas in each data center, the addition of a new data center (and new replicas) has a performance implication for nodetool repair. I do understand that adding nodes without increasing the number of replicas may improve repair performance, but in this case we are adding a new data center and additional replicas, which is added overhead for nodetool repair. Hence the thinking that we may reach an upper limit: the point at which the cost of nodetool repair becomes too high. On Tue, Aug 12, 2014 at 2:59 PM, Mark Reddy mark.re...@boxever.com wrote: Hi Vish, 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? Even if you never delete data you should run repairs occasionally to ensure overall consistency. While hinted handoffs and read repairs do lead to better consistency, they are only helpers/optimization and are not regarded as operations that ensure consistency. 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? Its difficult to calculate the performance of a repair, I've seen the time to completion fluctuate between 4hrs to 10hrs+ on the same node. However in theory adding more nodes would spread the data and free up machine resources, thus resulting in more performant repairs. The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? 
Could you expand on why you believe there would be an upper limit of dc/nodes due to running repairs? Mark On Tue, Aug 12, 2014 at 10:06 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Some questions on nodetool repair. 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Thanks Vish
Re: Nodetool Repair questions
Andrey, QUORUM consistency and no deletes makes perfect sense. I believe we could modify that to EACH_QUORUM or QUORUM consistency and no deletes - isn't that right? Thanks On Tue, Aug 12, 2014 at 3:10 PM, Andrey Ilinykh ailin...@gmail.com wrote: 1. You don't have to repair if you use QUORUM consistency and you don't delete data. 2. Performance depends on size of data each node has. It's very difficult to predict. It may take days. Thank you, Andrey On Tue, Aug 12, 2014 at 2:06 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Some questions on nodetool repair. 1. This tool repairs inconsistencies across replicas of the row. Since latest update always wins, I dont see inconsistencies other than ones resulting from the combination of deletes, tombstones, and crashed nodes. Technically, if data is never deleted from cassandra, then nodetool repair does not need to be run at all. Is this understanding correct? If wrong, can anyone provide other ways inconsistencies could occur? 2. Want to understand the performance of 'nodetool repair' in a Cassandra multi data center setup. As we add nodes to the cluster in various data centers, does the performance of nodetool repair on each node increase linearly, or is it quadratic ? The essence of this question is: If I have a keyspace with x number of replicas in each data center, do I have to deal with an upper limit on the number of data centers/nodes? Thanks Vish
range query times out (on 1 node, just 1 row in table)
Hi - I am currently running a single Cassandra node on my local dev machine. Here is my (test) schema (which is meaningless, I created it just to demonstrate the issue I am running into): CREATE TABLE foo ( foo_name ascii, foo_shard bigint, int_val bigint, PRIMARY KEY ((foo_name, foo_shard)) ) WITH read_repair_chance=0.1; CREATE INDEX ON foo (int_val); CREATE INDEX ON foo (foo_name); I have inserted just a single row into this table: insert into foo(foo_name, foo_shard, int_val) values('dave', 27, 100); This query works fine: select * from foo where foo_name='dave'; But when I run this query, I get an RPC timeout: select * from foo where foo_name='dave' and int_val > 0 allow filtering; With tracing enabled, here is the trace output: http://pastebin.com/raw.php?i=6XMEVUcQ (In short, everything looks fine to my untrained eye until 10s have elapsed, at which time the following event is logged: Timed out; received 0 of 1 responses for range 257 of 257) Can anyone help interpret this error? Many thanks! Ian
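A toy model of what the trace hints at (illustrative only, not Cassandra's query planner): the equality predicate can be served by the foo_name index, but the inequality on int_val is applied by filtering candidates row by row, and the "range 257 of 257" line suggests that per-range scanning, not the result size, is where the time goes.

```python
# Toy model (illustrative, not Cassandra's planner): an index lookup
# narrows candidates cheaply; ALLOW FILTERING then applies the
# non-indexed predicate to each candidate across the scanned ranges.

table = [
    {"foo_name": "dave", "foo_shard": 27, "int_val": 100},
    {"foo_name": "anna", "foo_shard": 3, "int_val": 50},
]

# Build a secondary index on foo_name.
index_on_name = {}
for row in table:
    index_on_name.setdefault(row["foo_name"], []).append(row)

# select * from foo where foo_name='dave'  -> pure index lookup
hits = index_on_name.get("dave", [])

# ... and int_val > 0 allow filtering  -> per-candidate predicate check;
# the cost scales with rows/ranges scanned, not with rows returned.
filtered = [r for r in hits if r["int_val"] > 0]
print(filtered)
```

This is only a sketch of the general index-vs-filter trade-off; why a single-row table scans all 257 ranges to the point of timing out is the part worth chasing in the trace or a JIRA search.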
Re: Nodetool Repair questions
On Tue, Aug 12, 2014 at 4:46 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Andrey, QUORUM consistency and no deletes makes perfect sense. I believe we could modify that to EACH_QUORUM or QUORUM consistency and no deletes - isn't that right? yes.