How to change the seed node Cassandra 1.0.11
Hi, in our production cluster we have 3 Cassandra 1.0.11 nodes. For operational reasons I want to move the seed role from the current seed node to another node, and once the seed has changed, remove the previous node from the cluster. How can I do that? Thanks. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-to-change-the-seed-node-Cassandra-1-0-11-tp7583338.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: What does ReadRepair exactly do?
Yes, all this starts because of the call to filter.collateColumns()… The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer; the methods to add columns on that interface pass through to an implementation of ISortedColumns. The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will call reconcile() on the IColumn if they need to. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 4:45 AM, Manu Zhang owenzhang1...@gmail.com wrote: Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what happens hereafter. How exactly is reconcile called? On Mon, Oct 22, 2012 at 6:49 AM, aaron morton aa...@thelastpickle.com wrote: There are two processes in cassandra that trigger Read Repair-like behaviour. During a read, a DigestMismatchException is raised if the responses from the replicas do not match. In this case another read is run that involves reading all the data. This is the CL level agreement kicking in. The other Read Repair is the one controlled by read_repair_chance. When RR is active on a request ALL up replicas are involved in the read. When RR is not active only CL replicas are involved. The test for CL agreement occurs synchronously with the request; the RR check waits asynchronously, after the request, for all nodes in the request to return. It then checks for consistency and repairs differences. From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. When a DigestMismatch is detected a read is run using RepairCallback. The callback will call RowRepairResolver.resolve() when enough responses have been collected. resolveSuperset() picks one response as the baseline, and then calls delete() to apply row level deletes from the other responses (ColumnFamily's).
It collects the other CF's into an iterator with a filter that returns all columns. The columns are then applied to the baseline CF, which may result in reconcile() being called. reconcile() is used when an AbstractColumnContainer has two versions of a column and it wants to keep only one. RowRepairResolver.scheduleRepairs() works out the delta for each node by calling ColumnFamily.diff(). The delta is then sent to the appropriate node. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 19/10/2012, at 6:33 AM, Markus Klems markuskl...@gmail.com wrote: Hi guys, I am looking through the Cassandra source code in the github trunk to better understand how Cassandra's fault-tolerance mechanisms work. Most things make sense. I am also aware of the wiki and DataStax documentation. However, I do not understand what read repair does in detail. The method RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to do the trick of merging conflicting versions of column family replicas and builds the set of columns that need to be repaired. From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. ReadRepair does not seem to trigger a Column.reconcile() to reconcile conflicting column versions on different servers. Does it? If this is not what read repair does, then: What kind of inconsistencies are resolved by read repair? And: How are the inconsistencies resolved? Could someone give me a hint? Thanks so much, -Markus
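To make the reconcile() step concrete, here is a toy sketch in Python (not Cassandra's actual Java; the Column tuple and field names are invented for illustration) of the timestamp-wins rule applied when a container ends up with two versions of the same column:

```python
# Toy model of column reconciliation: highest timestamp wins; on a tie,
# compare values so that every replica picks the same winner deterministically.
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])

def reconcile(a, b):
    """Return the surviving version of a column seen on two replicas."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Tie-break on value so reconciliation is deterministic cluster-wide.
    return a if a.value >= b.value else b

left = Column("status", "active", timestamp=100)
right = Column("status", "disabled", timestamp=200)
print(reconcile(left, right).value)  # prints "disabled": the newer write wins
```

The deterministic tie-break matters: without it, two replicas reconciling the same pair of columns could each keep a different version and never converge.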
Re: Node Dead/Up
check 10.50.10.21 for what is the system load. On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote: Hello, I'm on version 1.0.11. I'm seeing this in my system log with occasional frequency: INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818) InetAddress /10.50.10.21 is now dead. INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804) InetAddress /10.50.10.21 is now UP INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java (line 228) Streaming to /10.50.10.25 --this line included for context INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818) InetAddress /10.50.10.25 is now dead. INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804) InetAddress /10.50.10.25 is now UP INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249 AntiEntropyService.java (line 233) [repair #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included for context What is this telling me? Is my network dropping for less than a second? Are my nodes really dead and then up? Can someone shed some light on this for me? cheers, Jason
Re: tuning for read performance
and nodetool tpstats always shows pending tasks in the ReadStage. Are clients reading a single row at a time or multiple rows? Each row requested in a multi get becomes a task in the read stage. Also look at the type of query you are sending. I talked a little about the performance of different query techniques at Cassandra SF: http://www.datastax.com/events/cassandrasummit2012/presentations 1. Consider Leveled compaction instead of Size Tiered. LCS improves read performance at the cost of more writes. I would look at other options first. If you want to know how many SSTables a read is hitting look at nodetool cfhistograms 2. You said skinny column family which I took to mean not a lot of columns/row. See if you can organize your data into wider rows which allow reading fewer rows and thus fewer queries/disk seeks. Artificially wide rows may take longer to read than narrow ones. 4. Splitting your data from your MetaData could definitely help. I like separating my read heavy from write heavy CF's because generally speaking they benefit from different compaction methods. But don't go crazy creating 1000's of CF's either. +1, 25 ms read latency is high. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 9:06 AM, Aaron Turner synfina...@gmail.com wrote: On Mon, Oct 22, 2012 at 11:05 AM, feedly team feedly...@gmail.com wrote: Hi, I have a small 2 node cassandra cluster that seems to be constrained by read throughput. There are about 100 writes/s and 60 reads/s mostly against a skinny column family. Here's the cfstats for that family: SSTable count: 13 Space used (live): 231920026568 Space used (total): 231920026568 Number of Keys (estimate): 356899200 Memtable Columns Count: 1385568 Memtable Data Size: 359155691 Memtable Switch Count: 26 Read Count: 40705879 Read Latency: 25.010 ms. Write Count: 9680958 Write Latency: 0.036 ms. 
Pending Tasks: 0 Bloom Filter False Positives: 28380 Bloom Filter False Ratio: 0.00360 Bloom Filter Space Used: 874173664 Compacted row minimum size: 61 Compacted row maximum size: 152321 Compacted row mean size: 1445 iostat shows almost no write activity, here's a typical line: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.00 312.87 0.00 6.61 0.00 43.27 23.35 105.06 2.28 71.19 and nodetool tpstats always shows pending tasks in the ReadStage. The data set has grown beyond physical memory (250GB/node w/64GB of RAM) so I know disk access is required, but are there particular settings I should experiment with that could help relieve some read i/o pressure? I already put memcached in front of cassandra so the row cache probably won't help much. Also this column family stores smallish documents (usually 1-100K) along with metadata. The document is only occasionally accessed, usually only the metadata is read/written. Would splitting out the document into a separate column family help? Some un-expert advice: 1. Consider Leveled compaction instead of Size Tiered. LCS improves read performance at the cost of more writes. 2. You said skinny column family which I took to mean not a lot of columns/row. See if you can organize your data into wider rows which allow reading fewer rows and thus fewer queries/disk seeks. 3. Enable compression if you haven't already. 4. Splitting your data from your MetaData could definitely help. I like separating my read heavy from write heavy CF's because generally speaking they benefit from different compaction methods. But don't go crazy creating 1000's of CF's either. Hope that gives you some ideas to investigate further! -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. 
-- Benjamin Franklin carpe diem quam minimum credula postero
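Aaron Morton's point that each row in a multiget becomes its own ReadStage task can be sketched with back-of-envelope arithmetic. The helper below is invented for illustration; only the 60 reads/s and ~25 ms latency figures come from the thread, and 32 is the default concurrent_reads in cassandra.yaml:

```python
# Back-of-envelope model of ReadStage pressure: each key in a multiget
# becomes its own task, so task arrival rate = requests/s * keys/request.
def readstage_utilization(requests_per_s, keys_per_request,
                          concurrent_reads, ms_per_read):
    tasks_per_s = requests_per_s * keys_per_request
    capacity_tasks_per_s = concurrent_reads * (1000.0 / ms_per_read)
    return tasks_per_s / capacity_tasks_per_s  # > 1.0 => pending queue grows

# Single-key reads at 60/s with the default 32 concurrent_reads are fine...
print(readstage_utilization(60, 1, 32, 25))   # ~0.05
# ...but 30-key multigets at the same request rate saturate the stage.
print(readstage_utilization(60, 30, 32, 25))  # ~1.41
```

This is why a modest client request rate can still show a permanently non-empty ReadStage pending queue when each request fans out into many row reads.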
Re: Strange row expiration behavior
Performing these steps results in the rows still being present using cassandra-cli list. I assume you are saying the row key is listed without any columns, aka a ghost row. What gets really odd is if I add these steps it works That's working as designed. gc_grace_seconds does not specify when tombstones must be purged, rather it specifies the minimum duration the tombstone must be stored. It's really saying if you compact this column X seconds after the delete you can purge the tombstone. Minor / automatic compaction will kick in if there are (by default) 4 SSTables of the same size. And will only purge tombstones if all fragments of the row exist in the SSTables being compacted. Major / manual compaction compacts all the sstables, and so purges the tombstones IF gc_grace_seconds has expired. In your first example compaction had not run so the tombstones stayed on disk. In the second the major compaction purged expired tombstones. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote: Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm trying to test removing rows after all columns have expired. I've read the following: http://wiki.apache.org/cassandra/DistributedDeletes http://wiki.apache.org/cassandra/MemtableSSTable https://issues.apache.org/jira/browse/CASSANDRA-2795 And came up with a test to demonstrate the empty row removal that does the following: 1. create a keyspace 2. create a column family with gc_seconds=10 (arbitrary small number) 3. insert a couple rows with ttl=5 (again, just a small number) 4. use nodetool to flush the column family 5. sleep 10 seconds 6. ensure the columns are removed with cassandra-cli list 7. use nodetool to compact the keyspace Performing these steps results in the rows still being present using cassandra-cli list. 
What gets really odd is if I add these steps it works: 1. sleep 5 seconds 2. use cassandra-cli to del mycf[arow] 3. use nodetool to flush the column family 4. use nodetool to compact the keyspace I don't understand why the first set of steps (1-7) doesn't work to remove the empty row, nor do I understand why the explicit row delete somehow makes this work. I have all this in a script that I could attach if that's appropriate. Is there something wrong with the steps that I have? Thanks, Stephen
Re: nodetool cleanup
what is the internal memory model used? It sounds like it doesn't have a page manager? Nodetool cleanup is a maintenance process to remove data on disk that the node is no longer a replica for. It is typically used after the token assignments have been changed. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 6:42 PM, Will @ SOHO w...@voodoolunchbox.com wrote: On 10/23/2012 01:25 AM, Peter Schuller wrote: On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote: does nodetool cleanup perform a major compaction in the process of removing unwanted data? No. what is the internal memory model used? It sounds like it doesn't have a page manager?
Re: How to change the seed node Cassandra 1.0.11
Just change the yaml and restart. The seed list is not persisted in the System KS (like the token assignment is). I would suggest running 2 or 3 seeds in your cluster, even if you only have 3 nodes. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 7:13 PM, Roshan codeva...@gmail.com wrote: Hi, in our production cluster we have 3 Cassandra 1.0.11 nodes. For operational reasons I want to move the seed role from the current seed node to another node, and once the seed has changed, remove the previous node from the cluster. How can I do that? Thanks. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-to-change-the-seed-node-Cassandra-1-0-11-tp7583338.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
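As a sketch, the seed change is just an edit to the seed_provider section of cassandra.yaml on every node, followed by a rolling restart so each node picks up the new list (the addresses below are hypothetical):

```yaml
# cassandra.yaml (1.0.x) -- hypothetical addresses; per Aaron's suggestion,
# list 2 or 3 seeds even in a 3-node cluster.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # New seed list with the old seed removed:
          - seeds: "10.0.0.2,10.0.0.3"
```

Once no node lists the old seed, running nodetool decommission against the old node streams its data to the remaining replicas and removes it from the ring.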
Re: Node Dead/Up
check 10.50.10.21 for what is the system load. +1 And take a look in the logs on 10.21. 10.21 is being seen as down by the other nodes. It could be: * 10.21 failing to gossip fast enough, say by being overloaded or stuck in long ParNew GC pauses. * This node failing to process gossip fast enough, say by being overloaded or stuck in long ParNew GC pauses. * Problems with the tubes used to connect the nodes. (It's probably the first one.) Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote: check 10.50.10.21 for what is the system load. On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote: Hello, I'm on version 1.0.11. I'm seeing this in my system log with occasional frequency: INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818) InetAddress /10.50.10.21 is now dead. INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804) InetAddress /10.50.10.21 is now UP INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java (line 228) Streaming to /10.50.10.25 --this line included for context INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818) InetAddress /10.50.10.25 is now dead. INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804) InetAddress /10.50.10.25 is now UP INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249 AntiEntropyService.java (line 233) [repair #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included for context What is this telling me? Is my network dropping for less than a second? Are my nodes really dead and then up? Can someone shed some light on this for me? cheers, Jason
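The dead/UP flapping in those logs comes from Cassandra's phi-accrual failure detector: heartbeat gaps are tracked, and a peer is convicted "dead" when the current gap is improbably long compared with its recent history. Here is a simplified model of the idea (not the actual o.a.c.gms.FailureDetector code; the exponential-gap assumption and numbers are illustrative):

```python
# Simplified phi-accrual sketch: a node is marked down when phi for the
# current heartbeat gap exceeds the conviction threshold, and marked UP
# again as soon as the next heartbeat arrives.
import math

def phi(seconds_since_last_heartbeat, mean_gap_seconds):
    # Model gaps as exponential: P(gap > t) = exp(-t / mean); phi = -log10(P).
    return seconds_since_last_heartbeat / (mean_gap_seconds * math.log(10))

PHI_CONVICT_THRESHOLD = 8  # cassandra.yaml phi_convict_threshold default

mean_gap = 1.0  # gossip heartbeats roughly once per second
print(phi(2.0, mean_gap) > PHI_CONVICT_THRESHOLD)   # False: brief pause, still up
print(phi(20.0, mean_gap) > PHI_CONVICT_THRESHOLD)  # True: long pause, "is now dead"
```

So a node that stalls for tens of seconds (a long GC pause, gossip backlog, or a network blip) is convicted, and the immediate "is now UP" that follows just means the next heartbeat arrived, which matches the dead-then-UP pairs in the log.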
Re: Strange row expiration behavior
Thanks Aaron, my reply is inline below: On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: Performing these steps results in the rows still being present using *cassandra-cli list*. I assume you are saying the row key is listed without any columns. aka a ghost row. Correct. What gets really odd is if I add these steps it works That's working as designed. gc_grace_seconds does not specify when tombstones must be purged, rather it specifies the minimum duration the tombstone must be stored. It's really saying if you compact this column X seconds after the delete you can purge the tombstone. Minor / automatic compaction will kick in if there are (by default) 4 SSTables of the same size. And will only purge tombstones if all fragments of the row exist in the SSTables being compacted. Major / manual compaction compacts all the sstables, and so purges the tombstones IF gc_grace_seconds has expired. In your first example compaction had not run so the tombstones stayed on disk. In the second the major compaction purged expired tombstones. In the first example, I am running compaction at step 7 through nodetool, after gc_grace_seconds has expired. Additionally, if I do not perform the manual delete of the row in the second example, the ghost rows are not cleaned up. I want to know that in our production environment, I don't have to manually delete empty rows after the columns expire. But I can't get an example working to that effect. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote: Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm trying to test removing rows after all columns have expired. 
I've read the following: http://wiki.apache.org/cassandra/DistributedDeletes http://wiki.apache.org/cassandra/MemtableSSTable https://issues.apache.org/jira/browse/CASSANDRA-2795 And came up with a test to demonstrate the empty row removal that does the following: 1. create a keyspace 2. create a column family with gc_seconds=10 (arbitrary small number) 3. insert a couple rows with ttl=5 (again, just a small number) 4. use nodetool to flush the column family 5. sleep 10 seconds 6. ensure the columns are removed with *cassandra-cli list * 7. use nodetool to compact the keyspace Performing these steps results in the rows still being present using *cassandra-cli list*. What gets really odd is if I add these steps it works: 1. sleep 5 seconds 2. use cassandra-cli to *del mycf[arow]* 3. use nodetool to flush the column family 4. use nodetool to compact the keyspace I don't understand why the first set of steps (1-7) don't work to remove the empty row, nor do I understand why the explicit row delete somehow makes this work. I have all this in a script that I could attach if that's appropriate. Is there something wrong with the steps that I have? Thanks, Stephen
Re: constant CMS GC using CPU time
These GC settings are the default (recommended?) settings from cassandra-env. I added the UseCompressedOops. -Bryan On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO w...@voodoolunchbox.com wrote: On 10/22/2012 09:05 PM, aaron morton wrote: # GC tuning options JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1" JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75" JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly" JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops" You are too far behind the reference JVM's. Parallel GC is the preferred and highest performing form in the current Security Baseline version of the JVM's. -- Bryan Talbot Architect / Platform team lead, Aeria Games and Entertainment Silicon Valley | Berlin | Tokyo | Sao Paulo
Re: nodetool cleanup
since SSTABLEs are immutable, it must create new SSTABLEs without the data that the node is no longer a replica for ... but it doesn't remove deleted data. seems like a possible optimization to also remove deleted data and clean up tombstones ... but i guess cleanup shouldn't really be used that much. thx On Tue, Oct 23, 2012 at 12:40 AM, aaron morton aa...@thelastpickle.com wrote: what is the internal memory model used? It sounds like it doesn't have a page manager? Nodetool cleanup is a maintenance process to remove data on disk that the node is no longer a replica for. It is typically used after the token assignments have been changed. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 6:42 PM, Will @ SOHO w...@voodoolunchbox.com wrote: On 10/23/2012 01:25 AM, Peter Schuller wrote: On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote: does nodetool cleanup perform a major compaction in the process of removing unwanted data? No. what is the internal memory model used? It sounds like it doesn't have a page manager?
Re: What does ReadRepair exactly do?
Hello, This conversation precisely targets a question that I had been having for a while - would be grateful if someone could clarify it a little further: Considering the case of a repair created due to a consistency constraint (first case in the discussion above), would the following interpretation be correct? 1. A digest mismatch exception is raised if even one among the many responses differs (even if consistency is met on an out-of-date value, say by virtue of timestamps). 2. A read is initiated by the callback to fetch data from all replicas 3. Resolve() is invoked to find the deltas for each replica that was out of date. 4. ReadRepair is scheduled to the above replicas. 5. Perform a normal read and check if this meets the consistency constraints. Mismatches would trigger a repair again. Assuming the above is true, would the mutations in step 4 and the read in step 5 happen in parallel? In other words, would the time taken by the read correction be the round trip between the coordinator and its farthest replica that meets the consistency constraint? Thanks, Shankar On Tue, Oct 23, 2012 at 3:17 AM, aaron morton aa...@thelastpickle.com wrote: Yes, all this starts because of the call to filter.collateColumns()… The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer; the methods to add columns on that interface pass through to an implementation of ISortedColumns. The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will call reconcile() on the IColumn if they need to. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 4:45 AM, Manu Zhang owenzhang1...@gmail.com wrote: Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what happens hereafter. How exactly is reconcile called? 
On Mon, Oct 22, 2012 at 6:49 AM, aaron morton aa...@thelastpickle.com wrote: There are two processes in cassandra that trigger Read Repair-like behaviour. During a read, a DigestMismatchException is raised if the responses from the replicas do not match. In this case another read is run that involves reading all the data. This is the CL level agreement kicking in. The other Read Repair is the one controlled by read_repair_chance. When RR is active on a request ALL up replicas are involved in the read. When RR is not active only CL replicas are involved. The test for CL agreement occurs synchronously with the request; the RR check waits asynchronously, after the request, for all nodes in the request to return. It then checks for consistency and repairs differences. From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. When a DigestMismatch is detected a read is run using RepairCallback. The callback will call RowRepairResolver.resolve() when enough responses have been collected. resolveSuperset() picks one response as the baseline, and then calls delete() to apply row level deletes from the other responses (ColumnFamily's). It collects the other CF's into an iterator with a filter that returns all columns. The columns are then applied to the baseline CF, which may result in reconcile() being called. reconcile() is used when an AbstractColumnContainer has two versions of a column and it wants to keep only one. RowRepairResolver.scheduleRepairs() works out the delta for each node by calling ColumnFamily.diff(). The delta is then sent to the appropriate node. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 19/10/2012, at 6:33 AM, Markus Klems markuskl...@gmail.com wrote: Hi guys, I am looking through the Cassandra source code in the github trunk to better understand how Cassandra's fault-tolerance mechanisms work. Most things make sense. 
I am also aware of the wiki and DataStax documentation. However, I do not understand what read repair does in detail. The method RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to do the trick of merging conflicting versions of column family replicas and builds the set of columns that need to be repaired. From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. ReadRepair does not seem to trigger a Column.reconcile() to reconcile conflicting column versions on different servers. Does it? If this is not what read repair does, then: What kind of inconsistencies are resolved by read repair? And: How are the inconsistencies resolved? Could someone give me a hint? Thanks so much, -Markus
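The resolve-then-repair flow described in the quoted thread can be sketched roughly like this (a toy Python model, not the actual RowRepairResolver code; columns are modeled as name -> (value, timestamp) maps and the replica data is invented):

```python
# Toy model of read repair's resolve step: merge all replica responses into
# a timestamp-winning superset, then send each replica only the delta it is
# missing, as ColumnFamily.diff() computes per node.
def resolve_superset(responses):
    merged = {}
    for cf in responses:                  # cf maps column name -> (value, ts)
        for col, (val, ts) in cf.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (val, ts)
    return merged

def diff(replica_cf, superset):
    """Columns the replica is missing or holds stale: its repair mutation."""
    return {col: vt for col, vt in superset.items()
            if replica_cf.get(col, (None, -1))[1] < vt[1]}

r1 = {"a": ("x", 1)}
r2 = {"a": ("y", 2), "b": ("z", 1)}
superset = resolve_superset([r1, r2])
print(diff(r1, superset))  # r1 needs the newer "a" and the missing "b"
print(diff(r2, superset))  # {} -- r2 is already up to date
```

A replica that already holds the winning versions gets an empty delta and no repair mutation, which is why repair traffic only flows to out-of-date nodes.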
Re: Strange row expiration behavior
In the first example, I am running compaction at step 7 through nodetool, Sorry missed that. insert a couple rows with ttl=5 (again, just a small number) ExpiringColumn's are only purged if their TTL has expired AND their absolute (node local) expiry time occurred before the current gcBefore time. This may explain why the columns were not purged in the first compaction. Can you try your first steps again? And then for the second set of steps add a new row, flush, compact. The expired rows should be removed. I don't have to manually delete empty rows after the columns expire. Rows are automatically purged when all columns are purged. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/10/2012, at 3:05 AM, Stephen Mullins smull...@thebrighttag.com wrote: Thanks Aaron, my reply is inline below: On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: Performing these steps results in the rows still being present using cassandra-cli list. I assume you are saying the row key is listed without any columns. aka a ghost row. Correct. What gets really odd is if I add these steps it works That's working as designed. gc_grace_seconds does not specify when tombstones must be purged, rather it specifies the minimum duration the tombstone must be stored. It's really saying if you compact this column X seconds after the delete you can purge the tombstone. Minor / automatic compaction will kick in if there are (by default) 4 SSTables of the same size. And will only purge tombstones if all fragments of the row exist in the SSTables being compacted. Major / manual compaction compacts all the sstables, and so purges the tombstones IF gc_grace_seconds has expired. In your first example compaction had not run so the tombstones stayed on disk. In the second the major compaction purged expired tombstones. In the first example, I am running compaction at step 7 through nodetool, after gc_grace_seconds has expired. 
Additionally, if I do not perform the manual delete of the row in the second example, the ghost rows are not cleaned up. I want to know that in our production environment, I don't have to manually delete empty rows after the columns expire. But I can't get an example working to that effect. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote: Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm trying to test removing rows after all columns have expired. I've read the following: http://wiki.apache.org/cassandra/DistributedDeletes http://wiki.apache.org/cassandra/MemtableSSTable https://issues.apache.org/jira/browse/CASSANDRA-2795 And came up with a test to demonstrate the empty row removal that does the following: create a keyspace create a column family with gc_seconds=10 (arbitrary small number) insert a couple rows with ttl=5 (again, just a small number) use nodetool to flush the column family sleep 10 seconds ensure the columns are removed with cassandra-cli list use nodetool to compact the keyspace Performing these steps results in the rows still being present using cassandra-cli list. What gets really odd is if I add these steps it works: sleep 5 seconds use cassandra-cli to del mycf[arow] use nodetool to flush the column family use nodetool to compact the keyspace I don't understand why the first set of steps (1-7) don't work to remove the empty row, nor do I understand why the explicit row delete somehow makes this work. I have all this in a script that I could attach if that's appropriate. Is there something wrong with the steps that I have? Thanks, Stephen
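Aaron's purge rule can be reduced to a small model (node-local times in seconds; the function name and numbers are illustrative, mirroring the ttl=5 / gc_grace_seconds=10 test in the thread):

```python
# Model of the purge rule: an expired column is only dropped by compaction
# when its TTL has passed AND its expiry time is older than
# gcBefore = now - gc_grace_seconds at the moment the compaction runs.
def purgeable(expiry_time, now, gc_grace_seconds):
    gc_before = now - gc_grace_seconds
    return expiry_time <= now and expiry_time < gc_before

EXPIRES_AT = 5  # column written at t=0 with ttl=5

# Compacting at t=12: the TTL has expired, but t=5 is not yet older than
# gcBefore (12 - 10 = 2), so the tombstone survives -- the "ghost row".
print(purgeable(EXPIRES_AT, now=12, gc_grace_seconds=10))  # False
# Compacting at t=16: gcBefore = 6 > 5, so the column is finally purged.
print(purgeable(EXPIRES_AT, now=16, gc_grace_seconds=10))  # True
```

Under this model the grace period effectively starts at column expiry rather than at insert, which is why compacting only 10 seconds after a ttl=5 write can still leave ghost rows behind.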
Re: constant CMS GC using CPU time
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote: The GC was ongoing even when the nodes were not compacting or running a heavy application load -- even when the main app was paused the constant GC continued. If you restart a node is the onset of GC activity correlated to some event? Yes and no. When the nodes were generally under the .75 occupancy threshold a weekly repair -pr job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions. It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete. Once one CF in particular grew larger the constant GC would start up pretty soon (less than 90 minutes) after a node restart even without a repair. As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default). How many keys did the CF have per node? I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node. I've changed the app to retain less data for that CF but I think that it was about 400M rows per node. Row keys are a TimeUUID. All of the rows are write-once, never updated, and rarely read. There are no secondary indexes for this particular CF. They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0. The default should be 0.000744. 
If the chance is zero or null this code should run when a new SSTable is written: // paranoia -- we've had bugs in the thrift - avro - CfDef dance before, let's not let that break things logger.error("Bloom filter FP chance of zero isn't supposed to happen"); Were the CF's migrated from an old version? Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an upgradesstables run at each upgrade along the way. I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value. JMX reports the actual fp rate and if a specific rate is set for a CF that shows up in describe table but I couldn't find out how to tell what the default was. I didn't inspect the source. Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows is known, and fp chance is known? See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results. Ahh, very helpful. This indicates that 714MB would be used for the bloom filter for that one CF. JMX / cfstats reports Bloom Filter Space Used but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used is similar then summing up all the Bloom Filter Space Used says they're currently consuming 1-2 GB of the heap which is substantial. If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0? It just means more trips to SSTable indexes for a read, correct? Trade RAM for time (disk I/O). -Bryan
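The ~714 MB figure above can be cross-checked with the textbook bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2 (a rough sketch; Cassandra's actual filter allocation buckets sizes slightly differently):

```python
# Rough bloom filter memory estimate from key count and false-positive chance.
import math

def bloom_filter_mb(num_keys, fp_chance):
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8 / 1024 / 1024

# ~400M row keys per node at the 0.000744 default false-positive chance:
print(round(bloom_filter_mb(400_000_000, 0.000744)))  # 715 (MB)
```

This is why per-node key count dominates bloom filter heap usage: the size is linear in the number of keys, and only logarithmic in the false-positive chance.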
Re: Node Dead/Up
thanks for the replies. I'll check the load on the node that is reported as DOWN/UP. At first glance it does not appear to be overloaded. But, I will dig in deeper; is there a specific indicator on an ubuntu server that would be useful to me? Also, I didn't make it clear, but in my original post, there are logs from 2 different nodes: 10.21 and 10.25. They are each reporting that the other is DOWN/UP at the same time. Would that still point me to the suggestions you made? I don't see errors in the logs, but I do see a lot of dropped mutations and reads. Any correlation? thanks again, Jason On Tue, Oct 23, 2012 at 12:49 AM, aaron morton aa...@thelastpickle.com wrote: check 10.50.10.21 for what is the system load. +1 And take a look in the logs on 10.21. 10.21 is being seen as down by the other nodes. it could be: * 10.21 failing to gossip fast enough, say by being overloaded or stuck in long ParNew GC pauses. * This node failing to process gossip fast enough, say by being overloaded or stuck in long ParNew GC pauses. * Problems with the tubes used to connect the nodes. (It's probably the first one.) Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote: check 10.50.10.21 for what is the system load. On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote: Hello, I'm on version 1.0.11. I'm seeing this in my system log with occasional frequency: INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818) InetAddress /10.50.10.21 is now dead. INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804) InetAddress /10.50.10.21 is now UP INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java (line 228) Streaming to /10.50.10.25 --this line included for context INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818) InetAddress /10.50.10.25 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804) InetAddress /10.50.10.25 is now UP INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249 AntiEntropyService.java (line 233) [repair #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included for context What is this telling me? Is my network dropping for less than a second? Are my nodes really dead and then up? Can someone shed some light on this for me? cheers, Jason
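For context on why nodes flap dead/UP like this: Cassandra's Gossiper marks peers down via a phi accrual failure detector, so a long stop-the-world GC pause on either side inflates the suspicion score past phi_convict_threshold (default 8) and the peer is briefly convicted, then marked UP again as soon as gossip resumes. A rough sketch of the phi computation (simplified and approximate, loosely modeled on o.a.c.gms.FailureDetector, not a faithful copy of it):

```python
import math

# Simplified phi accrual failure detector sketch. Phi grows with the
# time since the last heartbeat, scaled by the mean inter-arrival time
# of past heartbeats. The 1/ln(10) factor converts an exponential-tail
# estimate into a base-10 "suspicion" score; a peer is convicted when
# phi exceeds phi_convict_threshold (8 by default in cassandra.yaml).

PHI_FACTOR = 1.0 / math.log(10.0)  # ~0.4343

def phi(now, last_heartbeat, mean_interval):
    """Suspicion level for a peer, given seconds since its last gossip."""
    return PHI_FACTOR * (now - last_heartbeat) / mean_interval

def is_convicted(now, last_heartbeat, mean_interval, threshold=8.0):
    """True when the peer should be declared dead."""
    return phi(now, last_heartbeat, mean_interval) > threshold
```

With a 1-second mean gossip interval, phi crosses 8 after roughly 18-19 seconds of silence, which is why a multi-second GC pause or network hiccup shows up in the logs as a dead/UP pair moments apart: the peer is convicted, then resurrected by the very next heartbeat.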
Re: What does ReadRepair exactly do?
why repair again? We block until the consistency constraint is met. Then the latest version is returned and repair is done asynchronously if any mismatch. We may retry read if fewer columns than required are returned. On Wed, Oct 24, 2012 at 6:10 AM, shankarpnsn shankarp...@gmail.com wrote: Hello, This conversation precisely targets a question that I had been having for a while - would be grateful if you someone cloud clarify it a little further: Considering the case of a repair created due to a consistency constraint (first case in the discussion above), would the following interpretation be correct ? 1. A digest mismatch exception is raised even if one among the many responses (even if consistency is met on an out-of-date value, say by virtue of timestamp). 2. A read is initiated by the callback to fetch data from all replicas 3. Resolve() is invoked to find the deltas for each replica that was out of date. 4. ReadRepair is scheduled to the above replicas. 5. Perform a normal read and check if this meets the consistency constraints. Mismatches would trigger a repair again. Assuming the above is true, would the mutations in step 4 and the read in step 5 happen in parallel ? In other words, would the time taken by the read correction be the round trip between the coordinator and its farthest replica that meets the consistency constraint. Thanks, Shankar -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583352.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
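The flow being described can be sketched in a few lines. This is an illustrative toy, not Cassandra's actual classes or API: the coordinator blocks for the first CL responses, reconciles them by timestamp, returns the merged result to the client, and fires repair mutations at stale replicas asynchronously.

```python
# Illustrative sketch of the coordinator read path discussed above --
# not Cassandra's real API. Each replica response is modeled as a
# dict of column -> (value, timestamp); reconcile keeps the
# highest-timestamp version of each column.

def reconcile(responses):
    """Merge replica responses; the newest timestamp wins per column."""
    merged = {}
    for resp in responses:
        for col, (value, ts) in resp.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return merged

def coordinator_read(responses, consistency_level):
    """Block for CL responses; on mismatch, resolve by timestamp and
    collect repair mutations for stale replicas (sent async in reality)."""
    waited = responses[:consistency_level]   # block until CL replicas answer
    latest = reconcile(waited)
    repairs = []
    for i, resp in enumerate(waited):
        delta = {c: v for c, v in latest.items() if resp.get(c) != v}
        if delta:
            repairs.append((i, delta))       # applied async; read returns now
    return latest, repairs
```

The point made above is visible here: the client gets `latest` as soon as CL responses are merged, while the `repairs` list is applied in the background, so a write racing an in-flight repair can briefly leave replicas returning the older version.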
Re: What does ReadRepair exactly do?
manuzhang wrote why repair again? We block until the consistency constraint is met. Then the latest version is returned and repair is done asynchronously if any mismatch. We may retry read if fewer columns than required are returned. Just to make sure I understand you correctly: consider the case when a read repair is in flight and a subsequent write affects one or more of the replicas that were scheduled to receive the repair mutations. In this case, are you saying that we return the older version to the user rather than the latest version that was affected by the write? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: What does ReadRepair exactly do?
I think so. Otherwise, we may never complete a read if writes come in continuously. On Wed, Oct 24, 2012 at 9:04 AM, shankarpnsn shankarp...@gmail.com wrote: manuzhang wrote why repair again? We block until the consistency constraint is met. Then the latest version is returned and repair is done asynchronously if any mismatch. We may retry read if fewer columns than required are returned. Just to make sure I understand you correct, considering the case when a read repair is in flight and a subsequent write affects one or more of the replicas that was scheduled to received the repair mutations. In this case, are you saying that we return the older version to the user rather than the latest version that was effected by the write ? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: constant CMS GC using CPU time
Regarding memory usage after a repair ... Are the merkle trees kept around? On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote: On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote: The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. If you restart a node is the onset of GC activity correlated to some event? Yes and no. When the nodes were generally under the .75 occupancy threshold a weekly repair -pr job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions. It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete. Once one CF in particular grew larger the constant GC would start up pretty soon (less than 90 minutes) after a node restart even without a repair. As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default). How many keys did the CF have per node? I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node. I've changed the app to retain less data for that CF but I think that it was about 400M rows per node. Row keys are a TimeUUID. All of the rows are write-once, never updated, and rarely read. There are no secondary indexes for this particular CF. They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0. The default should be 0.000744.
If the chance is zero or null this code should run when a new SSTable is written: // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things logger.error("Bloom filter FP chance of zero isn't supposed to happen"); Were the CF's migrated from an old version ? Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an upgradesstables run at each upgrade along the way. I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value. JMX reports the actual fp rate, and if a specific rate is set for a CF that shows up in describe table, but I couldn't find out how to tell what the default was. I didn't inspect the source. Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows, and fp chance are known? See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results. Ahh, very helpful. This indicates that 714MB would be used for the bloom filter for that one CF. JMX / cfstats reports Bloom Filter Space Used but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used is similar, then summing up all the Bloom Filter Space Used says they're currently consuming 1-2 GB of the heap, which is substantial. If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0? It just means more trips to SSTable indexes for a read, correct? Trade RAM for time (disk I/O). -Bryan
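The back-of-envelope math behind that 714MB figure is the standard bloom filter sizing formula (the same one the linked calculator uses). This is a generic estimate, not a readout of Cassandra's internal allocation, but it lands right on the number quoted above:

```python
import math

# Standard bloom filter sizing: for n elements and false-positive
# chance p, an optimal filter needs m = -n * ln(p) / (ln 2)^2 bits.
# Generic estimate (what http://hur.st/bloomfilter computes), not
# Cassandra's exact bookkeeping, but close to what cfstats reports.

def bloom_filter_bytes(num_keys, fp_chance):
    """Estimated optimal bloom filter size in bytes."""
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

# ~400M row keys at the default fp chance of 0.000744:
mib = bloom_filter_bytes(400_000_000, 0.000744) / 2**20
# comes out around 715 MiB -- in line with the 714MB mentioned above
```

Setting bloom_filter_fp_chance to 1.0 drives the estimate to zero bits, at the cost of a trip to the index of every SSTable on each read: exactly the RAM-for-disk-I/O trade described.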