Re: Re[2]: how wide can wide rows get?
We have up to a few hundred million columns in a super wide row. There are two major issues you should care about:

1. The wider the row is, the more memory pressure you get for every slice query.
2. Repair is row based, which means a huge row could be transferred at every repair.

1 is not a big issue if you don't have many concurrent slice requests. Having more cores is a good investment to reduce memory pressure. 2 could cause very high memory pressure as well as poorer disk utilization.

On Fri, Nov 14, 2014 at 3:21 PM, Plotnik, Alexey aplot...@rhonda.ru wrote: We have 380k of them in some of our rows and it's ok.

-- Original Message -- From: Hannu Kröger hkro...@gmail.com To: user@cassandra.apache.org Sent: 14.11.2014 16:13:49 Subject: Re: how wide can wide rows get?

The theoretical limit is maybe 2 billion, but the recommended max is around 10-20 thousand. Br, Hannu

On 14.11.2014, at 8.10, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: I'm struggling with this wide row business. Is there an upper limit on the number of columns you can have? Adaryl Bob Wakefield, MBA Principal Mass Street Analytics 913.938.6685 www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData
A fix for those who suffer from GC storm by tombstones
Hi, I have filed a fix as CASSANDRA-8038, which should be good news for those who have suffered from overwhelming GC or OOM caused by tombstones. I'd appreciate your feedback! Thanks, Takenori
Re: A fix for those who suffer from GC storm by tombstones
DuyHai and Rob, Thanks for your feedback. Yeah, that's exactly the point I found. Some may want to run read repair even on tombstones as before, but others, like Rob and us, may not. Personally, I take read repair as a nice-to-have feature, especially for tombstones, where a regular repair is enforced anyway. So with this fix, I expect that a user can choose a better, manageable risk as needed. The good news is, the improvement in performance is significant! - Takenori

Sent from my iPhone

On 2014/10/08 3:18, Robert Coli rc...@eventbrite.com wrote: On Tue, Oct 7, 2014 at 1:57 AM, DuyHai Doan doanduy...@gmail.com wrote: Read Repair belongs to the anti-entropy procedures that ensure that, eventually, data from all replicas converge. Tombstones are data (deletion markers), so they need to be exchanged between replicas. By skipping tombstones you prevent data convergence with regard to deletion. Read repair is an optimization. I would probably just disable it in the OP's case and rely entirely on AES repair, because the 8038 approach makes read repair not actually repair in some cases... =Rob
Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram
In addition to the suggestions by Jonathan, you can run a user-defined compaction against a particular set of SSTable files where you want to remove tombstones. But to do that, you need to find such an optimal set. Here you can find a couple of helpful tools: https://github.com/cloudian/support-tools

On Mon, Mar 10, 2014 at 7:41 PM, Oleg Dulin oleg.du...@gmail.com wrote: I get that :) What I'd like to know is how to fix that :)

On 2014-03-09 20:24:54 +, Takenori Sato said: You have millions of org.apache.cassandra.db.DeletedColumn instances in the snapshot. This means you have lots of column tombstones, which, I guess, are read into memory by slice queries.

On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin oleg.du...@gmail.com wrote: I am trying to understand why one of my nodes keeps doing full GCs. I have Xmx set to 8 gigs; memtable total size is 2 gigs. Consider the top entries from jmap -histo:live @ http://pastebin.com/UaatHfpJ -- Regards, Oleg Dulin http://www.olegdulin.com
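A rough sketch of what such a candidate-finding tool does: rank SSTables by their estimated droppable-tombstone ratio and pick the ones worth a user-defined compaction. All names, the threshold, and the statistic shown here are illustrative assumptions, not the actual support-tools API.

```python
# Pick SSTables whose estimated droppable-tombstone ratio exceeds a
# threshold, as candidates for a user-defined compaction.
# (Hypothetical file names and ratios; threshold is arbitrary.)

def pick_compaction_candidates(sstables, min_tombstone_ratio=0.3):
    """sstables: list of (filename, droppable_tombstone_ratio) tuples."""
    candidates = [name for name, ratio in sstables
                  if ratio >= min_tombstone_ratio]
    return sorted(candidates)

stats = [
    ("ks-cf-ic-101-Data.db", 0.05),
    ("ks-cf-ic-205-Data.db", 0.45),
    ("ks-cf-ic-317-Data.db", 0.62),
]
print(pick_compaction_candidates(stats))
```

In a real cluster the ratio would come from SSTable metadata rather than a hard-coded list, and the chosen files would then be handed to the CompactionManager MBean.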
Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram
You have millions of org.apache.cassandra.db.DeletedColumn instances in the snapshot. This means you have lots of column tombstones, which, I guess, are read into memory by slice queries.

On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin oleg.du...@gmail.com wrote: I am trying to understand why one of my nodes keeps doing full GCs. I have Xmx set to 8 gigs; memtable total size is 2 gigs. Consider the top entries from jmap -histo:live @ http://pastebin.com/UaatHfpJ -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Recommended amount of free disk space for compaction
Hi,

> If Cassandra only compacts one table at a time, then I should be safe if I keep as much free space as there is data in the largest table. If Cassandra can compact multiple tables simultaneously, then it seems that I need as much free space as all the tables put together, which means no more than 50% utilization.

That depends on your configuration: by default there is one compactor per CPU core. See concurrent_compactors for details.

> Also, what happens if a node gets low on disk space and there isn't enough available for compaction?

A compaction checks whether there is enough disk space based on its size estimate. If there isn't, it won't be executed.

> Is there a way to salvage a node that gets into a state where it cannot compact its tables?

If you carefully run some cleanups, you'll get some room back based on the node's new range.

On Fri, Nov 29, 2013 at 12:21 PM, Robert Wille rwi...@fold3.com wrote: I'm trying to estimate our disk space requirements, and I'm wondering about the disk space required for compaction. My application mostly inserts new data and updates existing data very infrequently, so there will be very few bytes removed by compaction. It seems that if a major compaction occurs, performing the compaction will require as much disk space as is currently consumed by the table. So here's my question. If Cassandra only compacts one table at a time, then I should be safe if I keep as much free space as there is data in the largest table. If Cassandra can compact multiple tables simultaneously, then it seems that I need as much free space as all the tables put together, which means no more than 50% utilization. So, how much free space do I need? Any rules of thumb anyone can offer? Also, what happens if a node gets low on disk space and there isn't enough available for compaction? If I add new nodes to reduce the amount of data on each node, I assume the space won't be reclaimed until a compaction event occurs. Is there a way to salvage a node that gets into a state where it cannot compact its tables? Thanks, Robert
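The free-space check mentioned in the reply can be modeled in a few lines. This is an illustrative sketch, not Cassandra's actual code; the worst-case assumption that the compacted output is roughly the sum of the inputs (nothing merges or expires) is mine.

```python
# Toy model of a pre-compaction disk-space check: refuse to compact
# when the estimated output would not fit in the free space.

def can_compact(input_sizes_bytes, free_bytes):
    # Worst case: no rows merge or expire, so output ~= sum of inputs.
    estimated_output = sum(input_sizes_bytes)
    return free_bytes >= estimated_output

gb = 1024 ** 3
print(can_compact([10 * gb, 12 * gb], free_bytes=25 * gb))  # True
print(can_compact([10 * gb, 12 * gb], free_bytes=20 * gb))  # False
```

This is also why the "keep as much free space as your largest table" rule of thumb works for size-tiered compaction of a single table at a time.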
Re: Tracing Queries at Cassandra Server
In addition to CassandraServer, add StorageProxy for details, as follows:

log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG

Hope it helps.

On Mon, Nov 11, 2013 at 11:25 AM, Srinath Perera srin...@wso2.com wrote: I am talking to Cassandra using Hector. Is there a way I can trace the executed queries at the server? I have tried enabling DEBUG logging for org.apache.cassandra.thrift.CassandraServer as mentioned in "Cassandra vs logging activity" (http://stackoverflow.com/questions/9604554/cassandra-vs-logging-activity), but that does not provide much info (e.g. it says a slice query was executed, but does not give more detail). What I am looking for is something like SQL tracing in MySQL, so that all executed queries are logged. --Srinath
Re: Cass 1.1.11 out of memory during compaction ?
I would go with cleanup. Be careful of this bug: https://issues.apache.org/jira/browse/CASSANDRA-5454

On Mon, Nov 4, 2013 at 9:05 PM, Oleg Dulin oleg.du...@gmail.com wrote: If I do that, wouldn't I need to scrub my sstables?

Takenori Sato ts...@cloudian.com wrote: Try increasing column_index_size_in_kb. A slice query to get some ranges (SliceFromReadCommand) requires reading all the column indexes for the row, and thus could hit OOM if you have a very wide row.

On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin oleg.du...@gmail.com wrote: Cass 1.1.11 ran out of memory on me with this exception (see below). My parameters are 8 gig heap, new gen is 1200M.

ERROR [ReadStage:55887] 2013-11-02 23:35:18,419 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:55887,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
at org.apache.cassandra.db.Table.getRow(Table.java:378)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any thoughts? This is a dual data center setup, with 4 nodes in each DC and RF=2 in each. -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Cass 1.1.11 out of memory during compaction ?
Try increasing column_index_size_in_kb. A slice query to get some ranges (SliceFromReadCommand) requires reading all the column indexes for the row, and thus could hit OOM if you have a very wide row.

On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin oleg.du...@gmail.com wrote: Cass 1.1.11 ran out of memory on me with this exception (see below). My parameters are 8 gig heap, new gen is 1200M.

ERROR [ReadStage:55887] 2013-11-02 23:35:18,419 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:55887,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
at org.apache.cassandra.db.Table.getRow(Table.java:378)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any thoughts? This is a dual data center setup, with 4 nodes in each DC and RF=2 in each. -- Regards, Oleg Dulin http://www.olegdulin.com
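For reference, column_index_size_in_kb is set in cassandra.yaml. The value below is only an illustrative example of "increasing" it, not a recommendation; a larger value means fewer index entries per wide row, so less heap is needed to deserialize the column index on a slice read, at the cost of coarser seeks within the row.

```
# cassandra.yaml (fragment) -- example value only
# Default is 64; raising it reduces the number of column-index entries
# kept per row at the cost of reading larger index blocks per seek.
column_index_size_in_kb: 256
```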
Re: questions related to the SSTable file
> So in fact, incremental backup of Cassandra just hard-links all the new SSTable files being generated during the incremental backup period. It could contain any data, not just the data being updated/inserted/deleted in this period, correct?

Correct. But over time, some old enough SSTable files are usually shared across multiple snapshots.

On Wed, Sep 18, 2013 at 3:37 AM, java8964 java8964 java8...@hotmail.com wrote: Another question: the SSTable files generated in the incremental backup are not really ONLY an incremental delta, right? They will include more than the delta. I will use an example to show my question. First, we have this data in SSTable file 1: rowkey(1), columns (maker=honda). Later, we add one column to the same key: rowkey(1), columns (maker=honda, color=blue). The data above is flushed to another SSTable file 2, which will be part of the incremental backup at this time. But in fact, it will contain both the old data (maker=honda) plus the new change (color=blue). So in fact, incremental backup of Cassandra just hard-links all the new SSTable files being generated during the incremental backup period. It could contain any data, not just the data being updated/inserted/deleted in this period, correct? Thanks, Yong

From: dean.hil...@nrel.gov To: user@cassandra.apache.org Date: Tue, 17 Sep 2013 08:11:36 -0600 Subject: Re: questions related to the SSTable file

Netflix created file streaming in astyanax into cassandra specifically because writing too big a column cell is a bad thing. The limit really depends on your use case… do you have servers writing 1000's of 200 MB files at the same time? If so, astyanax streaming may be a better way to go, since it divides up the file amongst cells and rows. I know the limit of a row size is really your hard disk space, and the column count, if I remember, goes into the billions, though realistically I think beyond 10 million it might slow down a bit… all I know is we tested up to 10 million columns with no issues in our use case.

> So you mean at this time, I could get 2 SSTable files, both containing column Blue for the same row key, right?

Yes.

> In this case, I should be fine, as the value of the Blue column contains the timestamp to help me find out which is the last change, right?

Yes.

> In the MR world, each file COULD be processed by a different Mapper, but will be sent to the same reducer as both data will share the same key.

If that is the way you are writing it, then yes. Dean

From: Shahab Yunus shahab.yu...@gmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, September 17, 2013 7:54 AM To: user@cassandra.apache.org Subject: Re: questions related to the SSTable file

…understand if the following changes apply to the same row key as the above example, an additional SSTable file could be generated. That is
Re: questions related to the SSTable file
Thanks, Rob, for clarifying! - Takenori

(2013/09/18 10:01), Robert Coli wrote: On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato ts...@cloudian.com wrote:

> So in fact, incremental backup of Cassandra just hard-links all the new SSTable files being generated during the incremental backup period. It could contain any data, not just the data being updated/inserted/deleted in this period, correct? Correct. But over time, some old enough SSTable files are usually shared across multiple snapshots.

To be clear, the incremental backup feature backs up the data being modified in that period, because it writes only those files to the incremental backup dir as hard links, between full snapshots. http://www.datastax.com/docs/1.0/operations/backup_restore "When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows you to store backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism." What Takenori is referring to is that a full snapshot is in some ways an incremental backup, because it shares hard-linked SSTables with other snapshots. =Rob
Re: questions related to the SSTable file
Yong, it seems there is still a misunderstanding.

> But there is no way we can be sure that these SSTable files will ONLY contain modified data. So the statement being quoted above is not exactly right. I agree that all the modified data in that period will be in the incremental sstable files, but a lot of other unmodified data will be in them too.

A memtable (the source of a new SSTable) contains only modified data, as I explained with the example.

> If we have 2 rows of data with different row keys in the same memtable, and only the 2nd row is modified: when the memtable is flushed to an SSTable file, it will contain both rows, and both will be in the incremental backup files. So for the first row, nothing changed, but it will be in the incremental backup.

Unless the first row is modified, it does not exist in the memtable at all.

> If I have one row with one column, and now a new column is added, the whole row in one memtable is flushed to an SSTable file, which is also in this incremental backup. For the first column, nothing changed, but it will still be in the incremental backup file.

For example, if it worked as you understand it, then Color-2 would contain two more entries: the row Lavender, and row Blue with its existing column hex, like the following. But it does not.

- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]

-- your understanding:
- Color-2-Data.db: [{Lavender: {hex: #E6E6FA}}, {Green: {hex: #008000}}, {Blue: {hex: #FF, hex2: #2c86ff}}]
* Row Lavender and column hex of Blue have no changes

> The point I tried to make is that this is important if I design an ETL to consume the incremental backup SSTable files. As in the above example, I have to realize that the incremental backup sstable files could, or most likely will, contain old data which was previously processed. That requires additional logic and responsibility in the ETL to handle it, or for any outside SSTable consumer to pay attention to it.

I suggest trying org.apache.cassandra.tools.SSTableExport; then you will see what's going on under the hood. - Takenori

On Wed, Sep 18, 2013 at 10:51 AM, java8964 java8964 java8...@hotmail.com wrote: Quote: "To be clear, the incremental backup feature backs up the data being modified in that period, because it writes only those files to the incremental backup dir as hard links, between full snapshots." I thought I was clear, but your clarification confused me again. My understanding so far, from all the answers I got, is that a more accurate statement of incremental backup would be: the incremental backup feature backs up the SSTable files being generated in that period. But there is no way we can be sure that these SSTable files will ONLY contain modified data. So the statement quoted above is not exactly right. I agree that all the modified data in that period will be in the incremental sstable files, but a lot of other unmodified data will be in them too. If we have 2 rows of data with different row keys in the same memtable, and only the 2nd row is modified: when the memtable is flushed to an SSTable file, it will contain both rows, and both will be in the incremental backup files. So for the first row, nothing changed, but it will be in the incremental backup. If I have one row with one column, and now a new column is added, the whole row in one memtable is flushed to an SSTable file, which is also in this incremental backup. For the first column, nothing changed, but it will still be in the incremental backup file. The point I tried to make is that this is important if I design an ETL to consume the incremental backup SSTable files. As in the above example, I have to realize that the incremental backup sstable files could, or most likely will, contain old data which was previously processed. That requires additional logic and responsibility in the ETL to handle it, or for any outside SSTable consumer to pay attention to it. Yong

-- Date: Tue, 17 Sep 2013 18:01:45 -0700 Subject: Re: questions related to the SSTable file From: rc...@eventbrite.com To: user@cassandra.apache.org

On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato ts...@cloudian.com wrote: So in fact, incremental backup of Cassandra just hard-links all the new SSTable files being generated during the incremental backup period. It could contain any data, not just the data being updated/inserted/deleted in this period, correct? Correct. But over time, some old enough SSTable files are usually shared across multiple snapshots. To be clear, the incremental backup feature backs up the data being modified in that period, because it writes only those files to the incremental backup dir as hard links, between full snapshots. http://www.datastax.com/docs/1.0/operations/backup_restore When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under
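Takenori's point that a memtable holds only the rows and columns written since the previous flush can be modeled in a few lines. This is a toy sketch with made-up names and values, not Cassandra code:

```python
# Minimal model of memtable flushing: each flushed "SSTable" (and hence
# each incremental backup file) contains only mutations applied since
# the previous flush, never the untouched rows.

class TinyStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []          # list of flushed {row: {col: val}} maps

    def write(self, row, col, val):
        self.memtable.setdefault(row, {})[col] = val

    def flush(self):
        self.sstables.append(self.memtable)
        self.memtable = {}          # a fresh, empty memtable

store = TinyStore()
store.write("Blue", "hex", "#0000FF")
store.flush()                           # sstable 1: only row "Blue"
store.write("Blue", "hex2", "#2c86ff")  # update an existing row
store.flush()                           # sstable 2: only the new column
print(store.sstables)
```

Note that the second flushed map carries only the new column hex2, not the earlier hex value; that is why an incremental backup is a delta of mutations, even though a single flushed file can still mix old and new rows if both were written in the same flush window.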
Re: questions related to the SSTable file
Hi,

> 1) I will expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right?

Yes.

> 2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary? What I mean is that for one column's data, it will be guaranteed to be either in the first file or the 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct?

No.

> 3) If what we are talking about are only the SSTable files in snapshots and incremental backups, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case?
> 4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I try to achieve? In this case, what happens in the following example:

I don't fully understand, but a snapshot will do. It creates hard links to all the SSTable files present at the snapshot.

Let me explain how SSTables and compaction work. Suppose we have 4 files being compacted (the last one has just been flushed, which then triggered the compaction). Note that file names are simplified.

- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
- Color-3-Data.db: [{Aqua: {hex: #00}}, {Green: {hex2: #32CD32}}, {Blue: {}}]
- Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]

They are created by the following operations.

- Add a row of (key, column, column_value = Blue, hex, #FF)
- Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
memtable is flushed => Color-1-Data.db
- Add a row of (key, column, column_value = Green, hex, #008000)
- Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
memtable is flushed => Color-2-Data.db
- Add a column of (key, column, column_value = Green, hex2, #32CD32)
- Add a row of (key, column, column_value = Aqua, hex, #00)
- Delete a row of (key = Blue)
memtable is flushed => Color-3-Data.db
- Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
- Add a row of (key, column, column_value = Gold, hex, #FFD700)
memtable is flushed => Color-4-Data.db

Then, a compaction will merge all those fragments together into the latest versions, as follows.

- Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00}}, {Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]
* assuming RandomPartitioner is used

Hope this helps.

- Takenori

(2013/09/17 10:51), java8964 java8964 wrote: Hi, I have some questions related to SSTables in Cassandra, as I am doing a project using it, and I hope someone on this list can share some thoughts. My understanding is that SSTables are per column family, but each column family could have multiple SSTable files. During runtime, one row COULD be split across more than one SSTable file; even though this is not good for performance, it does happen, and Cassandra will try to merge and store one row's data into one SSTable file during compaction. The question is: when one row is split across multiple SSTable files, what is the boundary? Or let me ask this way: if one row exists in 2 SSTable files, and I run the sstable2json tool on both SSTable files individually: 1) I will expect the same row key could show up in both sstable2json outputs, as this one row exists in both SSTable files, right? 2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary?

What I mean is that for one column's data, it will be guaranteed to be either in the first file or the 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 parts, with one part stored in the first SSTable file and the other part stored in the second SSTable file. Is my understanding correct? 3) If what we are talking about are only the SSTable files in snapshots and incremental backups, excluding the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, can one row's data still exist in more than one SSTable file? And does the boundary change in this case? 4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I try to achieve? In this case, what happens in the following example: For column family A, at Time 0, one row key (key1) has some data. It is stored and backed up in SSTable file 1. At Time 1, if any column for key1 has any change (a new column insert, a column updated/deleted, or even the whole row being deleted), I will expect this whole row to exist in any incremental backup SSTable files
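The Color-1 through Color-5 merge described in the reply above can be sketched as follows. This is a simplified model with my own placeholder hex values (the archive truncated some of the originals): it ignores per-column timestamps and gc_grace, and assumes the Blue row tombstone is newer than every Blue column, so the row drops out entirely, as in the Color-5 result.

```python
# Toy model of an SSTable compaction merge: later flushes win per
# column, and a row tombstone (assumed newest, past gc_grace) removes
# the row from the compacted output.

def compact(sstables, row_tombstones):
    """sstables: oldest-first list of {row: {col: val}} maps.
    row_tombstones: rows deleted after all their columns were written."""
    merged = {}
    for table in sstables:
        for row, cols in table.items():
            merged.setdefault(row, {}).update(cols)
    for row in row_tombstones:
        merged.pop(row, None)       # tombstone shadows the whole row
    return merged

flushes = [
    {"Lavender": {"hex": "#E6E6FA"}, "Blue": {"hex": "#0000FF"}},
    {"Green": {"hex": "#008000"}, "Blue": {"hex2": "#2c86ff"}},
    {"Aqua": {"hex": "#00FFFF"}, "Green": {"hex2": "#32CD32"}},
    {"Magenta": {"hex": "#FF00FF"}, "Gold": {"hex": "#FFD700"}},
]
print(compact(flushes, {"Blue"}))
```

The output mirrors Color-5-Data.db: Green carries both hex and hex2, and Blue is gone.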
/proc/sys/vm/zone_reclaim_mode
Hi, I am investigating NUMA issues. I am aware that bin/cassandra tries to use the interleave-all policy if available.

https://issues.apache.org/jira/browse/CASSANDRA-2594
https://issues.apache.org/jira/browse/CASSANDRA-3245

So what about /proc/sys/vm/zone_reclaim_mode? Any recommendations? I didn't find any with respect to Cassandra. By default on a Linux NUMA machine, this is set to 1, which tries to reclaim some pages within a zone rather than acquiring pages from other zones. Explicitly disabling it sounds better:

"It may be beneficial to switch off zone reclaim if the system is used for a file server and all of memory should be used for caching files from disk. In that case the caching effect is more important than data locality." https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Thanks! Takenori
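For reference, one common way to disable zone reclaim persistently on Linux is via sysctl; this fragment is a generic Linux example, not Cassandra-specific guidance:

```
# /etc/sysctl.conf (fragment) -- disable NUMA zone reclaim so the page
# cache may use memory from any NUMA node; apply with `sysctl -p`,
# or set it immediately with `sysctl -w vm.zone_reclaim_mode=0`.
vm.zone_reclaim_mode = 0
```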
Re: Random Distribution, yet Order Preserving Partitioner
Hi Manoj, thanks for your advice. More or less, we basically do the same. As you pointed out, we now face many cases that cannot be solved by data modeling, which are reaching 100 million columns. We could split them into multiple metadata rows, but that would bring more complexity and thus be error prone. If possible, we want to avoid that. - Takenori

On 2013/08/27 21:37, Manoj Mainali mainalima...@gmail.com wrote: Hi Takenori, I can't tell for sure without knowing what kind of data you have and how much you have. You can use the random partitioner and use the concept of a metadata row that stores the row keys, for example like below:

{metadata_row}: key1 | key2 | key3
key1: column1 | column2

When you do a read, you can always query directly by the key if you already know it. In the case of range queries, first you query the metadata_row and get the keys you want in ordered fashion. Then you do a multiget to fetch your actual data. The downside is that you have to do two read queries, and depending on how much data you have, you will end up with a wide metadata row. Manoj

On Fri, Aug 23, 2013 at 8:47 AM, Takenori Sato ts...@cloudian.com wrote: Hi Nick,

> token and key are not the same. It was like this long ago (a single MD5 assumed a single key).

True. That reminds me to run a test with the latest 1.2 instead of our current 1.0!

> if you want ordered, you probably can arrange your data in a way so you can get it in ordered fashion.

Yeah, we have done that for a long time. That's called a wide row, right? Or a compound primary key. It can handle some millions of columns, but not more, like 10M. I mean, a request for such a row concentrates on a particular node, so the performance degrades.

> I also had an idea for a semi-ordered partitioner - instead of a single MD5, to have two MD5's.

Sounds interesting. But we need a fully ordered result. Anyway, I will try with the latest version.

Thanks, Takenori

On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov n...@nmmm.nu wrote: My five cents - token and key are not the same. It was like this long ago (a single MD5 assumed a single key). If you want ordered, you probably can arrange your data in a way so you can get it in ordered fashion. For example, long ago I had a single column family with a single key and about 2-3 M columns - I do not suggest you do it this way, because it is the wrong way, but it is easy to understand the idea. I also had an idea for a semi-ordered partitioner - instead of a single MD5, to have two MD5's. Then you can get semi-ordered ranges, e.g. you get ordered all cities in Canada, all cities in the US, and so on. However, in this way things may get pretty unbalanced. Nick

On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato ts...@cloudian.com wrote: Hi, I am trying to implement a custom partitioner that distributes evenly, yet preserves order. The partitioner returns a token as a BigInteger, as RandomPartitioner does, while it produces a decorated key as a string, as OrderPreservingPartitioner does. * For now, since IPartitioner<T> does not support different types for token and key, the BigInteger is simply converted to a string. Then, I played around with cassandra-cli. As expected, in my 3-node test cluster, get/set worked, but list (get_range_slices) didn't. This came from a challenge to overcome wide row scalability. So, I want to make it work! I am aware that some effort is required to make get_range_slices work. But are there any other critical problems? For example, it seems there is an assumption that token and key are the same. If this holds throughout the whole C* code, this partitioner is not practical. Or have you tried something similar? I would appreciate your feedback! Thanks, Takenori
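The metadata-row pattern Manoj describes can be sketched with plain dicts standing in for Cassandra tables; the key names and values here are made up, and a real implementation would use a column slice on the metadata row plus a multiget on the data rows:

```python
# Metadata-row pattern under a random partitioner: one wide row keeps
# all row keys in sorted order (the comparator does this in Cassandra),
# so a range query is a slice of that row followed by a multiget.

data = {
    "key1": {"column1": "v1"},
    "key2": {"column1": "v2"},
    "key3": {"column1": "v3"},
}
# Columns of the metadata row, kept sorted as a comparator would.
metadata_row = sorted(data.keys())

def range_query(start, end):
    keys = [k for k in metadata_row if start <= k <= end]  # slice metadata
    return [(k, data[k]) for k in keys]                    # multiget data

print(range_query("key1", "key2"))
```

The trade-off mirrors the thread: two round trips instead of one, and the metadata row itself grows as wide as the key count.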
Re: OrderPreservingPartitioner in 1.2
From the Jira:

> One possibility is that getToken of OPP can return hex value if it fails to encode bytes to UTF-8 instead of throwing error. By this system tables seem to be working fine with OPP.

This looks like an option worth trying for me. Thanks!

(2013/08/23 20:44), Vara Kumar wrote:

For the first exception: OPP was not working in 1.2. It has been fixed, but the fix is not yet in the latest 1.2.8 release. Jira issue about it: https://issues.apache.org/jira/browse/CASSANDRA-5793

On Fri, Aug 23, 2013 at 12:51 PM, Takenori Sato ts...@cloudian.com wrote:

Hi,

I know it has been deprecated, but does OrderPreservingPartitioner still work with 1.2? I just wanted to see how it works, but I got a couple of exceptions, as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java (line 175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
    at org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
    at org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
    at org.apache.cassandra.db.Table.apply(Table.java:379)
    at org.apache.cassandra.db.Table.apply(Table.java:353)
    at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
    at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
    at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
    at org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
    at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
    at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
    at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
    at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
    at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
    at org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
    ... 16 more

The key was 0ab68145 in hex, which contains some control characters.

The other exception is this:

INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line 891) JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73) Beginning bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line 430) Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
    at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
    at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
    at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
    at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672 CassandraDaemon.java (line 175) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
    at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
    at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513
OrderPreservingPartitioner in 1.2
Hi,

I know it has been deprecated, but does OrderPreservingPartitioner still work with 1.2? I just wanted to see how it works, but I got a couple of exceptions, as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java (line 175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
    at org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
    at org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
    at org.apache.cassandra.db.Table.apply(Table.java:379)
    at org.apache.cassandra.db.Table.apply(Table.java:353)
    at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
    at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
    at org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
    at org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
    at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
    at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
    at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
    at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
    at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
    at org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
    ... 16 more

The key was 0ab68145 in hex, which contains some control characters.

The other exception is this:

INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line 891) JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73) Beginning bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line 430) Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
    at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
    at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
    at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
    at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672 CassandraDaemon.java (line 175) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
    at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
    at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:662)

I tried to set up a 3-node cluster with the tokens A, H, and P, one per node. This error was raised by the second node, with the token H.

Thanks,
Takenori
Re: Random Distribution, yet Order Preserving Partitioner
Hi Nick,

> token and key are not same. it was like this long time ago (single MD5 assumed single key)

True. That reminds me to run a test with the latest 1.2 instead of our current 1.0!

> if you want ordered, you probably can arrange your data in a way so you can get it in ordered fashion.

Yeah, we have done that for a long time. That's called a wide row, right? Or a compound primary key. It can handle some millions of columns, but not more, like 10M. I mean, requests for such a row concentrate on a particular node, so the performance degrades.

> I also had idea for semi-ordered partitioner - instead of single MD5, to have two MD5's.

Sounds interesting. But we need a fully ordered result.

Anyway, I will try with the latest version.

Thanks,
Takenori

On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov n...@nmmm.nu wrote:

my five cents -

token and key are not the same. it was like this long time ago (single MD5 assumed single key)

if you want it ordered, you probably can arrange your data in a way so you can get it in ordered fashion. for example, long ago I had a single column family with a single key and about 2-3 M columns - I do not suggest you do it this way, because it is the wrong way, but it is easy to understand the idea.

I also had an idea for a semi-ordered partitioner - instead of a single MD5, to have two MD5's. Then you can get semi-ordered ranges, e.g. you get ordered all cities in Canada, all cities in the US, and so on. However, in this way things may get pretty non-balanced.

Nick

On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato ts...@cloudian.com wrote:

Hi,

I am trying to implement a custom partitioner that distributes evenly, yet preserves order. The partitioner returns a token as a BigInteger, as RandomPartitioner does, while it returns a decorated key as a string, as OrderPreservingPartitioner does.

* for now, since IPartitioner<T> does not support different types for token and key, the BigInteger is simply converted to a string

Then, I played around with cassandra-cli. As expected, in my 3-node test cluster, get/set worked, but list (get_range_slices) didn't.

This came from a challenge to overcome wide row scalability limits. So, I want to make it work!

I am aware that some effort is required to make get_range_slices work. But are there any other critical problems? For example, it seems there is an assumption that token and key are the same. If this holds throughout the whole C* code base, this partitioner is not practical.

Or have you tried something similar? I would appreciate your feedback!

Thanks,
Takenori
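The token side of such a partitioner (an MD5-derived BigInteger, as RandomPartitioner produces, while the raw key is kept as the decorated-key string) can be sketched as follows. This is an illustration only, not Cassandra's actual IPartitioner interface, and the sample key is hypothetical.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the token function discussed above: evenly distributed MD5 tokens,
// while the original key string is preserved separately for ordered scans.
public class Md5TokenSketch {
    // non-negative MD5 digest of the key, as a BigInteger token
    static BigInteger token(byte[] key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key)).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] key = "hello/100.txt".getBytes();
        System.out.println(token(key));      // distributes placement evenly
        System.out.println(new String(key)); // key itself stays order-comparable
    }
}
```

The tension the thread identifies remains visible here: placement is driven by the hashed token, so any range scan over the preserved key order has to touch every node, which is why get_range_slices needs extra work.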
Fp chance for column level bloom filter
Hi,

I thought the memory consumption of the column-level bloom filter would become a big concern when a row gets very wide, like more than tens of millions of columns. But I read in the source (1.0.7) that the fp chance for the column-level bloom filter is hard-coded as 0.160, which is very high. So it seems not to be a concern after all. Is this correct?

Thanks,
Takenori
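For context, the standard bloom filter sizing formula shows why a high fp chance keeps memory down on wide rows: an optimally sized filter needs roughly -ln(p) / (ln 2)^2 bits per element. A quick back-of-the-envelope check (the row width is hypothetical):

```java
// Compares the per-column bloom filter cost at the hard-coded fp chance of
// 0.16 against a stricter 0.01, for a very wide row.
public class BloomFilterMath {
    // bits per element for an optimally sized bloom filter with fp chance p
    static double bitsPerElement(double fpChance) {
        return -Math.log(fpChance) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        long columns = 10_000_000L; // a very wide row
        for (double p : new double[] {0.16, 0.01}) {
            double mb = bitsPerElement(p) * columns / 8 / 1024 / 1024;
            System.out.printf("fp=%.2f -> %.2f bits/column, ~%.1f MB for %d columns%n",
                    p, bitsPerElement(p), mb, columns);
        }
    }
}
```

At p = 0.16 this works out to roughly 3.8 bits per column, versus roughly 9.6 bits at p = 0.01, so the high hard-coded fp chance cuts the filter's memory footprint to well under half.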
Re: Alternate major compaction
It's light. Without the -v option, you can even run it against just an SSTable file, without needing the whole Cassandra installation.

- Takenori

On Sat, Jul 13, 2013 at 6:18 AM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Jul 11, 2013 at 9:43 PM, Takenori Sato ts...@cloudian.com wrote:

> I made the repository public. Now you can check it out from here: https://github.com/cloudian/support-tools
>
> checksstablegarbage is the tool. Enjoy, and any feedback is welcome.

Thanks very much, useful tool! Out of curiosity, what does writesstablekeys do that the upstream tool sstablekeys does not?

=Rob
Re: Alternate major compaction
Hi,

I made the repository public. Now you can check it out from here: https://github.com/cloudian/support-tools

checksstablegarbage is the tool.

Enjoy, and any feedback is welcome.

Thanks,
- Takenori

On Thu, Jul 11, 2013 at 10:12 PM, srmore comom...@gmail.com wrote:

Thanks Takenori. Looks like the tool provides some good info that people can use. It would be great if you can share it with the community.

On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato ts...@cloudian.com wrote:

Hi,

I think this is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause; there is more. For example, I see two typical scenarios:

1. backup use case
2. active wide row

In case 1, say, a piece of data is removed a year later. This means the tombstone on the row is one year away from the original row. To remove an expired row entirely, a compaction set has to include all the row fragments. So, when are the original, one-year-old row and the tombstoned row included in the same compaction set? It is likely to take one year.

In case 2, such an active wide row exists in most of the sstable files, and it typically contains many expired columns. But none of them get removed entirely, because a compaction set practically never includes all the row fragments.

Btw, a very convenient MBean API is available: CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files.

To that end, I wrote a tool that checks garbage and prints out some useful information for finding such an optimal set. Here's a simple log output.
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
---
TOTAL, 40, 40
===

REMAINNING_SSTABLE_FILES means any other sstable files that contain the respective row. So, the following is an optimal set:

# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 223, 0, YES, YES
---
TOTAL, 223, 0
===

This tool relies on SSTableReader and an aggregation iterator, as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Note that it is based on 1.0.7, so I will need to check and update it for newer versions.

Thanks,
Takenori

On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com wrote:

Hi,

About a year ago, we did a major compaction in our cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTiered compaction strategy, and we haven't evaluated Leveled compaction yet, because it has its downsides and we've had no time to test it in our environment).

I was trying to find a way out of this situation (that is, do something like a major compaction that writes small sstables, not huge ones as major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states).
Then I tried deleting all data on a node and bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned of deleted records and updates. But the new node just copied the sstables from another node as they were, cleaning nothing.

So I tried a new approach: I switched the sstable compaction strategy (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch, and then switched it back (Leveled to SizeTiered). It took a while (but so does the major compaction process) and it worked; I have smaller sstables
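The forceUserDefinedCompaction MBean operation mentioned earlier in this thread can be invoked over JMX. The sketch below assumes the 1.0-era MBean name org.apache.cassandra.db:type=CompactionManager and a (keyspace, comma-separated file list) operation signature; verify both against your Cassandra version before relying on it. The host, port, keyspace, and file names are illustrative.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: trigger a compaction over a hand-picked set of sstables via the
// CompactionManager MBean, e.g. the optimal set reported by checksstablegarbage.
public class UserDefinedCompaction {
    static final String MBEAN_NAME = "org.apache.cassandra.db:type=CompactionManager";

    // assumed signature: forceUserDefinedCompaction(keyspace, "file1,file2,...")
    static void forceCompaction(MBeanServerConnection mbs, String keyspace, String files)
            throws Exception {
        mbs.invoke(new ObjectName(MBEAN_NAME), "forceUserDefinedCompaction",
                new Object[] {keyspace, files},
                new String[] {"java.lang.String", "java.lang.String"});
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) { // no JMX host given: just show what would run
            System.out.println("would invoke forceUserDefinedCompaction on " + MBEAN_NAME);
            return;
        }
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            forceCompaction(jmxc.getMBeanServerConnection(), "UserData",
                    "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db");
        }
    }
}
```

Used this way, the tool's output and the MBean together give the "alternate major compaction" the thread is after: repeatedly compacting only the minimal sets of files that actually contain the garbage.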
Re: Reduce Cassandra GC
GC options are not set. You should see the following:

-XX:+PrintGCDateStamps
-XX:+PrintPromotionFailure
-Xloggc:/var/log/cassandra/gc-1371603607.log

> Is it normal to have two processes like this?

No. You are running two processes.

On Wed, Jun 19, 2013 at 4:16 PM, Joel Samuelsson samuelsson.j...@gmail.com wrote:

My Cassandra ps info:

root 26791 1 0 07:14 ? 00:00:00 /usr/bin/jsvc -user cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile /var/run/cassandra.pid -errfile 1 -outfile /var/log/cassandra/output.log -cp /usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false org.apache.cassandra.service.CassandraDaemon

103 26792 26791 99 07:14 ? 854015-22:02:22 /usr/bin/jsvc -user cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile /var/run/cassandra.pid -errfile 1 -outfile /var/log/cassandra/output.log -cp 
/usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M
Re: Reduce Cassandra GC
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line 116) testing_Keyspace.cf19 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line 116) testing_Keyspace.cf20 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line 116) testing_Keyspace.cf21 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line 116) testing_Keyspace.cf22 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line 116) OpsCenter.rollups7200 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line 116) OpsCenter.rollups86400 0,0
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line 116) OpsCenter.rollups60 13745,3109686
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line 116) OpsCenter.events 18,826
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,518 StatusLogger.java (line 116) OpsCenter.rollups300 2516,570931
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line 116) OpsCenter.pdps 9072,160850
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line 116) OpsCenter.events_timeline 3,86
INFO [ScheduledTasks:1] 2013-06-17 08:13:47,520 StatusLogger.java (line 116) OpsCenter.settings 0,0

And from gc-1371454124.log I get:

2013-06-17T08:11:22.300+: 2551.288: [GC 870971K->216494K(4018176K), 145.1887460 secs]

2013/6/18 Takenori Sato ts...@cloudian.com

> Look for a promotion failure. Bingo if it happened at that time. Otherwise, post the relevant portion of the log here; someone may find a hint.

On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson samuelsson.j...@gmail.com wrote:

Just got a very long GC again. What am I to look for in the logging I just enabled?

2013/6/17 Joel Samuelsson samuelsson.j...@gmail.com

> If you are talking about 1.2.x then I also have memory problems on the idle cluster: java memory constantly grows slowly up to the limit, then spends a long time in GC. I have never seen such behaviour in 1.0.x and 1.1.x, where on an idle cluster java memory stays at the same value.

No, I am running Cassandra 1.1.8.

> Can you paste your gc config?

I believe the relevant configs are these:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

I haven't changed anything in the environment config up until now.

> Also can you take a heap dump at 2 diff points so that we can compare it?

I can't access the machine at all during the stop-the-world freezes. Was that what you wanted me to try?

> Uncomment the followings in cassandra-env.sh.

Done. Will post results as soon as I get a new stop-the-world gc.

> If you are unable to find a JIRA, file one

Unless this turns out to be a problem on my end, I will.
Re: Reduce Cassandra GC
Look for a promotion failure. Bingo if it happened at that time. Otherwise, post the relevant portion of the log here; someone may find a hint.

On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson samuelsson.j...@gmail.com wrote:

Just got a very long GC again. What am I to look for in the logging I just enabled?

2013/6/17 Joel Samuelsson samuelsson.j...@gmail.com

> If you are talking about 1.2.x then I also have memory problems on the idle cluster: java memory constantly grows slowly up to the limit, then spends a long time in GC. I have never seen such behaviour in 1.0.x and 1.1.x, where on an idle cluster java memory stays at the same value.

No, I am running Cassandra 1.1.8.

> Can you paste your gc config?

I believe the relevant configs are these:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

I haven't changed anything in the environment config up until now.

> Also can you take a heap dump at 2 diff points so that we can compare it?

I can't access the machine at all during the stop-the-world freezes. Was that what you wanted me to try?

> Uncomment the followings in cassandra-env.sh.

Done. Will post results as soon as I get a new stop-the-world gc.

> If you are unable to find a JIRA, file one

Unless this turns out to be a problem on my end, I will.
Re: Reduce Cassandra GC
> INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is 1046937600

This says a GC of the New Generation took that long, which is usually unlikely.

The only situation I am aware of is when a fairly large object is created that cannot be promoted to the Old Generation, because it requires a large *contiguous* memory space that is unavailable at that point in time. This is called a promotion failure. The collection then has to wait until the concurrent collector frees a large enough space. Thus you experience stop-the-world. (Though strictly, I think it is not stop-the-world, but only stop-the-new-world.)

For example, in the case of Cassandra, a large value of in_memory_compaction_limit_in_mb can cause this. This is the limit up to which a compaction merges the fragments of a row into the latest version in memory, so it can create a byte array as large as that limit.

You can confirm this by enabling promotion-failure GC logging going forward, and by checking which compactions executed at that point in time.

On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.com wrote:

On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote:

> If you are talking about 1.2.x then I also have memory problems on the idle cluster: java memory constantly grows slowly up to the limit, then spends a long time in GC. I have never seen such behaviour in 1.0.x and 1.1.x, where on an idle cluster java memory stays at the same value.

If you are not aware of a pre-existing JIRA, I strongly encourage you to:

1) Document your experience of this.
2) Search issues.apache.org for anything that sounds similar.
3) If you are unable to find a JIRA, file one.

Thanks!
=Rob
Re: Reduce Cassandra GC
Uncomment the following in cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

> Also can you take a heap dump at 2 diff points so that we can compare it?

No, I'm afraid not. I ordinarily use profiling tools, but am not aware of anything that could respond during this event.

On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

Can you paste your gc config? Also can you take a heap dump at 2 diff points so that we can compare it? A quick thing to do would be a histo live at 2 points and compare.

Sent from my iPhone

On Jun 15, 2013, at 6:57 AM, Takenori Sato ts...@cloudian.com wrote:

> INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is 1046937600

This says a GC of the New Generation took that long, which is usually unlikely.

The only situation I am aware of is when a fairly large object is created that cannot be promoted to the Old Generation, because it requires a large *contiguous* memory space that is unavailable at that point in time. This is called a promotion failure. The collection then has to wait until the concurrent collector frees a large enough space. Thus you experience stop-the-world. (Though strictly, I think it is not stop-the-world, but only stop-the-new-world.)

For example, in the case of Cassandra, a large value of in_memory_compaction_limit_in_mb can cause this. This is the limit up to which a compaction merges the fragments of a row into the latest version in memory, so it can create a byte array as large as that limit.

You can confirm this by enabling promotion-failure GC logging going forward, and by checking which compactions executed at that point in time.
On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.com wrote:

On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote:

> If you are talking about 1.2.x then I also have memory problems on the idle cluster: java memory constantly grows slowly up to the limit, then spends a long time in GC. I have never seen such behaviour in 1.0.x and 1.1.x, where on an idle cluster java memory stays at the same value.

If you are not aware of a pre-existing JIRA, I strongly encourage you to:

1) Document your experience of this.
2) Search issues.apache.org for anything that sounds similar.
3) If you are unable to find a JIRA, file one.

Thanks!
=Rob
Re: Reduce Cassandra GC
Also can you take a heap dump at 2 diff points so that we can compare it? Also note that a promotion failure won't happen by a particular object, but by a fragmentation in Old Generation space. So I am not sure if you can't tell by a heap dump comparison. On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote: Can you paste you gc config? Also can you take a heap dump at 2 diff points so that we can compare it? Quick thing to do would be to do a histo live at 2 points and compare Sent from my iPhone On Jun 15, 2013, at 6:57 AM, Takenori Sato ts...@cloudian.com wrote: INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is 1046937600 This says GC for New Generation took so long. And this is usually unlikely. The only situation I am aware of is when a fairly large object is created, and which can not be promoted to Old Generation because it requires such a large *contiguous* memory space that is unavailable at the point in time. This is called promotion failure. So it has to wait until concurrent collector collects a large enough space. Thus you experience stop the world. But I think it is not stop the world, but only stop the new world. For example in case of Cassandra, a large number of in_memory_compaction_limit_in_mb can cause this. This is a limit when a compaction compacts(merges) rows of a key into the latest in memory. So this creates a large byte array up to the number. You can confirm this by enabling promotion failure GC logging in the future, and by checking compactions executed at that point in time. On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.comwrote: On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote: If you are talking about 1.2.x then I also have memory problems on the idle cluster: java memory constantly slow grows up to limit, then spend long time for GC. 
I have never seen such behaviour with 1.0.x or 1.1.x, where on an idle cluster the JVM memory stays at the same value. If you are not aware of a pre-existing JIRA, I strongly encourage you to: 1) Document your experience of this. 2) Search issues.apache.org for anything that sounds similar. 3) If you are unable to find a JIRA, file one. Thanks! =Rob
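The promotion-failure scenario described in the thread can be made concrete with a small sketch. This is not Cassandra code; the class and method names are invented for illustration, and 64 MB is just the commonly cited default for in_memory_compaction_limit_in_mb:

```java
// Hypothetical sketch (not Cassandra source): shows the size of the
// contiguous byte array an in-memory row merge could allocate for a
// given in_memory_compaction_limit_in_mb setting.
public class PromotionFailureSketch {
    // Worst-case contiguous allocation for an in-memory compaction of one row.
    static long worstCaseRowBufferBytes(int inMemoryCompactionLimitInMb) {
        return (long) inMemoryCompactionLimitInMb * 1024 * 1024;
    }

    public static void main(String[] args) {
        // With a 64 MB limit, a single row merge may need one 64 MB
        // *contiguous* block; if the Old Generation is fragmented, the
        // promotion fails and CMS must free space first, stalling ParNew.
        long bytes = worstCaseRowBufferBytes(64);
        System.out.println(bytes); // 67108864
    }
}
```

If such pauses correlate with compactions of wide rows, lowering the limit (or avoiding very wide rows) reduces the largest single allocation the collector has to place.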
Re: Cleanup understanding
But, that is still awkward. Does cleanup take so much disk space to complete the compaction operation? In other words, twice the size? Not really, but logically yes. According to the 1.0.7 source, cleanup checks whether there is enough space for the worst-case scenario, as below. If not, the exception you got is thrown.

/*
 * Add up all the file sizes; this is the worst-case file
 * size for compaction of the list of files given.
 */
public long getExpectedCompactedFileSize(Iterable<SSTableReader> sstables)
{
    long expectedFileSize = 0;
    for (SSTableReader sstable : sstables)
    {
        long size = sstable.onDiskLength();
        expectedFileSize = expectedFileSize + size;
    }
    return expectedFileSize;
}

On Wed, May 29, 2013 at 10:43 PM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Thanks for the answers. I got it. I was using cleanup because I thought it would delete the tombstones. But, that is still awkward. Does cleanup take so much disk space to complete the compaction operation? In other words, twice the size? *Atenciosamente,* *Víctor Hugo Molinar - *@vhmolinar http://twitter.com/#!/vhmolinar On Tue, May 28, 2013 at 9:55 PM, Takenori Sato(Cloudian) ts...@cloudian.com wrote: Hi Victor, As Andrey said, running cleanup doesn't work as you expect. The reason I need to clean things is that I won't need most of my inserted data the next day. Deleted objects (columns/records) become deletable from an sstable file once they expire (after gc_grace_seconds). Such deletable objects are actually removed by compaction. The tricky part is that a deletable object remains unless all of its old objects (for the same row key) are contained in the set of sstable files involved in the compaction. - Takenori (2013/05/29 3:01), Andrey Ilinykh wrote: cleanup removes data which doesn't belong to the current node. You have to run it only if you move (or add new) nodes. In your case there is no reason to do it.
On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Hello everyone. I have a daily maintenance task at C* which does: -truncate cfs -clearsnapshots -repair -cleanup The reason I need to clean things is that I won't need most of my inserted data the next day. It's kind of a business requirement. Well, the problem I'm running into is a misunderstanding about the cleanup operation. I have 2 nodes at less than half disk usage, which is more or less 13GB. But over the last few days, each node has arbitrarily reported a cleanup error indicating that the disk was full. Which is not true. *Error occured during cleanup* *java.util.concurrent.ExecutionException: java.io.IOException: disk full* So I'd like to know more about what happens during a cleanup operation. I appreciate any help.
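The worst-case space check quoted in the reply above can be tried out in isolation. The sketch below is a simplified stand-in, not the actual Cassandra code: it uses a plain list of file sizes instead of SSTableReader objects, and all names and numbers are hypothetical. It shows why cleanup over ~13GB of sstables can demand roughly 13GB of free disk, even if the rewritten files end up much smaller:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the pre-compaction disk check: cleanup rewrites
// every SSTable, so it reserves free space for the worst case, i.e. the
// sum of all input file sizes. Names here are illustrative.
public class CleanupSpaceCheck {
    static long expectedCompactedFileSize(List<Long> onDiskLengths) {
        long expected = 0;
        for (long size : onDiskLengths)
            expected += size;
        return expected;
    }

    static void checkSpace(long freeBytes, List<Long> onDiskLengths) {
        long needed = expectedCompactedFileSize(onDiskLengths);
        if (freeBytes < needed)
            // This condition is what surfaces as "java.io.IOException: disk full"
            throw new RuntimeException("disk full: need " + needed
                    + " bytes but only " + freeBytes + " free");
    }

    public static void main(String[] args) {
        // A node holding 7 GB + 6 GB of sstables needs ~13 GB free for
        // cleanup, regardless of how much data will actually survive.
        List<Long> sstables = Arrays.asList(7L << 30, 6L << 30);
        checkSpace(20L << 30, sstables); // passes: 20 GB free
    }
}
```

This matches the "not really, but logically yes" answer: the check is pessimistic by design, because the output size is unknown until the rewrite completes.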
Re: Cleanup understanding
Hi Victor, As Andrey said, running cleanup doesn't work as you expect. The reason I need to clean things is that I won't need most of my inserted data the next day. Deleted objects (columns/records) become deletable from an sstable file once they expire (after gc_grace_seconds). Such deletable objects are actually removed by compaction. The tricky part is that a deletable object remains unless all of its old objects (for the same row key) are contained in the set of sstable files involved in the compaction. - Takenori (2013/05/29 3:01), Andrey Ilinykh wrote: cleanup removes data which doesn't belong to the current node. You have to run it only if you move (or add new) nodes. In your case there is no reason to do it. On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar vhmoli...@gmail.com wrote: Hello everyone. I have a daily maintenance task at C* which does: -truncate cfs -clearsnapshots -repair -cleanup The reason I need to clean things is that I won't need most of my inserted data the next day. It's kind of a business requirement. Well, the problem I'm running into is a misunderstanding about the cleanup operation. I have 2 nodes at less than half disk usage, which is more or less 13GB. But over the last few days, each node has arbitrarily reported a cleanup error indicating that the disk was full. Which is not true. /Error occured during cleanup/ /java.util.concurrent.ExecutionException: java.io.IOException: disk full/ So I'd like to know more about what happens during a cleanup operation. I appreciate any help.
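The rule in the reply above — deletable after gc_grace_seconds, but only purgeable if the whole row is inside the compaction — can be sketched as a small predicate. This is an illustration, not Cassandra source; the names and timestamps are invented, and 864000 seconds (10 days) is the well-known default for gc_grace_seconds:

```java
// Illustrative sketch (not Cassandra source): a tombstone written at
// deletionTimeSeconds becomes a candidate for removal by compaction only
// after gc_grace_seconds have elapsed, AND only if no SSTable outside the
// compaction still holds older data for the same row key.
public class TombstoneGc {
    static boolean isDroppable(long deletionTimeSeconds,
                               long gcGraceSeconds,
                               long nowSeconds,
                               boolean rowFullyContainedInCompaction) {
        boolean expired = deletionTimeSeconds + gcGraceSeconds <= nowSeconds;
        return expired && rowFullyContainedInCompaction;
    }

    public static void main(String[] args) {
        long now = 1_700_000_000L;         // hypothetical clock, in seconds
        long gcGrace = 864_000L;           // default gc_grace_seconds (10 days)
        long elevenDaysAgo = now - 11L * 24 * 3600;
        // Expired, and every fragment of the row is in this compaction:
        System.out.println(isDroppable(elevenDaysAgo, gcGrace, now, true));  // true
        // Expired, but an older fragment lives in another SSTable:
        System.out.println(isDroppable(elevenDaysAgo, gcGrace, now, false)); // false
    }
}
```

The second case is why truncating and relying on cleanup does not reclaim tombstoned data: until a compaction happens to include every sstable holding that row key, the tombstone stays on disk.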
Re: CPU hotspot at BloomFilterSerializer#deserialize
Hi, We found this issue is specific to 1.0.1 through 1.0.8, and it was fixed in 1.0.9. https://issues.apache.org/jira/browse/CASSANDRA-4023 So by upgrading, we will see reasonable performance no matter how large a row we have! Thanks, Takenori (2013/02/05 2:29), aaron morton wrote: Yes, it contains a big row that goes up to 2GB with more than a million columns. I've run tests with 10 million small columns and seen reasonable performance. I've not looked at 1 million large columns. - BloomFilterSerializer#deserialize does readLong iteratively for each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2GB row (from the 1.0.7 source). There is only one Bloom filter per row in an SSTable, not one per column index/page. It could take a while if there are a lot of sstables in the read. nodetool cfhistograms will let you know; run it once to reset the counts, then do your test, then run it again. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/02/2013, at 4:13 AM, Edward Capriolo edlinuxg...@gmail.com wrote: It is interesting the press C* got about having 2 billion columns in a row. You *can* do it, but it brings to light some realities of what that means. On Sun, Feb 3, 2013 at 8:09 AM, Takenori Sato ts...@cloudian.com wrote: Hi Aaron, Thanks for your answers. That helped me get the big picture. Yes, it contains a big row that goes up to 2GB with more than a million columns. Let me confirm that I understand correctly. - The stack trace is from a Slice By Names query. And the deserialization is at step 3, Read the row level Bloom Filter, on your blog. - BloomFilterSerializer#deserialize does readLong iteratively for each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2GB row (from the 1.0.7 source). Correct?
That makes sense: Slice By Names queries against such a wide row could be a CPU bottleneck. In fact, in our test environment, a BloomFilterSerializer#deserialize of such a case takes more than 10ms, up to 100ms. Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order. Interesting. Could a query pattern make a difference? We thought the only solution was to change the data structure (don't use such a wide row if it is retrieved by Slice By Names queries). Anyway, will give it a try! Best, Takenori On Sat, Feb 2, 2013 at 2:55 AM, aaron morton aa...@thelastpickle.com wrote: 5. the problematic Data file contains only 5 to 10 keys but is large (2.4G) So, very large rows? What does nodetool cfstats or cfhistograms say about the row sizes? 1. what is happening? I think this is partially large rows and partially the query pattern; this is only roughly correct http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk here http://www.datastax.com/events/cassandrasummit2012/presentations 3. any more info required to proceed? Do some tests with different query techniques… Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote: Hi all, We have a situation where the CPU load on some of our nodes in the cluster has spiked occasionally since last November, triggered by requests for rows that reside on two specific sstables. We confirmed the following (when spiked): version: 1.0.7 (current) - 0.8.6 - 0.8.5 - 0.7.8 jdk: Oracle 1.6.0 1.
a profiling showed that BloomFilterSerializer#deserialize was the hotspot (70% of the total load by running threads)
* the stack trace looked like this (simplified):
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
 ...
 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
 ...
 89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
 ...
 79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
 68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
 66.7% - java.io.DataInputStream.readLong
2. usually, (1) should be so fast that profiling by sampling cannot detect it
3. no pressure on Cassandra's VM heap nor on the machine overall
4. a little I/O traffic for our 8 disks/node (up to 100 tps/disk by iostat 1 1000)
5. the problematic Data file contains only 5 to 10 keys but is large (2.4G)
6. the problematic Filter file size is only 256B (could be normal)
So now, I am trying to read the Filter file in the same way BloomFilterSerializer#deserialize does, as closely as I can
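The "500,000 readLong calls for a 2GB row" figure discussed in this thread is simple arithmetic. A tiny sketch (names invented for illustration) reproduces it:

```java
// Back-of-the-envelope for the thread's claim: if deserialization performs
// one readLong per 4 KB page of a row, a 2 GB row implies roughly half a
// million readLong calls per query. This only reproduces the arithmetic.
public class BloomFilterCost {
    static long readLongCalls(long rowSizeBytes, long pageSizeBytes) {
        return rowSizeBytes / pageSizeBytes;
    }

    public static void main(String[] args) {
        long calls = readLongCalls(2L * 1024 * 1024 * 1024, 4 * 1024);
        System.out.println(calls); // 524288, i.e. ~500,000 for a 2 GB row
    }
}
```

Per-call cost in the 10–100ms range for the whole deserialization, repeated on every Slice By Names read of the row, is consistent with the CPU spikes observed.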
Re: CPU hotspot at BloomFilterSerializer#deserialize
Hi Aaron, Thanks for your answers. That helped me get the big picture. Yes, it contains a big row that goes up to 2GB with more than a million columns. Let me confirm that I understand correctly. - The stack trace is from a Slice By Names query. And the deserialization is at step 3, Read the row level Bloom Filter, on your blog. - BloomFilterSerializer#deserialize does readLong iteratively for each page of size 4K for a given row, which means it could be 500,000 loops (calls to readLong) for a 2GB row (from the 1.0.7 source). Correct? That makes sense: Slice By Names queries against such a wide row could be a CPU bottleneck. In fact, in our test environment, a BloomFilterSerializer#deserialize of such a case takes more than 10ms, up to 100ms. Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order. Interesting. Could a query pattern make a difference? We thought the only solution was to change the data structure (don't use such a wide row if it is retrieved by Slice By Names queries). Anyway, will give it a try! Best, Takenori On Sat, Feb 2, 2013 at 2:55 AM, aaron morton aa...@thelastpickle.com wrote: 5. the problematic Data file contains only 5 to 10 keys but is large (2.4G) So, very large rows? What does nodetool cfstats or cfhistograms say about the row sizes? 1. what is happening? I think this is partially large rows and partially the query pattern; this is only roughly correct http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk here http://www.datastax.com/events/cassandrasummit2012/presentations 3. any more info required to proceed? Do some tests with different query techniques… Get a single named column. Get the first 10 columns using the natural column order. Get the last 10 columns using the reversed order. Hope that helps.
- Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote: Hi all, We have a situation where the CPU load on some of our nodes in the cluster has spiked occasionally since last November, triggered by requests for rows that reside on two specific sstables. We confirmed the following (when spiked):
version: 1.0.7 (current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0
1. a profiling showed that BloomFilterSerializer#deserialize was the hotspot (70% of the total load by running threads)
* the stack trace looked like this (simplified):
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
 ...
 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
 ...
 89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
 ...
 79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
 68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
 66.7% - java.io.DataInputStream.readLong
2. usually, (1) should be so fast that profiling by sampling cannot detect it
3. no pressure on Cassandra's VM heap nor on the machine overall
4. a little I/O traffic for our 8 disks/node (up to 100 tps/disk by iostat 1 1000)
5. the problematic Data file contains only 5 to 10 keys but is large (2.4G)
6. the problematic Filter file size is only 256B (could be normal)
So now, I am trying to read the Filter file in the same way BloomFilterSerializer#deserialize does, as closely as I can, in order to see if something is wrong with the file. Could you give me some advice on:
1. what is happening?
2. the best way to simulate BloomFilterSerializer#deserialize
3. any more info required to proceed?
Thanks, Takenori
CPU hotspot at BloomFilterSerializer#deserialize
Hi all, We have a situation where the CPU load on some of our nodes in the cluster has spiked occasionally since last November, triggered by requests for rows that reside on two specific sstables. We confirmed the following (when spiked):
version: 1.0.7 (current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0
1. a profiling showed that BloomFilterSerializer#deserialize was the hotspot (70% of the total load by running threads)
* the stack trace looked like this (simplified):
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
 ...
 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
 ...
 89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
 ...
 79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
 68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
 66.7% - java.io.DataInputStream.readLong
2. usually, (1) should be so fast that profiling by sampling cannot detect it
3. no pressure on Cassandra's VM heap nor on the machine overall
4. a little I/O traffic for our 8 disks/node (up to 100 tps/disk by iostat 1 1000)
5. the problematic Data file contains only 5 to 10 keys but is large (2.4G)
6. the problematic Filter file size is only 256B (could be normal)
So now, I am trying to read the Filter file in the same way BloomFilterSerializer#deserialize does, as closely as I can, in order to see if something is wrong with the file. Could you give me some advice on:
1. what is happening?
2. the best way to simulate BloomFilterSerializer#deserialize
3. any more info required to proceed?
Thanks, Takenori
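As a starting point for question 2 (simulating the deserialization), a minimal sketch that reads a file long-by-long through DataInputStream, the way the profiled hotspot does. The file layout here is an invented stand-in, not the real Filter file format:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Minimal sketch: write some longs to a temp file, then read them back one
// by one via DataInputStream.readLong, mirroring the hot path in the
// profile. The layout is illustrative, not the actual Filter format.
public class ReadLongSimulation {
    static long writeThenReadSum(int count) throws IOException {
        File f = File.createTempFile("filter-sim", ".db");
        f.deleteOnExit();

        // Stand-in for serialized bloom filter words.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (long i = 0; i < count; i++)
                out.writeLong(i);
        }

        // Read them back one at a time, as BloomFilterSerializer-style code would.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            for (int i = 0; i < count; i++)
                sum += in.readLong();
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeThenReadSum(1024)); // 523776 = 0 + 1 + ... + 1023
    }
}
```

Timing this loop for counts in the hundreds of thousands would approximate the per-query cost the profile attributes to readLong; comparing against the real 256B Filter file would then show whether the file itself is anomalous.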