Re: Re[2]: how wide can wide rows get?

2014-11-13 Thread Takenori Sato
We have up to a few hundred million columns in a super wide row.

There are two major issues you should care about.

1. the wider the row is, the more memory pressure you get for every slice
query
2. repair is row based, which means a huge row could be transferred at
every repair

1 is not a big issue if you don't have many concurrent slice requests.
Having more cores is a good investment to reduce memory pressure.

2  could cause very high memory pressure as well as poorer disk utilization.
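
For illustration, here is a minimal sketch (assuming the Hector 1.x client API
mentioned elsewhere in these threads) of paging through a very wide row in
bounded slices instead of one huge slice, which keeps the per-request memory
pressure on the server bounded. The class name, page size and column family
are illustrative only.

import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class WideRowPager {
    private static final int PAGE_SIZE = 1000; // columns per slice; tune to taste

    // Page through one wide row, PAGE_SIZE columns at a time.
    public static void readWideRow(Keyspace keyspace, String columnFamily, String rowKey) {
        StringSerializer s = StringSerializer.get();
        String start = "";          // "" means "unbounded" in a Thrift slice range
        boolean skipFirst = false;  // each page starts with the previous page's last column
        while (true) {
            SliceQuery<String, String, String> query =
                    HFactory.createSliceQuery(keyspace, s, s, s);
            query.setColumnFamily(columnFamily);
            query.setKey(rowKey);
            query.setRange(start, "", false, PAGE_SIZE);
            List<HColumn<String, String>> columns = query.execute().get().getColumns();
            for (int i = skipFirst ? 1 : 0; i < columns.size(); i++) {
                HColumn<String, String> column = columns.get(i);
                // process column.getName() / column.getValue() here
            }
            if (columns.size() < PAGE_SIZE) {
                break; // last page
            }
            start = columns.get(columns.size() - 1).getName();
            skipFirst = true;
        }
    }
}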


On Fri, Nov 14, 2014 at 3:21 PM, Plotnik, Alexey aplot...@rhonda.ru wrote:

  We have 380k of them in some of our rows and it's ok.

 -- Original Message --
 From: Hannu Kröger hkro...@gmail.com
 To: user@cassandra.apache.org user@cassandra.apache.org
 Sent: 14.11.2014 16:13:49
 Subject: Re: how wide can wide rows get?


 The theoretical limit is maybe 2 billion but recommended max is around
 10-20 thousand.

 Br,
 Hannu

 On 14.11.2014, at 8.10, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   I’m struggling with this wide row business. Is there an upward limit on
 the number of columns you can have?

 Adaryl Bob Wakefield, MBA
 Principal
 Mass Street Analytics
 913.938.6685
 www.linkedin.com/in/bobwakefieldmba
 Twitter: @BobLovesData




A fix for those who suffer from GC storm by tombstones

2014-10-07 Thread Takenori Sato
Hi,

I have filed a fix as CASSANDRA-8038, which should be good news for those
who have suffered from overwhelming GC or OOM caused by tombstones.

I'd appreciate your feedback!

Thanks,
Takenori


Re: A fix for those who suffer from GC storm by tombstones

2014-10-07 Thread Takenori Sato
DuyHai and Rob, thanks for your feedback.

Yeah, that's exactly the point I found. Some may want to run read repair even 
on tombstones as before, but others, like Rob and us, may not.

Personally, I consider read repair a nice-to-have feature, especially for 
tombstones, where a regular repair is enforced anyway.

So with this fix, I expect that a user can choose a better, manageable risk as 
needed. The good news is that the performance improvement is significant!

- Takenori

Sent from my iPhone

On 2014/10/08 3:18, Robert Coli rc...@eventbrite.com wrote:

 
 On Tue, Oct 7, 2014 at 1:57 AM, DuyHai Doan doanduy...@gmail.com wrote:
  Read Repair belongs to the Anti-Entropy procedures to ensure that 
 eventually, data from all replicas do converge. Tombstones are data 
 (deletion marker) so they need to be exchanged between replicas. By skipping 
 tombstone you prevent the data convergence with regard to deletion. 
 
 Read repair is an optimization. I would probably just disable it in OP's case 
 and rely entirely on AES repair, because the 8303 approach makes read repair 
 not actually repair in some cases...
 
 =Rob
  


Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram

2014-03-11 Thread Takenori Sato
In addition to the suggestions by Jonathan, you can run a user-defined
compaction against a particular set of SSTable files from which you want to
remove tombstones.

But to do that, you need to find such an optimal set. Here you can find a
couple of helpful tools.

https://github.com/cloudian/support-tools


On Mon, Mar 10, 2014 at 7:41 PM, Oleg Dulin oleg.du...@gmail.com wrote:

 I get that :)

 What I'd like to know is how to fix that :)


 On 2014-03-09 20:24:54 +, Takenori Sato said:

  You have millions of org.apache.cassandra.db.DeletedColumn instances on
 the snapshot.

 This means you have lots of column tombstones, which, I guess, are read
 into memory by slice queries.


 On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin oleg.du...@gmail.com wrote:
 I am trying to understand why one of my nodes keeps full GC.

 I have Xmx set to 8gigs, memtable total size is 2 gigs.

 Consider the top entries from jmap -histo:live @
 http://pastebin.com/UaatHfpJ

 --
 Regards,
 Oleg Dulin
 http://www.olegdulin.com



 --
 Regards,
 Oleg Dulin
 http://www.olegdulin.com





Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram

2014-03-09 Thread Takenori Sato
You have millions of org.apache.cassandra.db.DeletedColumn instances on the
snapshot.

This means you have lots of column tombstones, which, I guess, are read
into memory by slice queries.


On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin oleg.du...@gmail.com wrote:

 I am trying to understand why one of my nodes keeps full GC.

 I have Xmx set to 8gigs, memtable total size is 2 gigs.

 Consider the top entries from jmap -histo:live @
 http://pastebin.com/UaatHfpJ

 --
 Regards,
 Oleg Dulin
 http://www.olegdulin.com





Re: Recommended amount of free disk space for compaction

2013-11-29 Thread Takenori Sato
Hi,

 If Cassandra only compacts one table at a time, then I should be safe if
I keep as much free space as there is data in the largest table. If
Cassandra can compact multiple tables simultaneously, then it seems that I
need as much free space as all the tables put together, which means no more
than 50% utilization.

It depends on your configuration: one concurrent compactor per CPU core by
default. See concurrent_compactors for details.

 Also, what happens if a node gets low on disk space and there isn’t
enough available for compaction?

A compaction checks whether there is enough disk space, based on its estimate
of the resulting size. Otherwise, it won't be executed.

 Is there a way to salvage a node that gets into a state where it cannot
compact its tables?

If you carefully run some cleanups, then you'll get some room based on its
new range.
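
Putting the above together, a back-of-the-envelope sketch (my own illustration,
not a formula from Cassandra): with SizeTiered compaction, and little data
actually purged, each running compaction may temporarily need about as much
free space as the data it is compacting, so the worst case is roughly the sum
of the largest column families up to the number of concurrent compactors.

public class CompactionHeadroom {
    // Rough worst case: with N concurrent compactors, assume each one is
    // compacting one of the N largest column families at the same time.
    public static long worstCaseHeadroomBytes(long[] cfSizesBytesDescending,
                                              int concurrentCompactors) {
        long headroom = 0;
        int n = Math.min(concurrentCompactors, cfSizesBytesDescending.length);
        for (int i = 0; i < n; i++) {
            headroom += cfSizesBytesDescending[i];
        }
        return headroom;
    }
}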


On Fri, Nov 29, 2013 at 12:21 PM, Robert Wille rwi...@fold3.com wrote:

 I’m trying to estimate our disk space requirements and I’m wondering about
 disk space required for compaction.

 My application mostly inserts new data and performs updates to existing
 data very infrequently, so there will be very few bytes removed by
 compaction. It seems that if a major compaction occurs, that performing the
 compaction will require as much disk space as is currently consumed by the
 table.

 So here’s my question. If Cassandra only compacts one table at a time,
 then I should be safe if I keep as much free space as there is data in the
 largest table. If Cassandra can compact multiple tables simultaneously,
 then it seems that I need as much free space as all the tables put
 together, which means no more than 50% utilization. So, how much free space
 do I need? Any rules of thumb anyone can offer?

 Also, what happens if a node gets low on disk space and there isn’t enough
 available for compaction? If I add new nodes to reduce the amount of data
 on each node, I assume the space won’t be reclaimed until a compaction
 event occurs. Is there a way to salvage a node that gets into a state where
 it cannot compact its tables?

 Thanks

 Robert




Re: Tracing Queries at Cassandra Server

2013-11-10 Thread Takenori Sato
In addition to CassandraServer, add StorageProxy for details as follows.

log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG

Hope it helps.


On Mon, Nov 11, 2013 at 11:25 AM, Srinath Perera srin...@wso2.com wrote:

 I am talking to Cassandra using Hector. Is there a way that I can trace
 the executed queries at the server?

  I have tried enabling DEBUG logging for
  org.apache.cassandra.thrift.CassandraServer as mentioned in Cassandra vs
  logging activity
  (http://stackoverflow.com/questions/9604554/cassandra-vs-logging-activity).
  But that does not provide much info (e.g. it says a slice query executed, but
  does not give more info).

 What I look for is something like SQL tracing in MySQL, so all queries
 executed are logged.

 --Srinath





Re: Cass 1.1.11 out of memory during compaction ?

2013-11-04 Thread Takenori Sato
I would go with cleanup.

Be careful of this bug:
https://issues.apache.org/jira/browse/CASSANDRA-5454


On Mon, Nov 4, 2013 at 9:05 PM, Oleg Dulin oleg.du...@gmail.com wrote:

 If i do that, wouldn't I need to scrub my sstables ?


 Takenori Sato ts...@cloudian.com wrote:
  Try increasing column_index_size_in_kb.
 
  A slice query over a range (SliceFromReadCommand) requires reading all of the
  column index entries for the row, thus could hit OOM if you have a
 very wide row.
 
  On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin oleg.du...@gmail.com
 wrote:
 
  Cass 1.1.11 ran out of memory on me with this exception (see below).
 
  My parameters are 8gig heap, new gen is 1200M.
 
  ERROR [ReadStage:55887] 2013-11-02 23:35:18,419
  AbstractCassandraDaemon.java (line 132) Exception in thread
  Thread[ReadStage:55887,5,main]
  java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
    at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
    at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
    at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
    at org.apache.cassandra.db.Table.getRow(Table.java:378)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
    at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
 
  Any thoughts ?
 
  This is a dual data center set up, with 4 nodes in each DC and RF=2 in
 each.
 
  --
  Regards,
  Oleg Dulin http://www.olegdulin.com




Re: Cass 1.1.11 out of memory during compaction ?

2013-11-03 Thread Takenori Sato
Try increasing column_index_size_in_kb.

A slice query over a range (SliceFromReadCommand) requires reading all of the
column index entries for the row, thus could hit OOM if you have a very wide
row.
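
To see why column_index_size_in_kb matters, a rough illustration (my own
numbers, not from this thread): the per-row column index holds roughly one
entry per column_index_size_in_kb of serialized row data, and a range slice on
that row deserializes the whole index.

public class ColumnIndexEstimate {
    // Roughly one column index entry per column_index_size_in_kb of row data.
    public static long indexEntries(long rowSizeBytes, int columnIndexSizeInKb) {
        return rowSizeBytes / (columnIndexSizeInKb * 1024L);
    }

    public static void main(String[] args) {
        long rowSize = 2L * 1024 * 1024 * 1024;          // a 2 GB wide row
        System.out.println(indexEntries(rowSize, 64));   // default 64 KB  -> 32768 entries
        System.out.println(indexEntries(rowSize, 512));  // raised to 512 KB -> 4096 entries
    }
}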



On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin oleg.du...@gmail.com wrote:

 Cass 1.1.11 ran out of memory on me with this exception (see below).

 My parameters are 8gig heap, new gen is 1200M.

 ERROR [ReadStage:55887] 2013-11-02 23:35:18,419
 AbstractCassandraDaemon.java (line 132) Exception in thread
 Thread[ReadStage:55887,5,main]
 java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
    at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
    at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
    at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
    at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
    at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
    at org.apache.cassandra.db.Table.getRow(Table.java:378)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
    at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)


 Any thoughts ?

 This is a dual data center set up, with 4 nodes in each DC and RF=2 in
 each.


 --
 Regards,
 Oleg Dulin
 http://www.olegdulin.com





Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato
 So in fact, incremental backup of Cassandra is just hard link all the new
SSTable files being generated during the incremental backup period. It
could contain any data, not just the data being update/insert/delete in
this period, correct?

Correct.

But over time, some old enough SSTable files are usually shared across
multiple snapshots.


On Wed, Sep 18, 2013 at 3:37 AM, java8964 java8964 java8...@hotmail.com wrote:

 Another question related to the SSTable files generated in the incremental
 backup is not really ONLY incremental delta, right? It will include more
 than delta in the SSTable files.

 I will use the example to show my question:

 first, we have this data in the SSTable file 1:

 rowkey(1), columns (maker=honda).

 later, if we add one column in the same key:

 rowkey(1), columns (maker=honda, color=blue)

 The data above being flushed to another SSTable file 2. In this case, it
 will be part of the incremental backup at this time. But in fact, it will
 contain both old data (make=honda), plus new changes (color=blue).

 So in fact, incremental backup of Cassandra is just hard link all the new
 SSTable files being generated during the incremental backup period. It
 could contain any data, not just the data being update/insert/delete in
 this period, correct?

 Thanks

 Yong

  From: dean.hil...@nrel.gov
  To: user@cassandra.apache.org
  Date: Tue, 17 Sep 2013 08:11:36 -0600

  Subject: Re: questions related to the SSTable file
 
  Netflix created file streaming in astyanax into cassandra specifically
 because writing too big a column cell is a bad thing. The limit is really
 dependent on use case….do you have servers writing 1000's of 200Meg files
 at the same time….if so, astyanax streaming may be a better way to go there
 where it divides up the file amongst cells and rows.
 
  I know the limit of a row size is really your hard disk space and the
 column count if I remember goes into billions though realistically, I think
 beyond 10 million might slow down a bit….all I know is we tested up to 10
 million columns with no issues in our use-case.
 
  So you mean at this time, I could get 2 SSTable files, both contain
 column Blue for the same row key, right?
 
  Yes
 
  In this case, I should be fine as value of the Blue column contain the
 timestamp to help me to find out which is the last change, right?
 
  Yes
 
  In MR world, each file COULD be processed by different Mapper, but will
 be sent to the same reducer as both data will be shared same key.
 
  If that is the way you are writing it, then yes
 
  Dean
 
  From: Shahab Yunus shahab.yu...@gmail.com
  Reply-To: user@cassandra.apache.org
  Date: Tuesday, September 17, 2013 7:54 AM
  To: user@cassandra.apache.org
  Subject: Re: questions related to the SSTable file
 
  derstand if following changes apply to the same row key as above
 example, additional SSTable file could be generated. That is



Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato(Cloudian)

Thanks, Rob, for clarifying!

- Takenori

(2013/09/18 10:01), Robert Coli wrote:
On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato ts...@cloudian.com wrote:


 So in fact, incremental backup of Cassandra is just hard link
all the new SSTable files being generated during the incremental
backup period. It could contain any data, not just the data being
update/insert/delete in this period, correct?

Correct.

But over time, some old enough SSTable files are usually shared
across multiple snapshots.


To be clear, incremental backup feature backs up the data being 
modified in that period, because it writes only those files to the 
incremental backup dir as hard links, between full snapshots.


http://www.datastax.com/docs/1.0/operations/backup_restore

When incremental backups are enabled (disabled by default), Cassandra 
hard-links each flushed SSTable to a backups directory under the 
keyspace data directory. This allows you to store backups offsite 
without transferring entire snapshots. Also, incremental backups 
combine with snapshots to provide a dependable, up-to-date backup 
mechanism.



What Takenori is referring to is that a full snapshot is in some ways 
an incremental backup because it shares hard linked SSTables with 
other snapshots.


=Rob




Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato
Yong,

It seems there is still a misunderstanding.

 But there is no way we can be sure that these SSTable files will ONLY
contain modified data. So the statement being quoted above is not exactly
right. I agree that all the modified data in that period will be in the
incremental sstable files, but a lot of other unmodified data will be in
them too.

A memtable (which becomes a new SSTable when flushed) contains only modified
data, as I explained in the example.

 If we have 2 rows data with different row key in the same memtable, and
if only 2nd row being modified. When the memtable is flushed to SSTable
file, it will contain both rows, and both will be in the incremental backup
files. So for first row, nothing change, but it will be in the incremental
backup.

Unless the first row is modified, it does not exist in the memtable at all.

 If I have one row with one column, now a new column is added, and whole
row in one memtable being flushed to SSTable file, as also in this
incremental backup. For first column, nothing change, but it will still be
in incremental backup file.

For example, if it worked the way you understand it, then Color-2 should
contain two more rows, Lavender and Blue, with their existing column, hex, like
the following. But it does not.

- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]

-- under your understanding, Color-2 would instead be:
- Color-2-Data.db: [{Lavender: {hex: #E6E6FA}}, {Green: {hex: #008000}},
{Blue: {hex: #0000FF, hex2: #2c86ff}}]
* the row Lavender and the column Blue:hex have no changes


 The point I tried to make is this is important if I design an ETL to
consume the incremental backup SSTable files. As above example, I have to
realize that in the incremental backup sstable files, they could or most
likely contain old data which was previous being processed already. That
will require additional logic and responsibility in the ETL to handle it,
or any outsider SSTable consumer to pay attention to it.

I suggest trying org.apache.cassandra.tools.SSTableExport (sstable2json); then
you will see what's going on under the hood.

- Takenori








On Wed, Sep 18, 2013 at 10:51 AM, java8964 java8964 java8...@hotmail.com wrote:

 Quote:

 
 To be clear, incremental backup feature backs up the data being modified
 in that period, because it writes only those files to the incremental
 backup dir as hard links, between full snapshots.
 

 I thought I was clearer, but your clarification confused me again.
 My understanding, from all the answers I have got so far, is that the more
 accurate statement of incremental backup should be: the incremental backup
 feature backs up the SSTable files generated in that period.

 But there is no way we can be sure that these SSTable files will ONLY
 contain modified data. So the statement being quoted above is not exactly
 right. I agree that all the modified data in that period will be in the
 incremental sstable files, but a lot of other unmodified data will be in
 them too.

 If we have 2 rows data with different row key in the same memtable, and if
 only 2nd row being modified. When the memtable is flushed to SSTable file,
 it will contain both rows, and both will be in the incremental backup
 files. So for first row, nothing change, but it will be in the incremental
 backup.

 If I have one row with one column, now a new column is added, and whole
 row in one memtable being flushed to SSTable file, as also in this
 incremental backup. For first column, nothing change, but it will still be
 in incremental backup file.

 The point I tried to make is this is important if I design an ETL to
 consume the incremental backup SSTable files. As above example, I have to
 realize that in the incremental backup sstable files, they could or most
 likely contain old data which was previous being processed already. That
 will require additional logic and responsibility in the ETL to handle it,
 or any outsider SSTable consumer to pay attention to it.

 Yong

 --
 Date: Tue, 17 Sep 2013 18:01:45 -0700

 Subject: Re: questions related to the SSTable file
 From: rc...@eventbrite.com
 To: user@cassandra.apache.org


 On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato ts...@cloudian.com wrote:

  So in fact, incremental backup of Cassandra is just hard link all the
 new SSTable files being generated during the incremental backup period. It
 could contain any data, not just the data being update/insert/delete in
 this period, correct?

 Correct.

 But over time, some old enough SSTable files are usually shared across
 multiple snapshots.


 To be clear, incremental backup feature backs up the data being modified
 in that period, because it writes only those files to the incremental
 backup dir as hard links, between full snapshots.

 http://www.datastax.com/docs/1.0/operations/backup_restore
 
 When incremental backups are enabled (disabled by default), Cassandra
 hard-links each flushed SSTable to a backups directory under

Re: questions related to the SSTable file

2013-09-16 Thread Takenori Sato(Cloudian)

Hi,

 1) I will expect same row key could show up in both sstable2json 
output, as this one row exists in both SSTable files, right?


Yes.

 2) If so, what is the boundary? Will Cassandra guarantee the column 
level as the boundary? What I mean is that for one column's data, it 
will be guaranteed to be either in the first file, or 2nd file, right? 
There is no chance that Cassandra will cut the data of one column into 2 
part, and one part stored in first SSTable file, and the other part 
stored in second SSTable file. Is my understanding correct?


No.

 3) If what we are talking about are only the SSTable files in 
snapshot, incremental backup SSTable files, exclude the runtime SSTable 
files, will anything change? For snapshot or incremental backup SSTable 
files, first can one row data still may exist in more than one SSTable 
file? And any boundary change in this case?
 4) If I want to use incremental backup SSTable files as the way to 
catch data being changed, is it a good way to do what I try to archive? 
In this case, what happen in the following example:


I don't fully understand, but snapshot will do. It will create hard 
links to all the SSTable files present at snapshot.



Let me explain how SSTable and compaction works.

Suppose we have 4 files being compacted (the last one has just been flushed,
which triggered the compaction). Note that file names are simplified.


- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
- Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}}, 
{Blue: {}}]

- Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]

They are created by the following operations.

- Add a row of (key, column, column_value = Blue, hex, #0000FF)
- Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
 memtable is flushed => Color-1-Data.db 
- Add a row of (key, column, column_value = Green, hex, #008000)
- Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
 memtable is flushed => Color-2-Data.db 
- Add a column of (key, column, column_value = Green, hex2, #32CD32)
- Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
- Delete a row of (key = Blue)
 memtable is flushed => Color-3-Data.db 
- Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
- Add a row of (key, column, column_value = Gold, hex, #FFD700)
 memtable is flushed => Color-4-Data.db 

Then, a compaction will merge all those fragments together into the 
latest ones as follows.


- Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00FFFF}}, 
{Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, 
{Gold: {hex: #FFD700}}]

* assuming RandomPartitioner is used
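
As a side note, here is a toy sketch of the merge rule the compaction above
applies (an illustration only, not Cassandra's code): for each column, the
fragment with the newest timestamp wins, and deletion markers behave like any
other write until they are purged after gc_grace_seconds.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    static final class Cell {
        final String name;
        final String value;   // null marks a tombstone
        final long timestamp;
        Cell(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Merge the fragments of ONE row coming from several SSTables:
    // for each column name, the cell with the newest timestamp wins.
    static Map<String, Cell> merge(List<List<Cell>> fragments) {
        Map<String, Cell> latest = new HashMap<String, Cell>();
        for (List<Cell> fragment : fragments) {
            for (Cell cell : fragment) {
                Cell current = latest.get(cell.name);
                if (current == null || cell.timestamp > current.timestamp) {
                    latest.put(cell.name, cell); // newest write or deletion wins
                }
            }
        }
        // A real compaction also applies row-level deletion markers by timestamp,
        // and only purges tombstones that are older than gc_grace_seconds.
        return latest;
    }
}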

Hope they would help.

- Takenori

(2013/09/17 10:51), java8964 java8964 wrote:
Hi, I have some questions related to the SSTable in the Cassandra, as 
I am doing a project to use it and hope someone in this list can share 
some thoughts.


My understanding is that an SSTable is per column family, but each column 
family could have multiple SSTable files. During runtime, one row COULD be 
split across more than one SSTable file; even though this is not good for 
performance, it does happen, and Cassandra will try to merge and store one 
row's data into one SSTable file during compaction.


The question is when one row is split in multi SSTable files, what is 
the boundary? Or let me ask this way, if one row exists in 2 SSTable 
files, if I run sstable2json tool to run on both SSTable files 
individually:


1) I will expect same row key could show up in both sstable2json 
output, as this one row exists in both SSTable files, right?
2) If so, what is the boundary? Will Cassandra guarantee the column 
level as the boundary? What I mean is that for one column's data, it 
will be guaranteed to be either in the first file, or 2nd file, right? 
There is no chance that Cassandra will cut the data of one column into 
2 part, and one part stored in first SSTable file, and the other part 
stored in second SSTable file. Is my understanding correct?
3) If what we are talking about are only the SSTable files in 
snapshot, incremental backup SSTable files, exclude the runtime 
SSTable files, will anything change? For snapshot or incremental 
backup SSTable files, first can one row data still may exist in more 
than one SSTable file? And any boundary change in this case?
4) If I want to use incremental backup SSTable files as the way to 
catch data being changed, is it a good way to do what I try to 
archive? In this case, what happen in the following example:


For column family A:
at Time 0, one row key (key1) has some data. It is being stored and 
back up in SSTable file 1.
at Time 1, if any column for key1 has any change (a new column insert, 
a column updated/deleted, or even whole row being deleted), I will 
expect this whole row exists in the any incremental backup SSTable 
files 

/proc/sys/vm/zone_reclaim_mode

2013-09-09 Thread Takenori Sato
Hi,

I am investigating NUMA issues.

I am aware that bin/cassandra tries to use the interleave-all NUMA policy if
available.

https://issues.apache.org/jira/browse/CASSANDRA-2594
https://issues.apache.org/jira/browse/CASSANDRA-3245

So what about /proc/sys/vm/zone_reclaim_mode? Any recommendations? I didn't
find any with respect to Cassandra.

By default on a Linux NUMA machine, this is set to 1, which tries to reclaim
pages within a zone rather than acquiring pages from other zones.

Explicitly disabling this sounds better.

It may be beneficial to switch off zone reclaim if the system is used for
a file server and all of memory should be used for caching files from disk.
In that case the caching effect is more important than data locality.
https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Thanks!
Takenori


Re: Random Distribution, yet Order Preserving Partitioner

2013-08-27 Thread Takenori Sato
Hi Manoj,

Thanks for your advise.

More or less, we basically do the same. As you pointed out, we now face many
cases that cannot be solved by data modeling, with rows reaching 100 million
columns.

We could split them into multiple metadata rows, but that would bring more
complexity and thus be error prone. If possible, we want to avoid that.

- Takenori

On 2013/08/27 21:37, Manoj Mainali mainalima...@gmail.com wrote:

Hi Takenori,

I can't tell for sure without knowing what kind of data you have and how
much you have. You can use the random partitioner and use the concept of a
metadata row that stores the row keys, for example like below:

{metadata_row}: key1 | key2 | key3
key1:column1 | column2

 When you do the read you can always directly query by the key, if you
already know it. In the case of range queries, first you query the
metadata_row and get the keys you want in the ordered fashion. Then you can
do multi_get to get you actual data.

The downside is you have to do two read queries, and depending on how much
data you have you will end up with a wide metadata row.
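
A rough sketch of that two-step read, assuming the Hector 1.x API (the column
family and key names are hypothetical): first slice the metadata row to get the
keys in order, then fetch each data row.

import java.util.ArrayList;
import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class MetadataRowRead {
    public static List<ColumnSlice<String, String>> readInOrder(
            Keyspace keyspace, String cf, String metadataRowKey, int maxKeys) {
        StringSerializer s = StringSerializer.get();

        // Step 1: the metadata row's column names are the data row keys, kept in order.
        SliceQuery<String, String, String> keyQuery =
                HFactory.createSliceQuery(keyspace, s, s, s);
        keyQuery.setColumnFamily(cf);
        keyQuery.setKey(metadataRowKey);
        keyQuery.setRange("", "", false, maxKeys);
        List<HColumn<String, String>> keyColumns = keyQuery.execute().get().getColumns();

        // Step 2: fetch each data row by its key (a multiget would cut round trips).
        List<ColumnSlice<String, String>> rows = new ArrayList<ColumnSlice<String, String>>();
        for (HColumn<String, String> keyColumn : keyColumns) {
            SliceQuery<String, String, String> rowQuery =
                    HFactory.createSliceQuery(keyspace, s, s, s);
            rowQuery.setColumnFamily(cf);
            rowQuery.setKey(keyColumn.getName());
            rowQuery.setRange("", "", false, 10000); // page this too if rows are wide
            rows.add(rowQuery.execute().get());
        }
        return rows;
    }
}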

Manoj


On Fri, Aug 23, 2013 at 8:47 AM, Takenori Sato ts...@cloudian.com wrote:

 Hi Nick,

  token and key are not same. it was like this long time ago (single MD5
 assumed single key)

 True. That reminds me of making a test with the latest 1.2 instead of our
 current 1.0!

  if you want ordered, you probably can arrange your data in a way so you
 can get it in ordered fashion.

 Yeah, we have done for a long time. That's called a wide row, right? Or a
 compound primary key.

 It can handle some millions of columns, but not many more, say 10M. I mean, a
 request for such a row concentrates on a particular node, so the
 performance degrades.

  I also had idea for semi-ordered partitioner - instead of single MD5,
 to have two MD5's.

 Sounds interesting. But, we need a fully ordered result.

 Anyway, I will try with the latest version.

 Thanks,
 Takenori


 On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov n...@nmmm.nu wrote:

 my five cents -
 token and key are not same. it was like this long time ago (single MD5
 assumed single key)

 if you want ordered, you probably can arrange your data in a way so you
 can get it in ordered fashion.
 for example long ago, i had single column family with single key and
 about 2-3 M columns - I do not suggest you to do it this way, because is
 wrong way, but it is easy to understand the idea.

 I also had idea for semi-ordered partitioner - instead of single MD5, to
 have two MD5's.
 then you can get semi-ordered ranges, e.g. you get ordered all cities in
 Canada, all cities in US and so on.
 however in this way things may get pretty non-ballanced

 Nick





 On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato ts...@cloudian.com wrote:

 Hi,

 I am trying to implement a custom partitioner that evenly distributes,
 yet preserves order.

  The partitioner returns a token as a BigInteger, as RandomPartitioner does,
  while it returns a decorated key as a string, as OrderPreservingPartitioner
  does.
  * for now, since IPartitioner<T> does not support different types for
  token and key, the BigInteger is simply converted to a string

 Then, I played around with cassandra-cli. As expected, in my 3 nodes
 test cluster, get/set worked, but list(get_range_slices) didn't.

 This came from a challenge to overcome a wide row scalability. So, I
 want to make it work!

 I am aware that some efforts are required to make get_range_slices work.
 But are there any other critical problems? For example, it seems there is
 an assumption that token and key are the same. If this is throughout the
 whole C* code, this partitioner is not practical.

 Or have your tried something similar?

 I would appreciate your feedback!

 Thanks,
 Takenori






Re: OrderPreservingPartitioner in 1.2

2013-08-25 Thread Takenori Sato(Cloudian)

From the Jira,

 One possibility is that getToken of OPP can return hex value if it 
fails to encode bytes to UTF-8 instead of throwing error. By this system 
tables seem to be working fine with OPP.


This looks like an option to try for me.

Thanks!

(2013/08/23 20:44), Vara Kumar wrote:
For the first exception: OPP was not working in 1.2. It has been fixed 
but not yet there in latest 1.2.8 version.


Jira issue about it: https://issues.apache.org/jira/browse/CASSANDRA-5793


On Fri, Aug 23, 2013 at 12:51 PM, Takenori Sato ts...@cloudian.com wrote:


Hi,

I know it has been deprecated, but does OrderPreservingPartitioner
still work with 1.2?

Just wanted to know how it works, but I got a couple of exceptions
as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java
(line 175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
at

org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
at

org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
at org.apache.cassandra.db.Table.apply(Table.java:379)
at org.apache.cassandra.db.Table.apply(Table.java:353)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
at

org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
at

org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
at
org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
at

org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
at
org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
at
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
at

org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
at

org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
at
org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
at
org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
at

org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
... 16 more

The key was 0ab68145 in HEX, that contains some control characters.

Another exception is this:

 INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line
891) JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73)
Beginning bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line
430) Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
at

org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
at
org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
at
org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
at

org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
at

org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
at

org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
at

org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
at

org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672
CassandraDaemon.java (line 175) Exception in thread
Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at

org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at

org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
at

org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at

org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513

OrderPreservingPartitioner in 1.2

2013-08-23 Thread Takenori Sato
Hi,

I know it has been deprecated, but does OrderPreservingPartitioner still work
with 1.2?

Just wanted to know how it works, but I got a couple of exceptions as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java (line
175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
at
org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
at
org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
at org.apache.cassandra.db.Table.apply(Table.java:379)
at org.apache.cassandra.db.Table.apply(Table.java:353)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
at
org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
at
org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
at org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
at
org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
at
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
at
org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
... 16 more

The key was 0ab68145 in HEX, that contains some control characters.

Another exception is this:

 INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line 891)
JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73) Beginning
bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line 430)
Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
at
org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
at
org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
at
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672
CassandraDaemon.java (line 175) Exception in thread
Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at
org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at
org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
at
org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at
org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:662)

I tried to set up a 3-node cluster with tokens A, H, and P. This error
was raised by the second node, with token H.

Thanks,
Takenori


Re: Random Distribution, yet Order Preserving Partitioner

2013-08-22 Thread Takenori Sato
Hi Nick,

 token and key are not same. it was like this long time ago (single MD5
assumed single key)

True. That reminds me of making a test with the latest 1.2 instead of our
current 1.0!

 if you want ordered, you probably can arrange your data in a way so you
can get it in ordered fashion.

Yeah, we have done for a long time. That's called a wide row, right? Or a
compound primary key.

It can handle some millions of columns, but not many more, say 10M. I mean, a
request for such a row concentrates on a particular node, so the
performance degrades.

 I also had idea for semi-ordered partitioner - instead of single MD5, to
have two MD5's.

Sounds interesting. But, we need a fully ordered result.

Anyway, I will try with the latest version.

Thanks,
Takenori


On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov n...@nmmm.nu wrote:

 my five cents -
 token and key are not same. it was like this long time ago (single MD5
 assumed single key)

 if you want ordered, you probably can arrange your data in a way so you
 can get it in ordered fashion.
 for example long ago, i had single column family with single key and about
 2-3 M columns - I do not suggest you to do it this way, because is wrong
 way, but it is easy to understand the idea.

 I also had idea for semi-ordered partitioner - instead of single MD5, to
 have two MD5's.
 then you can get semi-ordered ranges, e.g. you get ordered all cities in
 Canada, all cities in US and so on.
 however in this way things may get pretty non-ballanced

 Nick





  On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato ts...@cloudian.com wrote:

 Hi,

 I am trying to implement a custom partitioner that evenly distributes,
 yet preserves order.

 The partitioner returns a token as a BigInteger, as RandomPartitioner does,
 while it returns a decorated key as a string, as OrderPreservingPartitioner
 does.
 * for now, since IPartitioner<T> does not support different types for
 token and key, the BigInteger is simply converted to a string
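
A highly simplified sketch of that idea, for illustration only (it does not
implement the real IPartitioner contract): the token is an MD5 hash of the key,
as RandomPartitioner does, while the decorated part keeps the raw key so that
ordering is preserved.

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class RandomYetOrderedSketch {

    // Token used for placement on the ring: MD5 of the key, like RandomPartitioner.
    public static BigInteger token(ByteBuffer key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] bytes = new byte[key.remaining()];
            key.duplicate().get(bytes);
            return new BigInteger(1, md5.digest(bytes));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available
        }
    }

    // Decorated key used for on-disk ordering: keep the raw key itself, like
    // OrderPreservingPartitioner. (In the post above, the BigInteger token is
    // additionally converted to a string because IPartitioner<T> assumes one type.)
    public static String decoratedKey(String key) {
        return key;
    }
}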

 Then, I played around with cassandra-cli. As expected, in my 3 nodes test
 cluster, get/set worked, but list(get_range_slices) didn't.

 This came from a challenge to overcome a wide row scalability. So, I want
 to make it work!

 I am aware that some efforts are required to make get_range_slices work.
 But are there any other critical problems? For example, it seems there is
 an assumption that token and key are the same. If this is throughout the
 whole C* code, this partitioner is not practical.

 Or have your tried something similar?

 I would appreciate your feedback!

 Thanks,
 Takenori





Fp chance for column level bloom filter

2013-07-17 Thread Takenori Sato
Hi,

I thought memory consumption of the column-level bloom filter would become a
big concern when a row gets very wide, like more than tens of millions of
columns.

But I read in the source (1.0.7) that the false-positive chance for the
column-level bloom filter is hard-coded at 0.160, which is very high. So it
seems not.

Is this correct?
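
For a sense of scale, the standard bloom filter sizing formula (not Cassandra's
exact code) gives bits per element = -ln(p) / (ln 2)^2, so a false-positive
chance of 0.160 costs only about 3.8 bits per column, versus roughly 9.6 bits
at 0.01.

public class BloomFilterBits {
    // Standard bloom filter sizing: bits per element = -ln(p) / (ln 2)^2.
    public static double bitsPerElement(double falsePositiveChance) {
        return -Math.log(falsePositiveChance) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        System.out.println(bitsPerElement(0.160)); // ~3.8 bits per column
        System.out.println(bitsPerElement(0.01));  // ~9.6 bits per column
    }
}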

Thanks,
Takenori


Re: Alternate major compaction

2013-07-12 Thread Takenori Sato
It's light. Without the -v option, you can even run it against just an SSTable
file without needing the whole Cassandra installation.

- Takenori


On Sat, Jul 13, 2013 at 6:18 AM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jul 11, 2013 at 9:43 PM, Takenori Sato ts...@cloudian.com wrote:

 I made the repository public. Now you can checkout from here.

 https://github.com/cloudian/support-tools

 checksstablegarbage is the tool.

 Enjoy, and any feedback is welcome.


 Thanks very much, useful tool!

 Out of curiousity, what does writesstablekeys do that the upstream tool
 sstablekeys does not?

 =Rob



Re: Alternate major compaction

2013-07-11 Thread Takenori Sato
Hi,

I made the repository public. Now you can checkout from here.

https://github.com/cloudian/support-tools

checksstablegarbage is the tool.

Enjoy, and any feedback is welcome.

Thanks,
- Takenori


On Thu, Jul 11, 2013 at 10:12 PM, srmore comom...@gmail.com wrote:

 Thanks Takenori,
 Looks like the tool provides some good info that people can use. It would
 be great if you can share it with the community.




 On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato ts...@cloudian.com wrote:

 Hi,

 I think it is a common headache for users running a large Cassandra
 cluster in production.


 Running a major compaction is not the only cause; there are more. For example,
 I see two typical scenarios.

 1. backup use case
 2. active wide row

 In case 1, say, a piece of data is removed a year later. This means the
 tombstone on the row is one year away from the original row. To remove an
 expired row entirely, a compaction set has to include all of its fragments.
 So, when are the original, one-year-old row and the tombstoned row included
 in the same compaction set? It is likely to take one year.

 In case 2, such an active wide row exists in most of the SSTable files, and
 it typically contains many expired columns. But none of them would be removed
 entirely, because a compaction set practically never includes all of the row
 fragments.


 By the way, there is a very convenient MBean API available:
 CompactionManager's forceUserDefinedCompaction. You can invoke a minor
 compaction on a file set you define. So the question is how to find an
 optimal set of SSTable files.
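
 For reference, a minimal sketch of invoking that MBean over JMX. The
 operation's signature differs across versions (1.x took a keyspace plus a
 comma-separated list of data files; later versions take only the file list),
 so treat the invoke() arguments as an assumption to adjust, and the host,
 keyspace and file names as illustrative.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // 1.x-style two-argument form: (keyspace, comma-separated SSTable data files)
            mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                    new Object[] { "UserData", "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db" },
                    new String[] { "java.lang.String", "java.lang.String" });
        } finally {
            connector.close();
        }
    }
}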

 Then, I wrote a tool that checks garbage and prints out some useful
 information to find such an optimal set.

 Here's a simple log output.

 # /opt/cassandra/bin/checksstablegarbage -e 
 /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
 [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
 300(1373504071)]
 ===
 ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
 REMAINNING_SSTABLE_FILES
 ===
 hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
 ---
 TOTAL, 40, 40
 ===

 REMAINNING_SSTABLE_FILES means any other sstable files that contain the
 respective row. So, the following is an optimal set.

 # /opt/cassandra/bin/checksstablegarbage -e 
 /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db 
 /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
 [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
 300(1373504131)]
 ===
 ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
 REMAINNING_SSTABLE_FILES
 ===
 hello5/100.txt.1373502926003, 223, 0, YES, YES
 ---
 TOTAL, 223, 0
 ===

 This tool relies on SSTableReader and an aggregation iterator, as Cassandra
 does in compaction. I was considering sharing this with the community, so let
 me know if anyone is interested.

 Ah, note that it is based on 1.0.7. So I will need to check and update
 for newer versions.

 Thanks,
 Takenori


 On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez 
 tomas.nu...@groupalia.com wrote:

 Hi

 About a year ago, we did a major compaction in our cassandra cluster (a
 n00b mistake, I know), and since then we've had huge sstables that never
 get compacted, and we were condemned to repeat the major compaction process
 every once in a while (we are using SizeTieredCompaction strategy, and
 we've not avaluated yet LeveledCompaction, because it has its downsides,
 and we've had no time to test all of them in our environment).

 I was trying to find a way to solve this situation (that is, do
 something like a major compaction that writes small sstables, not huge as
 major compaction does), and I couldn't find it in the documentation. I
 tried cleanup and scrub/upgradesstables, but they don't do that (as
 documentation states). Then I tried deleting all data in a node and then
 bootstrapping it (or nodetool rebuild-ing it), hoping that this way the
 sstables would get cleaned from deleted records and updates. But the
 deleted node just copied the sstables from another node as they were,
 cleaning nothing.

 So I tried a new approach: I switched the sstable compaction strategy
 (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch,
 and then switching it back (Leveled to SizeTiered). It took a while (but so
 do the major compaction process) and it worked, I have smaller sstables

Re: Reduce Cassandra GC

2013-06-19 Thread Takenori Sato
GC options are not set. You should see the following:

 -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure
-Xloggc:/var/log/cassandra/gc-1371603607.log

 Is it normal to have two processes like this?

No. You are running two processes.


On Wed, Jun 19, 2013 at 4:16 PM, Joel Samuelsson
samuelsson.j...@gmail.com wrote:

 My Cassandra ps info:

 root 26791 1  0 07:14 ?00:00:00 /usr/bin/jsvc -user
 cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile
 /var/run/cassandra.pid -errfile 1 -outfile /var/log/cassandra/output.log
 -cp
 /usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar
 -Dlog4j.configuration=log4j-server.properties
 -Dlog4j.defaultInitOverride=true
 -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof
 -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea
 -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
 -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M
 -XX:+HeapDumpOnOutOfMemoryError -Xss180k -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
 -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 org.apache.cassandra.service.CassandraDaemon
 103  26792 26791 99 07:14 ?854015-22:02:22 /usr/bin/jsvc -user
 cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile
 /var/run/cassandra.pid -errfile 1 -outfile /var/log/cassandra/output.log
 -cp
 /usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar
 -Dlog4j.configuration=log4j-server.properties
 -Dlog4j.defaultInitOverride=true
 -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof
 -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea
 -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
 -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M
 

Re: Reduce Cassandra GC

2013-06-18 Thread Takenori Sato
 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
 116) testing_Keyspace.cf19 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
 116) testing_Keyspace.cf20 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
 116) testing_Keyspace.cf21 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
 116) testing_Keyspace.cf22 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
 116) OpsCenter.rollups7200 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
 116) OpsCenter.rollups86400 0,0
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
 116) OpsCenter.rollups60 13745,3109686
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
 116) OpsCenter.events   18,826
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,518 StatusLogger.java (line
 116) OpsCenter.rollups300  2516,570931
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line
 116) OpsCenter.pdps 9072,160850
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line
 116) OpsCenter.events_timeline 3,86
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,520 StatusLogger.java (line
 116) OpsCenter.settings 0,0

 And from gc-1371454124.log I get:
 2013-06-17T08:11:22.300+: 2551.288: [GC 870971K->216494K(4018176K),
 145.1887460 secs]


 2013/6/18 Takenori Sato ts...@cloudian.com

 Find promotion failure. Bingo if it happened at the time.

 Otherwise, post the relevant portion of the log here. Someone may find a
 hint.


 On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson 
 samuelsson.j...@gmail.com wrote:

 Just got a very long GC again. What am I to look for in the logging I
 just enabled?


 2013/6/17 Joel Samuelsson samuelsson.j...@gmail.com

  If you are talking about 1.2.x then I also have memory problems on
 the idle cluster: java memory constantly slow grows up to limit, then spend
 long time for GC. I never seen such behaviour for  1.0.x and 1.1.x, where
 on idle cluster java memory stay on the same value.

 No I am running Cassandra 1.1.8.

  Can you paste you gc config?

 I believe the relevant configs are these:
 # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

 I haven't changed anything in the environment config up until now.

  Also can you take a heap dump at 2 diff points so that we can
 compare it?

 I can't access the machine at all during the stop-the-world freezes.
 Was that what you wanted me to try?

  Uncomment the followings in cassandra-env.sh.
 Done. Will post results as soon as I get a new stop-the-world gc.

  If you are unable to find a JIRA, file one

 Unless this turns out to be a problem on my end, I will.







Re: Reduce Cassandra GC

2013-06-17 Thread Takenori Sato
Find promotion failure. Bingo if it happened at the time.

Otherwise, post the relevant portion of the log here. Someone may find a
hint.


On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson
samuelsson.j...@gmail.comwrote:

 Just got a very long GC again. What am I to look for in the logging I just
 enabled?


 2013/6/17 Joel Samuelsson samuelsson.j...@gmail.com

  If you are talking about 1.2.x then I also have memory problems on the
 idle cluster: java memory constantly slow grows up to limit, then spend
 long time for GC. I never seen such behaviour for  1.0.x and 1.1.x, where
 on idle cluster java memory stay on the same value.

 No I am running Cassandra 1.1.8.

  Can you paste you gc config?

 I believe the relevant configs are these:
 # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

 I haven't changed anything in the environment config up until now.

  Also can you take a heap dump at 2 diff points so that we can compare
 it?

 I can't access the machine at all during the stop-the-world freezes. Was
 that what you wanted me to try?

  Uncomment the followings in cassandra-env.sh.
 Done. Will post results as soon as I get a new stop-the-world gc.

  If you are unable to find a JIRA, file one

 Unless this turns out to be a problem on my end, I will.





Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
 INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
1046937600

This says a GC of the New Generation took that long, which is usually
unlikely.

The only situation I am aware of is when a fairly large object is created
that cannot be promoted to the Old Generation, because it needs such a large
*contiguous* block of memory that none is available at that point in time.
This is called a promotion failure. The allocation has to wait until the
concurrent collector frees a large enough space, so you experience a
stop-the-world pause (strictly, it only stops the new world).

For example, in Cassandra a large value of in_memory_compaction_limit_in_mb
can cause this: it is the threshold below which a compaction merges all
fragments of a row in memory, so it can allocate a byte array up to that
size.

You can confirm this by enabling promotion-failure GC logging going forward,
and by checking which compactions ran at that point in time.
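
For example, once that logging is in place, something along these lines (the
log locations match the defaults used elsewhere in this thread; adjust them
to your install):

# Promotion failures recorded by the GC log.
grep -i -n "promotion fail" /var/log/cassandra/gc-*.log

# Compactions running around the same wall-clock time.
grep -n "Compact" /var/log/cassandra/system.log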



On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote:
  If you are talking about 1.2.x then I also have memory problems on the
 idle
  cluster: java memory constantly slow grows up to limit, then spend long
 time
  for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
  cluster java memory stay on the same value.

 If you are not aware of a pre-existing JIRA, I strongly encourage you to :

 1) Document your experience of this.
 2) Search issues.apache.org for anything that sounds similar.
 3) If you are unable to find a JIRA, file one.

 Thanks!

 =Rob



Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
Uncomment the following in cassandra-env.sh.

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

 Also can you take a heap dump at 2 diff points so that we can compare it?

No, I'm afraid not. I ordinarily use profiling tools, but I am not aware of
any that could respond during this event.



On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Can you paste you gc config? Also can you take a heap dump at 2 diff
 points so that we can compare it?

 Quick thing to do would be to do a histo live at 2 points and compare

 Sent from my iPhone

 On Jun 15, 2013, at 6:57 AM, Takenori Sato ts...@cloudian.com wrote:

  INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
 1046937600

 This says GC for New Generation took so long. And this is usually
 unlikely.

 The only situation I am aware of is when a fairly large object is created,
 and which can not be promoted to Old Generation because it requires such a
 large *contiguous* memory space that is unavailable at the point in time.
 This is called promotion failure. So it has to wait until concurrent
 collector collects a large enough space. Thus you experience stop the
 world. But I think it is not stop the world, but only stop the new world.

 For example in case of Cassandra, a large number of
 in_memory_compaction_limit_in_mb can cause this. This is a limit when a
 compaction compacts(merges) rows of a key into the latest in memory. So
 this creates a large byte array up to the number.

 You can confirm this by enabling promotion failure GC logging in the
 future, and by checking compactions executed at that point in time.



 On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.comwrote:

 On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote:
  If you are talking about 1.2.x then I also have memory problems on the
 idle
  cluster: java memory constantly slow grows up to limit, then spend long
 time
  for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
  cluster java memory stay on the same value.

 If you are not aware of a pre-existing JIRA, I strongly encourage you to :

 1) Document your experience of this.
 2) Search issues.apache.org for anything that sounds similar.
 3) If you are unable to find a JIRA, file one.

 Thanks!

 =Rob





Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
 Also can you take a heap dump at 2 diff points so that we can compare it?

Also note that a promotion failure is not caused by any particular object,
but by fragmentation of the Old Generation space. So I am not sure a heap
dump comparison will show it.
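
If you still want to try the comparison, a class histogram diff is lighter
than full heap dumps. A minimal sketch (the pid is a placeholder, and note
that -histo:live itself forces a full GC):

jmap -histo:live <cassandra-pid> > histo.before.txt
# ... wait until just after one of the long collections ...
jmap -histo:live <cassandra-pid> > histo.after.txt
diff histo.before.txt histo.after.txt | head -50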


On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Can you paste you gc config? Also can you take a heap dump at 2 diff
 points so that we can compare it?

 Quick thing to do would be to do a histo live at 2 points and compare

 Sent from my iPhone

 On Jun 15, 2013, at 6:57 AM, Takenori Sato ts...@cloudian.com wrote:

  INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
 1046937600

 This says GC for New Generation took so long. And this is usually
 unlikely.

 The only situation I am aware of is when a fairly large object is created,
 and which can not be promoted to Old Generation because it requires such a
 large *contiguous* memory space that is unavailable at the point in time.
 This is called promotion failure. So it has to wait until concurrent
 collector collects a large enough space. Thus you experience stop the
 world. But I think it is not stop the world, but only stop the new world.

 For example in case of Cassandra, a large number of
 in_memory_compaction_limit_in_mb can cause this. This is a limit when a
 compaction compacts(merges) rows of a key into the latest in memory. So
 this creates a large byte array up to the number.

 You can confirm this by enabling promotion failure GC logging in the
 future, and by checking compactions executed at that point in time.



 On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli rc...@eventbrite.comwrote:

 On Fri, Jun 7, 2013 at 12:42 PM, Igor i...@4friends.od.ua wrote:
  If you are talking about 1.2.x then I also have memory problems on the
 idle
  cluster: java memory constantly slow grows up to limit, then spend long
 time
  for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
  cluster java memory stay on the same value.

 If you are not aware of a pre-existing JIRA, I strongly encourage you to :

 1) Document your experience of this.
 2) Search issues.apache.org for anything that sounds similar.
 3) If you are unable to find a JIRA, file one.

 Thanks!

 =Rob





Re: Cleanup understanding

2013-05-29 Thread Takenori Sato
 But, that is still awkward. Does cleanup take so much disk space to
complete the compaction operation? In other words, twice the size?

Not really, but logically yes.

According to the 1.0.7 source, cleanup first checks that there is enough free
space for the worst-case output size, computed as below. If not, the
exception you got is thrown.

/*
 * Add up all the files sizes this is the worst case file
 * size for compaction of all the list of files given.
 */
public long getExpectedCompactedFileSize(Iterable<SSTableReader> sstables)
{
    long expectedFileSize = 0;
    for (SSTableReader sstable : sstables)
    {
        long size = sstable.onDiskLength();
        expectedFileSize = expectedFileSize + size;
    }
    return expectedFileSize;
}
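
So a quick sanity check before running cleanup is to compare the live sstable
sizes against the free space on the data volume, for example (the data path
is the packaged default; adjust it to your layout):

nodetool -h 127.0.0.1 cfstats | grep -E "Column Family:|Space used \(live\)"
df -h /var/lib/cassandra/data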


On Wed, May 29, 2013 at 10:43 PM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

 Thanks for the answers.

 I got it. I was using cleanup, because I thought it would delete the
 tombstones.
 But, that is still awkward. Does cleanup take so much disk space to
 complete the compaction operation? In other words, twice the size?


 *Atenciosamente,*
 *Víctor Hugo Molinar - *@vhmolinar http://twitter.com/#!/vhmolinar


 On Tue, May 28, 2013 at 9:55 PM, Takenori Sato(Cloudian) 
 ts...@cloudian.com wrote:

  Hi Victor,

 As Andrey said, running cleanup doesn't work as you expect.


  The reason I need to clean things is that I wont need most of my
 inserted data on the next day.

 Deleted objects(columns/records) become deletable from sstable file when
 they get expired(after gc_grace_seconds).

 Such deletable objects are actually gotten rid of by compaction.

 The tricky part is that a deletable object remains unless all of its old
 objects(the same row key) are contained in the set of sstable files
 involved in the compaction.

 - Takenori


 (2013/05/29 3:01), Andrey Ilinykh wrote:

 cleanup removes data which doesn't belong to the current node. You have
 to run it only if you move (or add new) nodes. In your case there is no any
 reason to do it.


 On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 Hello everyone.
 I have a daily maintenance task at c* which does:

 -truncate cfs
 -clearsnapshots
 -repair
 -cleanup

 The reason I need to clean things is that I wont need most of my
 inserted data on the next day. It's kind a business requirement.

 Well,  the problem I'm running to, is the misunderstanding about cleanup
 operation.
 I have 2 nodes with lower than half usage of disk, which is moreless
 13GB;

 But, the last few days, arbitrarily each node have reported me a cleanup
 error indicating that the disk was full. Which is not true.

 *Error occured during cleanup*
 *java.util.concurrent.ExecutionException: java.io.IOException: disk full
 *


  So I'd like to know more about what does happens in a cleanup
 operation.
 Appreciate any help.







Re: Cleanup understanding

2013-05-28 Thread Takenori Sato(Cloudian)

Hi Victor,

As Andrey said, running cleanup doesn't work as you expect.

 The reason I need to clean things is that I wont need most of my 
inserted data on the next day.


Deleted objects (columns/records) only become purgeable from an sstable file
once they expire (after gc_grace_seconds).


Such purgeable objects are actually removed by compaction.

The tricky part is that a purgeable object still remains unless all of the
older fragments of the same row key are contained in the set of sstable files
involved in the compaction.
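
A blunt but direct way to get every fragment of a row into the same
compaction set is a major compaction, and for throwaway data a shorter
gc_grace_seconds lets tombstones expire sooner. A sketch, with placeholder
keyspace/CF names (shrinking gc_grace is only safe if repair runs, or no
deletes can be missed, within that window):

# Major compaction merges all sstables of the CF, so expired tombstones can go.
nodetool -h 127.0.0.1 compact MyKeyspace MyCF

# Inside cassandra-cli, a shorter grace period:
#   use MyKeyspace;
#   update column family MyCF with gc_grace = 3600;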


- Takenori

(2013/05/29 3:01), Andrey Ilinykh wrote:
cleanup removes data which doesn't belong to the current node. You 
have to run it only if you move (or add new) nodes. In your case there 
is no any reason to do it.



On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com mailto:vhmoli...@gmail.com wrote:


Hello everyone.
I have a daily maintenance task at c* which does:

-truncate cfs
-clearsnapshots
-repair
-cleanup

The reason I need to clean things is that I wont need most of my
inserted data on the next day. It's kind a business requirement.

Well,  the problem I'm running to, is the misunderstanding about
cleanup operation.
I have 2 nodes with lower than half usage of disk, which is
moreless 13GB;

But, the last few days, arbitrarily each node have reported me a
cleanup error indicating that the disk was full. Which is not true.

/Error occured during cleanup/
/java.util.concurrent.ExecutionException: java.io.IOException:
disk full/


So I'd like to know more about what does happens in a cleanup
operation.
Appreciate any help.






Re: CPU hotspot at BloomFilterSerializer#deserialize

2013-02-05 Thread Takenori Sato(Cloudian)

Hi,

We found that this issue is specific to 1.0.1 through 1.0.8; it was fixed in
1.0.9.


https://issues.apache.org/jira/browse/CASSANDRA-4023

So by upgrading, we will see reasonable performance no matter how large a row
we have!


Thanks,
Takenori

(2013/02/05 2:29), aaron morton wrote:
Yes, it contains a big row that goes up to 2GB with more than a 
million of columns.
I've run tests with 10 million small columns and reasonable 
performance. I've not looked at 1 million large columns.


- BloomFilterSerializer#deserialize does readLong iteratively at 
each page

of size 4K for a given row, which means it could be 500,000 loops(calls
readLong) for a 2G row(from 1.0.7 source).
There is only one Bloom filter per row in an SSTable, not one per 
column index/page.


It could take a while if there are a lot of sstables in the read.

nodetool cfhistograms will let you know: run it once to reset the counts,
then do your test, then run it again.
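
For example, something like the following (keyspace and column family names
are placeholders):

nodetool -h 127.0.0.1 cfhistograms MyKeyspace MyCF   # first run resets the counts
# ... run the test queries ...
nodetool -h 127.0.0.1 cfhistograms MyKeyspace MyCF   # SSTables per read, latencies, row sizes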


Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/02/2013, at 4:13 AM, Edward Capriolo edlinuxg...@gmail.com 
mailto:edlinuxg...@gmail.com wrote:



It is interesting the press c* got about having 2 billion columns in a
row. You *can* do it but it brings to light some realities of what
that means.

On Sun, Feb 3, 2013 at 8:09 AM, Takenori Sato ts...@cloudian.com 
mailto:ts...@cloudian.com wrote:

Hi Aaron,

Thanks for your answers. That helped me get a big picture.

Yes, it contains a big row that goes up to 2GB with more than a 
million of

columns.

Let me confirm if I correctly understand.

- The stack trace is from Slice By Names query. And the 
deserialization is

at the step 3, Read the row level Bloom Filter, on your blog.

- BloomFilterSerializer#deserialize does readLong iteratively at 
each page

of size 4K for a given row, which means it could be 500,000 loops(calls
readLong) for a 2G row(from 1.0.7 source).

Correct?

That makes sense Slice By Names queries against such a wide row 
could be CPU

bottleneck. In fact, in our test environment, a
BloomFilterSerializer#deserialize of such a case takes more than 
10ms, up to

100ms.


Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.


Interesting. A query pattern could make a difference?

We thought the only solutions is to change the data structure(don't 
use such

a wide row if it is retrieved by Slice By Names query).

Anyway, will give it a try!

Best,
Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton 
aa...@thelastpickle.com mailto:aa...@thelastpickle.com

wrote:


5. the problematic Data file contains only 5 to 10 keys data but
large(2.4G)

So very large rows ?
What does nodetool cfstats or cfhistograms say about the row sizes ?


1. what is happening?

I think this is partially large rows and partially the query 
pattern, this

is only by roughly correct
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my 
talk here

http://www.datastax.com/events/cassandrasummit2012/presentations

3. any more info required to proceed?

Do some tests with different query techniques…

Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote:

Hi all,

We have a situation that CPU loads on some of our nodes in a 
cluster has
spiked occasionally since the last November, which is triggered by 
requests

for rows that reside on two specific sstables.

We confirmed the followings(when spiked):

version: 1.0.7(current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0

1. a profiling showed that BloomFilterSerializer#deserialize was the
hotspot(70% of the total load by running threads)

* the stack trace looked like this(simplified)
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - 
org.apache.cassandra.db.CollationController.collectTimeOrderedData

...
89.5% - 
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read

...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - 
org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize

66.7% - java.io.DataInputStream.readLong

2. Usually, 1 should be so fast that a profiling by sampling can not
detect

3. no pressure on Cassandra's VM heap nor on machine in overal

4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by 
iostat

1 1000)

5. the problematic Data file contains only 5 to 10 keys data but
large(2.4G)

6. the problematic Filter file size is only 256B(could be normal)


So now, I am trying to read the Filter file in the same way
BloomFilterSerializer#deserialize does as possible as I can

Re: CPU hotspot at BloomFilterSerializer#deserialize

2013-02-03 Thread Takenori Sato
Hi Aaron,

Thanks for your answers. That helped me get a big picture.

Yes, it contains a big row that goes up to 2GB with more than a million of
columns.

Let me confirm if I correctly understand.

- The stack trace is from a Slice By Names query, and the deserialization
happens at step 3, Read the row level Bloom Filter, on your blog.

- BloomFilterSerializer#deserialize calls readLong iteratively for each page
of size 4K for a given row, which means it could be 500,000 loops (calls to
readLong) for a 2G row (from the 1.0.7 source).

Correct?

That makes sense: Slice By Names queries against such a wide row could be a
CPU bottleneck. In fact, in our test environment, a single
BloomFilterSerializer#deserialize in such a case takes more than 10ms, up
to 100ms.

 Get a single named column.
 Get the first 10 columns using the natural column order.
 Get the last 10 columns using the reversed order.

Interesting. Could the query pattern make a difference?

We thought the only solution was to change the data structure (avoid such a
wide row if it is retrieved by Slice By Names queries).

Anyway, will give it a try!

Best,
Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton aa...@thelastpickle.comwrote:

 5. the problematic Data file contains only 5 to 10 keys data but
 large(2.4G)

 So very large rows ?
 What does nodetool cfstats or cfhistograms say about the row sizes ?


 1. what is happening?

 I think this is partially large rows and partially the query pattern, this
 is only by roughly correct
 http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk
 here http://www.datastax.com/events/cassandrasummit2012/presentations

 3. any more info required to proceed?

 Do some tests with different query techniques…

 Get a single named column.
 Get the first 10 columns using the natural column order.
 Get the last 10 columns using the reversed order.

 Hope that helps.

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 31/01/2013, at 7:20 PM, Takenori Sato ts...@cloudian.com wrote:

 Hi all,

 We have a situation that CPU loads on some of our nodes in a cluster has
 spiked occasionally since the last November, which is triggered by requests
 for rows that reside on two specific sstables.

 We confirmed the followings(when spiked):

 version: 1.0.7(current) - 0.8.6 - 0.8.5 - 0.7.8
 jdk: Oracle 1.6.0

 1. a profiling showed that BloomFilterSerializer#deserialize was the
 hotspot(70% of the total load by running threads)

 * the stack trace looked like this(simplified)
 90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
 ...
 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
 ...
 89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
 ...
 79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
 68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
 66.7% - java.io.DataInputStream.readLong

 2. Usually, 1 should be so fast that a profiling by sampling can not detect

 3. no pressure on Cassandra's VM heap nor on machine in overal

 4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by iostat
 1 1000)

 5. the problematic Data file contains only 5 to 10 keys data but
 large(2.4G)

 6. the problematic Filter file size is only 256B(could be normal)


 So now, I am trying to read the Filter file in the same way
 BloomFilterSerializer#deserialize does as possible as I can, in order to
 see if the file is something wrong.

 Could you give me some advise on:

 1. what is happening?
 2. the best way to simulate the BloomFilterSerializer#deserialize
 3. any more info required to proceed?

 Thanks,
 Takenori





CPU hotspot at BloomFilterSerializer#deserialize

2013-01-30 Thread Takenori Sato
Hi all,

We have a situation where the CPU load on some of the nodes in our cluster
has spiked occasionally since last November, triggered by requests for rows
that reside on two specific sstables.

We confirmed the following (when spiked):

version: 1.0.7(current) - 0.8.6 - 0.8.5 - 0.7.8
jdk: Oracle 1.6.0

1. a profiling showed that BloomFilterSerializer#deserialize was the
hotspot(70% of the total load by running threads)

* the stack trace looked like this(simplified)
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
...
89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
66.7% - java.io.DataInputStream.readLong

2. Usually, 1 should be so fast that profiling by sampling cannot detect it

3. no pressure on Cassandra's JVM heap nor on the machine overall

4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by iostat 1
1000)

5. the problematic Data file contains data for only 5 to 10 keys, but is
large (2.4G)

6. the problematic Filter file is only 256B (which could be normal)


So now, I am trying to read the Filter file in the same way
BloomFilterSerializer#deserialize does, as closely as I can, in order to see
if there is something wrong with the file.
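
Here is the kind of minimal reader I have in mind. It assumes the layout I
read from the 1.0.x BloomFilterSerializer (an int hash count, an int word
count, then that many longs for the OpenBitSet), so please treat the format
as an assumption to verify against the source for your exact version:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;

public class FilterFileReader
{
    public static void main(String[] args) throws Exception
    {
        String path = args[0]; // e.g. .../MyCF-hc-1234-Filter.db
        long start = System.nanoTime();
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)));
        try
        {
            int hashCount = in.readInt();   // number of hash functions
            int wordCount = in.readInt();   // number of 64-bit words in the bitset
            long checksum = 0;
            for (int i = 0; i < wordCount; i++)
                checksum ^= in.readLong();  // mimic the per-long reads seen in the profile
            System.out.printf("hashes=%d words=%d checksum=%x took=%.1f ms%n",
                              hashCount, wordCount, checksum,
                              (System.nanoTime() - start) / 1e6);
        }
        finally
        {
            in.close();
        }
    }
}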

Could you give me some advice on:

1. what is happening?
2. the best way to simulate the BloomFilterSerializer#deserialize
3. any more info required to proceed?

Thanks,
Takenori