Re: tuning concurrent_reads param

2014-11-06 Thread Bryan Talbot
On Wed, Nov 5, 2014 at 11:00 PM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 Sorry, I have a late follow-up question.

 In the Cassandra.yaml file the concurrent_read section has the following
 comment:

 What does it mean by "the operations to enqueue low enough in the stack
 that the OS and drives can reorder them"? How does it help keep the
 system healthy?


The operating system, disk controllers, and disks themselves can merge and
reorder requests to optimize performance.

Here's a relevant page with some details if you're interested in more
http://www.makelinux.net/books/lkd2/ch13lev1sec5
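If you're curious how much reordering room the block layer actually has on
your nodes, something like this shows it (just a sketch; /dev/sda is a
placeholder for whichever device holds your data directory):

  # which I/O scheduler (elevator) is active for the data disk
  cat /sys/block/sda/queue/scheduler
  # how many requests the kernel will queue up for merging and reordering
  cat /sys/block/sda/queue/nr_requests

The higher concurrent_reads is, the more of that queue Cassandra can fill at
once.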



 What really happens if we increase it to too high a value? (Maybe it affects
 other read or write operations as it eats up all the disk I/O resources?)



Yes

-Bryan


Re: new data not flushed to sstables

2014-11-03 Thread Bryan Talbot
On Mon, Nov 3, 2014 at 7:44 AM, Sebastian Martinka 
sebastian.marti...@mercateo.com wrote:

  System and Keyspace Information:

 4 Nodes



 CREATE KEYSPACE restore_test WITH replication = {'class': 'SimpleStrategy',
   'replication_factor': '3'};





 I assumed that a flush writes all data to the sstables and that we can use
 them for backup and restore. Did I forget something, or is my understanding
 wrong?



I think you forgot that with N=4 and RF=3, each node will contain
approximately 75% of the data. From a quick eyeball check of the json-dump
you provided, it looks like each partition-key value is contained on 3 nodes
and absent from 1, which is exactly as expected.
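If you want to check that for a specific key, nodetool can print which nodes
hold its replicas (a quick sketch; the table and key names here are just
placeholders for whatever is in your dump):

  # prints the IPs of the replica nodes for this partition key
  nodetool -h localhost getendpoints restore_test my_table some_key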

-Bryan


Re: OldGen saturation

2014-10-28 Thread Bryan Talbot
On Tue, Oct 28, 2014 at 9:02 AM, Adria Arcarons 
adria.arcar...@greenpowermonitor.com wrote:

  Hi,

 Hi





 We have about 50.000 CFs of varying size







 The writing test consists of a continuous flow of inserts. The inserts are
 done inside BATCH statements in groups of 1.000 to a single CF at a time to
 make them faster.






 The problem I’m experiencing is that, eventually, when the script has been
 running for almost 40mins, the heap gets saturated. OldGen gets full and
 then there is an intensive GC activity trying to free OldGen objects, but
 it can only free very little space in each pass. Then GC saturates the CPU.
 Here are the graphs obtained with VisualVM that show this behavior:





 My total heap size is 1GB and the NewGen region is 256MB. The C* node
 has 4GB RAM. Intel Xeon CPU E5520 @



Without looking at your VM graphs, I'm going to go out on a limb here and
say that your host is woefully underpowered to host fifty-thousand column
families and batch writes of one-thousand statements.

A 1 GB Java heap is sometimes acceptable for a unit test or for playing
around, but you can't actually expect it to be adequate for a load test,
can you?

Every CF consumes some permanent heap space for its metadata, so too many
CFs are a bad thing. You probably have ten times more CFs than would be
recommended as an upper limit.
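If you want to watch the pressure while the test runs, a couple of quick
checks along these lines can help (a sketch; the pid file path is an
assumption, adjust for your install):

  # heap used vs. capacity as Cassandra sees it
  nodetool -h localhost info | grep -i heap
  # GC activity of the JVM, sampled every 5 seconds
  jstat -gcutil $(cat /var/run/cassandra.pid) 5000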

-Bryan


Re: Repair taking long time

2014-09-26 Thread Bryan Talbot
With a 4.5 TB table and just 4 nodes, repair will likely take forever for
any version.
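If you can't add nodes right away, one way to at least bound each run is to
repair a single node's primary range for one column family at a time and
watch the validation progress separately (a sketch; the keyspace and CF names
are placeholders):

  # repair only this node's primary range for one CF
  nodetool repair -pr my_keyspace my_big_cf
  # in another shell, watch validation / compaction progress
  nodetool compactionstats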

-Bryan


On Fri, Sep 26, 2014 at 10:40 AM, Jonathan Haddad j...@jonhaddad.com wrote:

 Are you using Cassandra 2.0 and vnodes?  If so, repair takes forever.
 This problem is addressed in 2.1.

 On Fri, Sep 26, 2014 at 9:52 AM, Gene Robichaux
 gene.robich...@match.com wrote:
  I am fairly new to Cassandra. We have a 9 node cluster, 5 in one DC and
 4 in
  another.
 
 
 
  Running a repair on a large column family seems to be moving much slower
  than I expect.
 
 
 
  Looking at nodetool compactionstats, it indicates the Validation phase is
  running and that the total bytes is 4.5T (4505336278756).
 
 
 
  This is a very large CF. The process has been running for 2.5 hours and
 has
  processed 71G (71950433062). That rate is about 28.4 GB per hour. At this
  rate it will take 158 hours, just shy of 1 week.
 
 
 
  Is this reasonable? This is my first large repair and I am wondering if
 this
  is normal for a CF of this size. Seems like a long time to me.
 
 
 
  Is it possible to tune this process to speed it up? Is there something
 in my
  configuration that could be causing this slow performance? I am running
  HDDs, not SSDs in a JBOD configuration.
 
 
 
 
 
 
 
  Gene Robichaux
 
  Manager, Database Operations
 
  Match.com
 
  8300 Douglas Avenue I Suite 800 I Dallas, TX  75225
 
  Phone: 214-576-3273
 
 



 --
 Jon Haddad
 http://www.rustyrazorblade.com
 twitter: rustyrazorblade



Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-19 Thread Bryan Talbot
I think there are several issues in your schema and queries.

First, the schema can't efficiently return the single newest post for every
author. It can efficiently return the newest N posts for a particular
author.

On Fri, May 16, 2014 at 11:53 PM, 後藤 泰陽 matope@gmail.com wrote:


 But I consider LIMIT to be a keyword that limits the number of rows in the
 WHOLE result set retrieved by the SELECT statement.



This is happening due to the incorrect use of the minTimeuuid() function. All
of your created_at values are equal, so you're essentially getting 2 (order
not defined) values that have the lowest created_at value.

The minTimeuuid() function is meant to be used in the WHERE clause of a
SELECT statement, often together with maxTimeuuid(), to do BETWEEN-style
queries on timeuuid values.
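For example, the kind of slice they're intended for looks something like this
(a sketch against the posts table shown below; the dates are arbitrary):

  select author, dateOf(created_at), entry
    from posts
   where author = 'john'
     and created_at > minTimeuuid('2013-02-01 00:00+0000')
     and created_at < maxTimeuuid('2013-02-28 23:59+0000');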




 The result with SELECT .. LIMIT is below. Unfortunately, this is not what I
 wanted.
 I wanted the latest posts of each author. (Now I doubt whether CQL3 can
 represent it.)

 cqlsh:blog_test> create table posts(
              ... author ascii,
              ... created_at timeuuid,
              ... entry text,
              ... primary key(author,created_at)
              ... ) WITH CLUSTERING ORDER BY (created_at DESC);
 cqlsh:blog_test>
 cqlsh:blog_test> insert into posts(author,created_at,entry) values
   ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by john');
 cqlsh:blog_test> insert into posts(author,created_at,entry) values
   ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by john');
 cqlsh:blog_test> insert into posts(author,created_at,entry) values
   ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by mike');
 cqlsh:blog_test> insert into posts(author,created_at,entry) values
   ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by mike');
 cqlsh:blog_test> select * from posts limit 2;

  author | created_at                           | entry
 --------+--------------------------------------+------------------------------
    mike | 1c4d9000-83e9-11e2-8080-808080808080 | This is a new entry by mike
    mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by mike






To get the most recent posts by a particular author, you'll need statements
more like this:

cqlsh:test> insert into posts(author,created_at,entry) values
    ('john',now(),'This is an old entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values
    ('john',now(),'This is a new entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values
    ('mike',now(),'This is an old entry by mike');
cqlsh:test> insert into posts(author,created_at,entry) values
    ('mike',now(),'This is a new entry by mike');

and then you can get posts by 'john' ordered by newest to oldest as:

cqlsh:test> select author, created_at, dateOf(created_at), entry from posts
        ... where author = 'john' limit 2;

 author | created_at                           | dateOf(created_at)       | entry
--------+--------------------------------------+--------------------------+------------------------------
   john | 7cb1ac30-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:36-0700 | This is a new entry by john
   john | 74bb6750-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:23-0700 | This is an old entry by john


-Bryan


Re: Best partition type for Cassandra with JBOD

2014-05-19 Thread Bryan Talbot
For XFS, using noatime and nodiratime isn't really useful either.

http://xfs.org/index.php/XFS_FAQ#Q:_Is_using_noatime_or.2Fand_nodiratime_at_mount_time_giving_any_performance_benefits_in_xfs_.28or_not_using_them_performance_decrease.29.3F
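If you're mounting XFS data volumes via fstab anyway, plain defaults are fine;
a sketch (device and mount point are placeholders):

  # /etc/fstab
  /dev/sdb1   /var/lib/cassandra/data   xfs   defaults   0 0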




On Sat, May 17, 2014 at 7:52 AM, James Campbell 
ja...@breachintelligence.com wrote:

  Thanks for the thoughts!
 On May 16, 2014 4:23 PM, Ariel Weisberg ar...@weisberg.ws wrote:
  Hi,

 Recommending nobarrier (mount option barrier=0) when you don't know whether a
 non-volatile cache is in play is probably not the way to go. A non-volatile
 cache will typically ignore write barriers if a given block device is
 configured to cache writes anyway.

 I am also skeptical you will see a boost in performance. Applications that
 want to defer and batch writes won't emit write barriers frequently and
 when they do it's because the data has to be there. Filesystems depend on
 write barriers although it is surprisingly hard to get a reordering that is
 really bad because of the way journals are managed.

 Cassandra uses log structured storage and supports asynchronous periodic
 group commit so it doesn't need to emit write barriers frequently.

 Setting read ahead to zero on an SSD is necessary to get the maximum
 number of random reads, but will also disable prefetching for sequential
 reads. You need a lot less prefetching with an SSD due to the much faster
 response time, but it's still many microseconds.

 Someone with more Cassandra specific knowledge can probably give better
 advice as to when a non-zero read ahead make sense with Cassandra. This is
 something may be workload specific as well.

 Regards,
  Ariel

 On Fri, May 16, 2014, at 01:55 PM, Kevin Burton wrote:

 That and nobarrier… and probably noop for the scheduler if using SSD and
 setting readahead to zero...


  On Fri, May 16, 2014 at 10:29 AM, James Campbell 
 ja...@breachintelligence.com wrote:

  Hi all—



 What partition type is best/most commonly used for a multi-disk JBOD setup
 running Cassandra on CentOS 64bit?



 The datastax production server guidelines recommend XFS for data
 partitions, saying, “Because Cassandra can use almost half your disk space
 for a single file, use XFS when using large disks, particularly if using a
 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and
 essentially unlimited on 64-bit.”



 However, the same document also notes that “Maximum recommended capacity
 for Cassandra 1.2 and later is 3 to 5TB per node,” which makes me think
 16TB file sizes would be irrelevant (especially when not using RAID to
 create a single large volume).  What has been the experience of this group?



 I also noted that the guidelines don’t mention setting noatime and
 nodiratime flags in the fstab for data volumes, but I wonder if that’s a
 common practice.

 James




 --


  Founder/CEO Spinn3r.com
  Location: *San Francisco, CA*
  Skype: *burtonator*
  blog: http://burtonator.wordpress.com
  … or check out my Google+ profile: https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com
  War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.





-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: Index with same Name but different keyspace

2014-05-19 Thread Bryan Talbot
On Mon, May 19, 2014 at 6:39 AM, mahesh rajamani
rajamani.mah...@gmail.com wrote:

 Sorry, I just realized the table names in the 2 schemas are slightly different,
 but I am still not sure why I should not use the same index name across
 different schemas. Below are the instructions to reproduce.


 Created 2 keyspace using cassandra-cli


 [default@unknown] create keyspace keyspace1 with placement_strategy =
 'org.apache.cassandra.locator.SimpleStrategy' and
 strategy_options={replication_factor:1};

 [default@unknown] create keyspace keyspace2 with placement_strategy =
 'org.apache.cassandra.locator.SimpleStrategy' and
 strategy_options={replication_factor:1};


 Create table index using cqlsh as below:


 cqlsh> use keyspace1;
 cqlsh:keyspace1> CREATE TABLE table1 (version text, flag boolean, primary key (version));
 cqlsh:keyspace1> create index sversionindex on table1(flag);
 cqlsh:keyspace1> use keyspace2;
 cqlsh:keyspace2> CREATE TABLE table2 (version text, flag boolean, primary key (version));
 cqlsh:keyspace2> create index sversionindex on table2(flag);
 Bad Request: Duplicate index name sversionindex



Since the index name is optional in the create index statement, you could just
omit it and let the system generate a unique name for you.
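For example, with the same table as above:

  cqlsh:keyspace2> create index on table2(flag);

Cassandra will generate a name for it automatically (something like
table2_flag_idx).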

-Bryan


Failed to mkdirs $HOME/.cassandra

2014-05-15 Thread Bryan Talbot
How should the nodetool command be run as the user nobody?

The nodetool command fails with an exception if it cannot create a
.cassandra directory in the current user's home directory.

I'd like to schedule some nodetool commands to run with least privilege as
cron jobs. I'd like to run them as the nobody user -- which typically has
/ as the home directory -- since that's what the user is typically used
for (minimum privileges).

None of the methods described in this JIRA actually seem to work (with
2.0.7 anyway) https://issues.apache.org/jira/browse/CASSANDRA-6475

Testing as a normal user with no write permissions to the home directory
(to simulate the nobody user)

[vagrant@local-dev ~]$ nodetool version
ReleaseVersion: 2.0.7
[vagrant@local-dev ~]$ rm -rf .cassandra/
[vagrant@local-dev ~]$ chmod a-w .

[vagrant@local-dev ~]$ nodetool flush my_ks my_cf
Exception in thread main FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ HOME=/tmp nodetool flush my_ks my_cf
Exception in thread main FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ env HOME=/tmp nodetool flush my_ks my_cf
Exception in thread main FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ env user.home=/tmp nodetool flush my_ks my_cf
Exception in thread main FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ nodetool -Duser.home=/tmp flush my_ks my_cf
Unrecognized option: -Duser.home=/tmp
usage: java org.apache.cassandra.tools.NodeCmd --host arg command
...


Re: Cassandra 2.0.7 always failes due to 'too may open files' error

2014-05-05 Thread Bryan Talbot
Running

# cat /proc/$(cat /var/run/cassandra.pid)/limits

as root or your cassandra user will tell you what limits it's actually
running with.
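If they turn out to be too low, raising nofile for the user that runs
Cassandra is the usual fix; a sketch (it assumes the user is "cassandra" --
in your ps output it's root -- and note limits.conf only applies to PAM
sessions, so a daemon started from an init script may need ulimit -n set
there instead):

  # /etc/security/limits.conf
  cassandra  soft  nofile  100000
  cassandra  hard  nofile  100000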




On Sun, May 4, 2014 at 10:12 PM, Yatong Zhang bluefl...@gmail.com wrote:

 I was running 'repair' when the error occurred. And just a few days before,
 I had changed the compaction strategy to 'leveled'. Don't know if this helps.


 On Mon, May 5, 2014 at 1:10 PM, Yatong Zhang bluefl...@gmail.com wrote:

 Cassandra is running as root

 [root@storage5 ~]# ps aux | grep java
 root  1893 42.0 24.0 7630664 3904000 ? Sl   10:43  60:01 java
 -ea -javaagent:/mydb/cassandra/bin/../lib/jamm-0.2.5.jar
 -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
 -XX:ThreadPriorityPolicy=42 -Xms3959M -Xmx3959M -Xmn400M
 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103
 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+UseTLAB -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true
 -Dcom.sun.management.jmxremote.port=7199
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Dlog4j.configuration=log4j-server.properties
 -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/cassandra.pid
 -cp
 /mydb/cassandra/bin/../conf:/mydb/cassandra/bin/../build/classes/main:/mydb/cassandra/bin/../build/classes/thrift:/mydb/cassandra/bin/../lib/antlr-3.2.jar:/mydb/cassandra/bin/../lib/apache-cassandra-2.0.7.jar:/mydb/cassandra/bin/../lib/apache-cassandra-clientutil-2.0.7.jar:/mydb/cassandra/bin/../lib/apache-cassandra-thrift-2.0.7.jar:/mydb/cassandra/bin/../lib/commons-cli-1.1.jar:/mydb/cassandra/bin/../lib/commons-codec-1.2.jar:/mydb/cassandra/bin/../lib/commons-lang3-3.1.jar:/mydb/cassandra/bin/../lib/compress-lzf-0.8.4.jar:/mydb/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.3.jar:/mydb/cassandra/bin/../lib/disruptor-3.0.1.jar:/mydb/cassandra/bin/../lib/guava-15.0.jar:/mydb/cassandra/bin/../lib/high-scale-lib-1.1.2.jar:/mydb/cassandra/bin/../lib/jackson-core-asl-1.9.2.jar:/mydb/cassandra/bin/../lib/jackson-mapper-asl-1.9.2.jar:/mydb/cassandra/bin/../lib/jamm-0.2.5.jar:/mydb/cassandra/bin/../lib/jbcrypt-0.3m.jar:/mydb/cassandra/bin/../lib/jline-1.0.jar:/mydb/cassandra/bin/../lib/json-simple-1.1.jar:/mydb/cassandra/bin/../lib/libthrift-0.9.1.jar:/mydb/cassandra/bin/../lib/log4j-1.2.16.jar:/mydb/cassandra/bin/../lib/lz4-1.2.0.jar:/mydb/cassandra/bin/../lib/metrics-core-2.2.0.jar:/mydb/cassandra/bin/../lib/netty-3.6.6.Final.jar:/mydb/cassandra/bin/../lib/reporter-config-2.1.0.jar:/mydb/cassandra/bin/../lib/servlet-api-2.5-20081211.jar:/mydb/cassandra/bin/../lib/slf4j-api-1.7.2.jar:/mydb/cassandra/bin/../lib/slf4j-log4j12-1.7.2.jar:/mydb/cassandra/bin/../lib/snakeyaml-1.11.jar:/mydb/cassandra/bin/../lib/snappy-java-1.0.5.jar:/mydb/cassandra/bin/../lib/snaptree-0.1.jar:/mydb/cassandra/bin/../lib/super-csv-2.1.0.jar:/mydb/cassandra/bin/../lib/thrift-server-0.3.3.jar
 org.apache.cassandra.service.CassandraDaemon




  On Mon, May 5, 2014 at 1:02 PM, Philip Persad philip.per...@gmail.com wrote:

  Have you tried running ulimit -a as the Cassandra user instead of as
  root? It is possible that you configured a high file limit for root but
  not for the user running the Cassandra process.


  On Sun, May 4, 2014 at 6:07 PM, Yatong Zhang bluefl...@gmail.com wrote:

 [root@storage5 ~]# lsof -n | grep java | wc -l
 5103
 [root@storage5 ~]# lsof | wc -l
 6567


 It's mentioned in previous mail:)


 On Mon, May 5, 2014 at 9:03 AM, nash nas...@gmail.com wrote:

 The lsof command or /proc can tell you how many open files it has. How
 many is it?

 --nash








Re: using cssandra cql with php

2014-03-04 Thread Bryan Talbot
I think the options for using CQL from PHP pretty much don't exist. Those
that do are very old, haven't been updated in months, and don't support
newer CQL features. Also I don't think any of them use the binary protocol
but use thrift instead.

From what I can tell, you'll be stuck using old CQL features from
unmaintained client drivers -- probably better to not be using CQL and PHP
together since mixing them seems pretty bad right now.


-Bryan



On Sun, Jan 12, 2014 at 11:27 PM, Jason Wee peich...@gmail.com wrote:

 Hi,

  The operating system should not matter, right? You just need to download the
  Cassandra client and use it to access a Cassandra node. PHP?
  http://wiki.apache.org/cassandra/ClientOptions perhaps you can package
  the Cassandra PDO driver into an rpm?

 Jason


 On Mon, Jan 13, 2014 at 3:02 PM, Tim Dunphy bluethu...@gmail.com wrote:

 Hey all,

  I'd like to be able to make calls to the cassandra database using PHP.
 I've taken a look around but I've only found solutions out there for Ubuntu
 and other distros. But my environment is CentOS.  Are there any packages
 out there I can install that would allow me to use CQL in my PHP code?

 Thanks
 Tim

 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B





Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread Bryan Talbot
bloom_filter_fp_chance = 0.7 is probably way too large to be effective and
you'll probably have issues compacting deleted rows and get poor read
performance with a value that high.  I'd guess that anything larger than
0.1 might as well be 1.0.
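If you're on CQL3, dropping it back down is a one-liner (the table name is a
placeholder), but remember the new value only applies to SSTables written
afterwards, so something like nodetool upgradesstables or scrub is needed to
rewrite the existing ones:

  ALTER TABLE my_cf WITH bloom_filter_fp_chance = 0.1;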

-Bryan



On Fri, Jun 21, 2013 at 5:58 AM, srmore comom...@gmail.com wrote:


  On Fri, Jun 21, 2013 at 2:53 AM, aaron morton aa...@thelastpickle.com wrote:

  nodetool -h localhost flush didn't do much good.

 Do you have 100's of millions of rows ?
 If so see recent discussions about reducing the bloom_filter_fp_chance
 and index_sampling.

 Yes, I have 100's of millions of rows.



 If this is an old schema you may be using the very old setting of
 0.000744 which creates a lot of bloom filters.

 The bloom_filter_fp_chance value was changed from the default to 0.1; I looked
 at the filters and they are about 2.5G on disk, and I have around 8G of heap.
 I will try increasing the value to 0.7 and report my results.

 It also appears to be a case of hard GC failure (as Rob mentioned) as the
 heap is never released, even after 24+ hours of idle time, the JVM needs to
 be restarted to reclaim the heap.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 20/06/2013, at 6:36 AM, Wei Zhu wz1...@yahoo.com wrote:

 If you want, you can try to force a GC through JConsole:
 Memory -> Perform GC.

 It theoretically triggers a full GC, but when it actually happens depends on
 the JVM

 -Wei

 --
 *From: *Robert Coli rc...@eventbrite.com
 *To: *user@cassandra.apache.org
 *Sent: *Tuesday, June 18, 2013 10:43:13 AM
 *Subject: *Re: Heap is not released and streaming hangs at 0%

 On Tue, Jun 18, 2013 at 10:33 AM, srmore comom...@gmail.com wrote:
  But then shouldn't the JVM GC it eventually? I can still see Cassandra alive
  and kicking, but it looks like the heap is locked up even after the traffic
  has long stopped.

 No, when GC system fails this hard it is often a permanent failure
 which requires a restart of the JVM.

  nodetool -h localhost flush didn't do much good.

 This adds support to the idea that your heap is too full, and not full
 of memtables.

 You could try nodetool -h localhost invalidatekeycache, but that
 probably will not free enough memory to help you.

 =Rob






Re: Compaction not running

2013-06-18 Thread Bryan Talbot
Manual compaction for LCS doesn't really do much.  It certainly doesn't
compact all those little files into bigger files.  What makes you think
that compactions are not occurring?
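A quick way to check is to see whether anything is actually pending or
running, and how the per-CF SSTable counts move over time (a sketch):

  # pending compaction tasks and anything currently running
  nodetool compactionstats
  # look at the "SSTable count" line for your CF and watch it over time
  nodetool cfstats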

-Bryan



On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter franc.car...@sirca.org.au wrote:

 On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter 
  franc.car...@sirca.org.au wrote:

 On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter franc.car...@sirca.org.au
 wrote:
  We are running a test system with Leveled compaction on
 Cassandra-1.2.4.
  While doing an initial load of the data one of the nodes ran out of
 file
  descriptors and since then it hasn't been automatically compacting.

 You have (at least) two options :

 1) increase file descriptors available to Cassandra with ulimit, if
 possible
 2) increase the size of your sstables with levelled compaction, such
 that you have fewer of them


 Oops, I wasn't clear enough.

 I have increased the number of file descriptors and no longer have a file
 descriptor issue. However the node still doesn't compact automatically. If
 I run a 'nodetool compact' it will do a small amount of compaction and then
 stop. The Column Family is using LCS


 Any ideas on this - compaction is still not automatically running for one
 of my nodes

 thanks



 cheers



 =Rob




 --

 *Franc Carter* | Systems architect | Sirca Ltd
  marc.zianideferra...@sirca.org.au

 franc.car...@sirca.org.au | www.sirca.org.au

 Tel: +61 2 8355 2514

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215





 --

 *Franc Carter* | Systems architect | Sirca Ltd
  marc.zianideferra...@sirca.org.au

 franc.car...@sirca.org.au | www.sirca.org.au

 Tel: +61 2 8355 2514

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215





Re: [Cassandra] Conflict resolution in Cassandra

2013-06-06 Thread Bryan Talbot
For generic questions like this, google is your friend:
http://lmgtfy.com/?q=cassandra+conflict+resolution

-Bryan


On Thu, Jun 6, 2013 at 11:23 AM, Emalayan Vairavanathan 
svemala...@yahoo.com wrote:

 Hi All,

 Can someone tell me about the conflict resolution mechanisms provided by
 Cassandra?

 More specifically, does Cassandra provide a way to define application-specific
 conflict resolution mechanisms (on a per-row / per-column basis)? Or does it
 automatically manage conflicts based on some synchronization algorithm?


 Thank you
 Emalayan





Re: Multiple JBOD data directory

2013-06-05 Thread Bryan Talbot
If you're using cassandra 1.2 then you have a choice specified in the yaml


# policy for data disk failures:
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
#       can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
#              remaining available sstables.  This means you WILL see obsolete
#              data at CL.ONE!
# ignore: ignore fatal errors a


-Bryan



On Wed, Jun 5, 2013 at 6:11 AM, Christopher Wirt chris.w...@struq.com wrote:

 I would hope so. Just trying to get some confirmation from someone with
 production experience. 

 ** **

 Thanks for your reply

 ** **

 *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
 *Sent:* 05 June 2013 13:31
 *To:* user@cassandra.apache.org
 *Subject:* Re: Multiple JBOD data directory

 ** **

  Though I am a newbie, I just had a thought regarding your question 'How
  will it handle requests for data which is unavailable?': wouldn't the data be
  served in that case from other nodes where it has been replicated?

 ** **

 Regards,

 Shahab

 ** **

 On Wed, Jun 5, 2013 at 5:32 AM, Christopher Wirt chris.w...@struq.com
 wrote:

 Hello, 

  

 We’re thinking about using multiple data directories each with its own
 disk and are currently testing this against a RAID0 config.

  

 I’ve seen that there is failure handling with multiple JBOD.

  

 e.g. 

 We have two data directories mounted to separate drives

 /disk1

 /disk2

 One of the drives fails 

  

 Will Cassandra continue to work?

 How will it handle requests for data which unavailable?

 If I want to add an additional drive what is the best way to go about
 redistributing the data? 

  

 Thanks,

 Chris

 ** **



Re: Multiple JBOD data directory

2013-06-05 Thread Bryan Talbot
... sorry, message got cut off


# policy for data disk failures:
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
#       can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
#              remaining available sstables.  This means you WILL see obsolete
#              data at CL.ONE!
# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Cassandra
disk_failure_policy: stop
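For reference, the JBOD part of the config is just a list of directories plus
that policy; a sketch along the lines of your /disk1 and /disk2 layout (the
exact paths are up to you):

  data_file_directories:
      - /disk1/cassandra/data
      - /disk2/cassandra/data
  disk_failure_policy: best_effort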





On Wed, Jun 5, 2013 at 2:59 PM, Bryan Talbot btal...@aeriagames.com wrote:

 If you're using cassandra 1.2 then you have a choice specified in the yaml


 # policy for data disk failures:
 # stop: shut down gossip and Thrift, leaving the node effectively dead, but
 #   can still be inspected via JMX.
 # best_effort: stop using the failed disk and respond to requests based on
 #  remaining available sstables.  This means you WILL see
 obsolete
 #  data at CL.ONE!
 # ignore: ignore fatal errors a


 -Bryan



  On Wed, Jun 5, 2013 at 6:11 AM, Christopher Wirt chris.w...@struq.com wrote:

 I would hope so. Just trying to get some confirmation from someone with
 production experience. 

 ** **

 Thanks for your reply

 ** **

 *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
 *Sent:* 05 June 2013 13:31
 *To:* user@cassandra.apache.org
 *Subject:* Re: Multiple JBOD data directory

 ** **

  Though I am a newbie, I just had a thought regarding your question 'How
  will it handle requests for data which is unavailable?': wouldn't the data be
  served in that case from other nodes where it has been replicated?

 ** **

 Regards,

 Shahab

 ** **

 On Wed, Jun 5, 2013 at 5:32 AM, Christopher Wirt chris.w...@struq.com
 wrote:

 Hello, 

  

 We’re thinking about using multiple data directories each with its own
 disk and are currently testing this against a RAID0 config.

  

 I’ve seen that there is failure handling with multiple JBOD.

  

 e.g. 

 We have two data directories mounted to separate drives

 /disk1

 /disk2

 One of the drives fails 

  

 Will Cassandra continue to work?

 How will it handle requests for data which unavailable?

 If I want to add an additional drive what is the best way to go about
 redistributing the data? 

  

 Thanks,

 Chris

 ** **





Re: Cassandra performance decreases drastically with increase in data size.

2013-05-30 Thread Bryan Talbot
One or more of these might be effective depending on your particular usage

- remove data (rows especially)
- add nodes
- add ram (has limitations)
- reduce bloom filter space used by increasing fp chance
- reduce row and key cache sizes
- increase index sample ratio
- reduce compaction concurrency and throughput
- upgrade to cassandra 1.2 which does some of these things for you
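Before changing anything in the list above, it's worth confirming it really is
GC; the log makes that pretty obvious (the path assumes a packaged install):

  # long pauses are reported by GCInspector, with StatusLogger dumps following
  grep -i gcinspector /var/log/cassandra/system.log | tail -20
  grep -i statuslogger /var/log/cassandra/system.log | tail -20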


-Bryan



On Thu, May 30, 2013 at 2:31 PM, srmore comom...@gmail.com wrote:

 You are right, it looks like I am doing a lot of GC. Is there any
 short-term solution for this other than bumping up the heap ? because, even
 if I increase the heap I will run into the same issue. Only the time before
 I hit OOM will be lengthened.

 It will be while before we go to latest and greatest Cassandra.

 Thanks !



  On Thu, May 30, 2013 at 12:05 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Sounds like you're spending all your time in GC, which you can verify
 by checking what GCInspector and StatusLogger say in the log.

 Fix is increase your heap size or upgrade to 1.2:
 http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2

 On Wed, May 29, 2013 at 11:32 PM, srmore comom...@gmail.com wrote:
  Hello,
  I am observing that my performance is drastically decreasing when my
 data
  size grows. I have a 3 node cluster with 64 GB of ram and my data size
 is
  around 400GB on all the nodes. I also see that when I re-start
 Cassandra the
  performance goes back to normal and then again starts decreasing after
 some
  time.
 
  Some hunting landed me to this page
  http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks
  about the large data sets and explains that it might be because I am
 going
  through multiple layers of OS cache, but does not tell me how to tune
 it.
 
  So, my question is, are there any optimizations that I can do to handle
  these large datatasets ?
 
  and why does my performance go back to normal when I restart Cassandra ?
 
  Thanks !



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder, http://www.datastax.com
 @spyced





Re: data clean up problem

2013-05-28 Thread Bryan Talbot
I think what you're asking for (efficient removal of TTL'd write-once data)
is already in the works but not until 2.0 it seems.

https://issues.apache.org/jira/browse/CASSANDRA-5228

-Bryan



On Tue, May 28, 2013 at 1:26 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Oh and yes, astyanax uses client side response latency and cassandra does
 the same as a client of the other nodes.

 Dean

 On 5/28/13 2:23 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 Actually, we did a huge investigation into this on Astyanax and Cassandra.
  Astyanax, if I remember, worked if configured correctly but Cassandra did
 not, so we patched Cassandra… for some reason Cassandra, once on the
 co-ordinator who had one copy of the data, would wait for both other nodes
 to respond even though we are CL=QUORUM on RF=3… we put in a patch for that
 which my teammate is still supposed to submit.  Cassandra should only wait
 for one node… at least I think that is how I remember it… We have it in our
 backlog to get the patch into Cassandra.
 
 Previously one slow node would slow down our website but this no longer
 happens to us such that when compaction kicks off on a single node, our
 cluster keeps going strong.
 
 Dean
 
 On 5/28/13 2:12 PM, Dwight Smith dwight.sm...@genesyslab.com wrote:
 
 How do you determine the slow node, client side response latency?
 
 -Original Message-
 From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
 Sent: Tuesday, May 28, 2013 1:10 PM
 To: user@cassandra.apache.org
 Subject: Re: data clean up problem
 
 How much disk is used on each node?  We run the suggested < 300G per node, as
 above that compactions can have trouble keeping up.
 
 Ps. We run compactions during peak hours just fine because our client
 reroutes to the 2 of 3 nodes not running compactions based on seeing the
 slow node so performance stays fast.
 
 The easy route is to of course double your cluster and halve the data
 sizes per node so compaction can keep up.
 
 Dean
 
 From: cem cayiro...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, May 28, 2013 1:45 PM
 To: user@cassandra.apache.org
 Subject: Re: data clean up problem
 
 Thanks for the answer.
 
 Sorry for the misunderstanding. I tried to say that I don't send delete
 requests from the client, so it is safe to set gc_grace to 0. TTL is used for
 data clean-up. I am not running a manual compaction. I tried that once,
 but it took a lot of time to finish and I will not have this amount of
 off-peak time in production to run it. I even set the compaction
 throughput to unlimited and it didn't help that much.

 Disk size just keeps on growing, but I know that there is enough space to
 store 1 day of data.

 What do you think about time-range partitioning? Creating a new column
 family for each partition and dropping it when you know that all records are
 expired.
 
 I have 5 nodes.
 
 Cem.
 
 
 
 
 On Tue, May 28, 2013 at 9:37 PM, Hiller, Dean
 dean.hil...@nrel.gov wrote:
 Also, how many nodes are you running?
 
 From: cem cayiro...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, May 28, 2013 1:17 PM
 To: user@cassandra.apache.org
 Subject: Re: data clean up problem
 
 Thanks for the answer but it is already set to 0 since I don't do any
 delete.
 
 Cem
 
 
 On Tue, May 28, 2013 at 9:03 PM, Edward Capriolo
 edlinuxg...@gmail.com wrote:
 You need to change the gc_grace time of the column family. It defaults to
 10 days. By default the tombstones will not go away for 10 days.
 
 
 On Tue, May 28, 2013 at 2:46 PM, cem
 cayiro...@gmail.com wrote:
 Hi Experts,
 
 
 We have general problem about cleaning up data from the disk. I need to
 free the disk space after retention period and the customer wants to
 dimension the disk space base on that.
 
 After running multiple performance tests with TTL of 1 day we saw that
 the compaction couldn't keep up with the request rate. Disks were getting
 full after 3 days. There were also a lot of sstables that are older than
 1 day after 3 days.
 
 Things that we tried:
 
 -Change the compaction strategy to 

Re: In a multiple data center setup, do all the data centers have complete data irrespective of RF?

2013-05-20 Thread Bryan Talbot
Option #3 since it depends on the placement strategy and not the
partitioner.

-Bryan



On Mon, May 20, 2013 at 6:24 AM, Pinak Pani 
nishant.has.a.quest...@gmail.com wrote:

 I just wanted to verify the fact that if I happen to setup a multi
 data-center Cassandra setup, will each data center have the complete
 data-set with it?

 Say, I have two data-center each with two nodes, and a partitioner that
 ranges from 0 to 100. Initial token assigned this way

 DC1:N1 = 00
 DC2:N1 = 25
 DC1:N2 = 50
 DC2:N2 = 75

 where DCX is data center X, NX is node X. *Which one the following
 options is true?*

 *Option #1: *DC1 and DC2, each will hold complete dataset with keys
 bucketed as follows
 DC1:N1 = (50, 00] = 50 keys
 DC1:N2 = (00, 50] = 50 keys
 
 Complete data set mirrored at DC1

 DC2:N1 = (75, 25] = 50 keys
 DC2:N2 = (25, 75] = 50 keys
 
 Complete data set mirrored at DC2

 *Option #2: *DC1 and DC2, each will hold 50% of the data with keys
 bucketed as follows (much the same way in a single C setup)
 DC1:N1 = (75, 00] = 25 keys
 DC2:N1 = (00, 25] = 25 keys
 DC1:N2 = (25, 50] = 25 keys
 DC2:N2 = (50, 75] = 25 keys
 
 data is divided into the two data centers.

 Thanks,
 PP



Re: In a multiple data center setup, do all the data centers have complete data irrespective of RF?

2013-05-20 Thread Bryan Talbot
On Mon, May 20, 2013 at 10:01 AM, Pinak Pani 
nishant.has.a.quest...@gmail.com wrote:

 Assume NetworkTopologyStrategy. So, I wanted to know whether a data-center
 will contain all the keys?

 This is the case:

 CREATE KEYSPACE appKS
   WITH placement_strategy = 'NetworkTopologyStrategy'
   AND strategy_options={DC1:3, DC2:3};

 Does DC1 and DC2 each contain complete database corpus? That is, if DC1
 blows, will I get all the data from DC2? Assume RF = 1.



Your config sample isn't RF=1, it's RF=3.  That's what DC1:3 and DC2:3
mean -- set RF=3 for DC1 and RF=3 for DC2 for all rows of all CFs in this
keyspace.
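If you actually wanted a single copy per data center, the options would look
like this instead (same syntax as your sample):

  CREATE KEYSPACE appKS
    WITH placement_strategy = 'NetworkTopologyStrategy'
    AND strategy_options={DC1:1, DC2:1};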






 Sorry, for the very elementary question. This is the post that made me ask
 this question:
 http://www.onsip.com/blog/2011/07/15/intro-to-cassandra-and-networktopologystrategy

 It says,

 NTS creates an iterator for EACH datacenter and places writes discretely
 for each. The result is that NTS basically breaks each datacenter into its
 own logical ring when it places writes.


A lot of things change in fast moving projects in 2 years, so you'll have
to take anything written 2 years ago with a grain of salt and figure out if
it's still true with whatever version you're using.





 That seems to mean that each data-center behaves as an independent ring
 with initial_token. So, If I have 2 data centers and NTS, I am basically
 mirroring the database. Right?



It depends on how you've configured your placement strategy, but if you're
using DC1:3 and DC2:3 like you have above, then yes, you'd expect to have 3
copies of every row in both data centers for that keyspace.

-Bryan


Re: update does not apply to any replica if consistency = ALL and one replica is down

2013-05-17 Thread Bryan Talbot
I think you're conflating may with must.  That article says that
updates may still be applied to some replicas when there is a failure and
I believe that still is the case.  However, if the coordinator knows that
the CL can't be met before even attempting the write, I don't think it will
attempt the write.

-Bryan



On Fri, May 17, 2013 at 1:48 AM, Sergey Naumov sknau...@gmail.com wrote:

 As described here (
 http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/),
 if consistency level couldn't be met, updates are applied anyway on
 functional replicas, and they could be propagated later to other replicas
 using repair mechanisms or by issuing the same request later, as update
 operations are idempotent in Cassandra.

 But... on my configuration (Cassandra 1.2.4, python CQL 1.0.4, DC1 - 3
 nodes, DC2 - 3 nodes, DC3 - 1 node, RF={DC1:3, DC2:2, DC3:1}, Random
 Partitioner, GossipingPropertyFileSnitch, one node in DC1 is deliberately
 down - and, as RF for DC1 is 3, this down node is a replica node for 100%
 of records),  when I try to insert one record with consistency level of
 ALL, this insert does not appear on any replica (-s30 - is a serial of
 UUID1: 001e--1000--x (30 is 1e in hex), -n1 mean
 that we will insert/update a single record with first id from this series -
 001e--1000--):
 *write with consistency ALL:*
 cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cALL
 Traceback (most recent call last):
   File ./aux/fastinsert.py, line 54, in insert
 curs.execute(cmd, consistency_level=p.conlvl)
 OperationalError: Unable to complete request: one or more nodes were
 unavailable.
 Last record UUID is 001e--1000--*

 *
 about 10 seconds passed...
 *
 read with consistency ONE:*
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cONE
 Total records read: *0*
 Last record UUID is 001e--1000--
 *read with consistency QUORUM:*
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
 Total records read: *0*
 Last record UUID is 001e--1000--
 *write with consistency QUORUM:*
 cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cQUORUM
 Last record UUID is 001e--1000--
 *read with consistency QUORUM:*
 cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
 Total records read: *1*
 Last record UUID is 001e--1000--

 Is it a new feature of Cassandra that it does not perform a write to any
 replica if consistency couldn't be satisfied? If so, then is it true for
 all cases, for example when returned error is OperationalError: Request
 did not complete within rpc_timeout?

 Thanks in advance,
 Sergey Naumov.



Re: SSTable size versus read performance

2013-05-16 Thread Bryan Talbot
512 sectors for read-ahead.  Are your new fancy SSD drives using large
sectors?  If your read-ahead is really reading 512 x 4KB per random IO,
then that 2 MB per read seems like a lot of extra overhead.
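If you want to experiment, blockdev can read and change the value on the fly
(the device name is a placeholder, and the setting does not survive a reboot):

  # current read-ahead setting, in sectors
  blockdev --getra /dev/sdb
  # try a much smaller value for a random-read heavy SSD workload
  blockdev --setra 64 /dev/sdb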

-Bryan




On Thu, May 16, 2013 at 12:35 PM, Keith Wright kwri...@nanigans.com wrote:

 We actually have it set to 512.  I have tried decreasing my SSTable size
 to 5 MB and changing the chunk size to 8 kb

 From: Igor i...@4friends.od.ua
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Thursday, May 16, 2013 1:55 PM

 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Re: SSTable size versus read performance

 My 5 cents: I'd check blockdev --getra for data drives - too high values
 for readahead (default to 256 for debian) can hurt read performance.




Re: index_interval

2013-05-13 Thread Bryan Talbot
So will cassandra provide a way to limit its off-heap usage to avoid
unexpected OOM kills?  I'd much rather have performance degrade when 100%
of the index samples no longer fit in memory rather than the process being
killed with no way to stabilize it without adding hardware or removing data.

-Bryan


On Fri, May 10, 2013 at 7:44 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 If you use up your off-heap memory, Linux has an OOM killer that will kill
 random tasks.


 On Fri, May 10, 2013 at 11:34 AM, Bryan Talbot btal...@aeriagames.com wrote:

 If off-heap memory (for index samples, bloom filters, row caches, key
 caches, etc.) is exhausted, will cassandra experience a memory allocation
 error and quit?  If so, are there plans to make the off-heap usage more
 dynamic to allow less used pages to be replaced with hot data and the
 paged-out / cold data read back in again on demand?





Re: index_interval

2013-05-10 Thread Bryan Talbot
If off-heap memory (for index samples, bloom filters, row caches, key
caches, etc.) is exhausted, will cassandra experience a memory allocation
error and quit?  If so, are there plans to make the off-heap usage more
dynamic to allow less used pages to be replaced with hot data and the
paged-out / cold data read back in again on demand?

-Bryan



On Wed, May 8, 2013 at 4:24 PM, Jonathan Ellis jbel...@gmail.com wrote:

 index_interval won't be going away, but you won't need to change it as
 often in 2.0: https://issues.apache.org/jira/browse/CASSANDRA-5521

 On Mon, May 6, 2013 at 12:27 PM, Hiller, Dean dean.hil...@nrel.gov
 wrote:
  I heard a rumor that index_interval is going away?  What is the
 replacement for this?  (we have been having to play with this setting a lot
 lately as too big and it gets slow yet too small and cassandra uses way too
 much RAM…we are still trying to find the right balance with this setting).
 
  Thanks,
  Dean



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder, http://www.datastax.com
 @spyced



Re: Cassandra running High Load with no one using the cluster

2013-05-06 Thread Bryan Talbot
On Sat, May 4, 2013 at 9:22 PM, Aiman Parvaiz ai...@grapheffect.com wrote:


 When starting this cluster we set
  JVM_OPTS=$JVM_OPTS -Xss1000k




Why did you increase the stack size to 5.5 times greater than recommended?
 Since each thread now uses 1000KB minimum just for its stack, a large
number of threads will use a large amount of memory.
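A rough way to see what that adds up to on a running node (a sketch; it
assumes the usual /var/run/cassandra.pid location):

  # number of threads in the Cassandra JVM
  ps -o nlwp= -p $(cat /var/run/cassandra.pid)
  # worst-case stack memory in KB = thread count x 1000KB per stack
  echo $(( $(ps -o nlwp= -p $(cat /var/run/cassandra.pid)) * 1000 ))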

-Bryan


Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-26 Thread Bryan Talbot
I believe that nodetool rebuild is used to add a new datacenter, not just
a new host to an existing cluster.  Is that what you ran to add the node?

-Bryan



On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote:

 Small relief we're not the only ones that had this issue.

 We're going to try running a shuffle before adding a new node again...
 maybe that will help

 - John


 On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
 fsob...@igcorp.com.br wrote:

 I am using the same version and observed something similar.

 I've added a new node, but the instructions from Datastax did not work
 for me. Then I ran nodetool rebuild on the new node. After this command
 finished, it contained twice the load of the other nodes. Even when I
 ran nodetool cleanup on the older nodes, the situation was the same.

 The problem only seemed to disappear when nodetool repair was applied
 to all nodes.

 Regards,
 Francisco Sobral.




 On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:

 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running
 upgradesstables, I figured it would be safe to start adding nodes to the
 cluster. Guess not?

 It seems when new nodes join, they are streamed *all* sstables in the
 cluster.


 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png

 The gray line machine ran out of disk space and for some reason this cascaded
 into errors in the cluster about 'no host id' when trying to store hints
 for it (even though it hadn't joined yet).
 The purple line machine, I just stopped the joining process because the
 main cluster was dropping mutation messages at this point on a few nodes
 (and it still had dozens of sstables to stream.)

 I followed this:
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes

 Is there something missing in that documentation?

 Thanks,

 John






Re: Cassandra services down frequently [Version 1.1.4]

2013-04-04 Thread Bryan Talbot
On Thu, Apr 4, 2013 at 1:27 AM, adeel.ak...@panasiangroup.com wrote:


 After some time (1 hour / 2 hours) Cassandra shuts down services on one or two
 nodes with the following errors:



I wonder what the workload and schema are like ...

We can see from below that you've tweaked and disabled many of the memory
safety valve and other memory related settings.  Those could be causing
issues too.



 hinted_handoff_throttle_delay_in_ms: 0
 flush_largest_memtables_at: 1.0
 reduce_cache_sizes_at: 1.0
 reduce_cache_capacity_to: 0.6
 rpc_keepalive: true
 rpc_server_type: sync
 rpc_min_threads: 16
 rpc_max_threads: 2147483647
 in_memory_compaction_limit_in_mb: 256
 compaction_throughput_mb_per_sec: 16
 rpc_timeout_in_ms: 15000
 dynamic_snitch_badness_threshold: 0.0



Re: Timeseries data

2013-03-27 Thread Bryan Talbot
In the worst case, that is possible, but compaction strategies try to
minimize the number of SSTables that a row appears in, so a row being in ALL
SSTables is unlikely in most cases.

-Bryan



On Wed, Mar 27, 2013 at 12:17 PM, Kanwar Sangha kan...@mavenir.com wrote:

  Hi – I have a query on reads with Cassandra. We are planning to have a
 dynamic column family and each column would be based on a timeseries.

  Inserting data — key = 'xxx', {column_name = TimeUUID(now),
 :column_value = 'value' }, {column_name = TimeUUID(now), :column_value =
 'value' }, ...

  Now this key might be spread across multiple SSTables over a period of
 days. When we do a READ query to fetch, say, a slice of data from this row
 based on time X-Y, would it need to get data from ALL sstables?

  Thanks,
  Kanwar



Re: old data / tombstones are not deleted after ttl

2013-03-04 Thread Bryan Talbot
Those older files won't be included in a compaction until there are
min_compaction_threshold (4) files of that size.  When you get another
SSTable -Data.db file that is about 12-18GB, then you'll have 4 and they will
be compacted together into one new file.  At that time, if there are any
rows with only tombstones that are all older than gc_grace, the row will be
removed (assuming the row exists exclusively in the 4 input SSTables).
Columns with data that is more than TTL seconds old will be written with a
tombstone.  If the row does have column values in SSTables that are not
being compacted, the row will not be removed.
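If waiting for a fourth ~12-18GB file is the real problem, one hedge is to
lower the size-tiered minimum so those large files become compaction
candidates sooner, at the cost of extra compaction I/O (a sketch; the keyspace
name is a placeholder, "whatever" is the CF from your listing):

  # consider a bucket for compaction once it has 2 similar-sized files
  nodetool setcompactionthreshold my_keyspace whatever 2 32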


-Bryan


On Sun, Mar 3, 2013 at 11:07 PM, Matthias Zeilinger 
matthias.zeilin...@bwinparty.com wrote:

  Hi,

 ** **

 I´m running Cassandra 1.1.5 and have following issue.

 ** **

 I´m using a 10 days TTL on my CF. I can see a lot of tombstones in there,
 but they aren´t deleted after compaction.

 ** **

 I have tried a nodetool –cleanup and also a restart of Cassandra, but
 nothing happened.

 ** **

 total 61G

 drwxr-xr-x  2 cassandra dba  20K Mar  4 06:35 .

 drwxr-xr-x 10 cassandra dba 4.0K Dec 10 13:05 ..

 -rw-r--r--  1 cassandra dba  15M Dec 15 22:04 whatever-he-1398-CompressionInfo.db
 -rw-r--r--  1 cassandra dba  19G Dec 15 22:04 whatever-he-1398-Data.db
 -rw-r--r--  1 cassandra dba  15M Dec 15 22:04 whatever-he-1398-Filter.db
 -rw-r--r--  1 cassandra dba 357M Dec 15 22:04 whatever-he-1398-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Dec 15 22:04 whatever-he-1398-Statistics.db
 -rw-r--r--  1 cassandra dba 9.5M Feb  6 15:45 whatever-he-5464-CompressionInfo.db
 -rw-r--r--  1 cassandra dba  12G Feb  6 15:45 whatever-he-5464-Data.db
 -rw-r--r--  1 cassandra dba  48M Feb  6 15:45 whatever-he-5464-Filter.db
 -rw-r--r--  1 cassandra dba 736M Feb  6 15:45 whatever-he-5464-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Feb  6 15:45 whatever-he-5464-Statistics.db
 -rw-r--r--  1 cassandra dba 9.7M Feb 21 19:13 whatever-he-6829-CompressionInfo.db
 -rw-r--r--  1 cassandra dba  12G Feb 21 19:13 whatever-he-6829-Data.db
 -rw-r--r--  1 cassandra dba  47M Feb 21 19:13 whatever-he-6829-Filter.db
 -rw-r--r--  1 cassandra dba 792M Feb 21 19:13 whatever-he-6829-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Feb 21 19:13 whatever-he-6829-Statistics.db
 -rw-r--r--  1 cassandra dba 3.7M Mar  1 10:46 whatever-he-7578-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 4.3G Mar  1 10:46 whatever-he-7578-Data.db
 -rw-r--r--  1 cassandra dba  12M Mar  1 10:46 whatever-he-7578-Filter.db
 -rw-r--r--  1 cassandra dba 274M Mar  1 10:46 whatever-he-7578-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  1 10:46 whatever-he-7578-Statistics.db
 -rw-r--r--  1 cassandra dba 3.6M Mar  1 11:21 whatever-he-7582-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 4.3G Mar  1 11:21 whatever-he-7582-Data.db
 -rw-r--r--  1 cassandra dba 9.7M Mar  1 11:21 whatever-he-7582-Filter.db
 -rw-r--r--  1 cassandra dba 236M Mar  1 11:21 whatever-he-7582-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  1 11:21 whatever-he-7582-Statistics.db
 -rw-r--r--  1 cassandra dba 3.7M Mar  3 12:13 whatever-he-7869-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 4.3G Mar  3 12:13 whatever-he-7869-Data.db
 -rw-r--r--  1 cassandra dba 9.8M Mar  3 12:13 whatever-he-7869-Filter.db
 -rw-r--r--  1 cassandra dba 239M Mar  3 12:13 whatever-he-7869-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  3 12:13 whatever-he-7869-Statistics.db
 -rw-r--r--  1 cassandra dba 924K Mar  3 18:02 whatever-he-7953-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 1.1G Mar  3 18:02 whatever-he-7953-Data.db
 -rw-r--r--  1 cassandra dba 2.1M Mar  3 18:02 whatever-he-7953-Filter.db
 -rw-r--r--  1 cassandra dba  51M Mar  3 18:02 whatever-he-7953-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  3 18:02 whatever-he-7953-Statistics.db
 -rw-r--r--  1 cassandra dba 231K Mar  3 20:06 whatever-he-7974-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 268M Mar  3 20:06 whatever-he-7974-Data.db
 -rw-r--r--  1 cassandra dba 483K Mar  3 20:06 whatever-he-7974-Filter.db
 -rw-r--r--  1 cassandra dba  12M Mar  3 20:06 whatever-he-7974-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  3 20:06 whatever-he-7974-Statistics.db
 -rw-r--r--  1 cassandra dba 116K Mar  4 06:28 whatever-he-8002-CompressionInfo.db
 -rw-r--r--  1 cassandra dba 146M Mar  4 06:28 whatever-he-8002-Data.db
 -rw-r--r--  1 cassandra dba 646K Mar  4 06:28 whatever-he-8002-Filter.db
 -rw-r--r--  1 cassandra dba  16M Mar  4 06:28 whatever-he-8002-Index.db
 -rw-r--r--  1 cassandra dba 4.3K Mar  4 06:28 whatever-he-8002-Statistics.db
 -rw-r--r--  1 cassandra dba  58K Mar  4 06:28 whatever-he-8003-CompressionInfo.db
 -rw-r--r--  1 cassandra 

Re: Reading old data problem

2013-02-28 Thread Bryan Talbot
On Thu, Feb 28, 2013 at 5:08 PM, Víctor Hugo Oliveira Molinar 
vhmoli...@gmail.com wrote:

 Ok guys let me try to ask it in a different way:

 Will repair totally ensure a data synchronism among nodes?


If there are no writes happening on the cluster then yes.  Otherwise, the
answer is it depends since all the normal things that lead to
inconsistencies can still happen.





 Extra question:
 Once I write at CL=All, will C* ensure that I can read from ANY node
 without an inconsistency? The reverse state, writing at CL=One but reading
 at CL=All will also ensure that?



You can get consistent behavior if CL.read + CL.write > RF.  So since you
have just 2 nodes and RF=2, you'd need to have at least CL.read=2 and
CL.write=1, or CL.read=1 and CL.write=2.

-Bryan





 On Wed, Feb 27, 2013 at 11:24 PM, Víctor Hugo Oliveira Molinar 
 vhmoli...@gmail.com wrote:

 Hello, I need some help to manage my live cluster!

  I'm currently running a cluster with 2 nodes, RF:2, CL:1.
  Since I'm limited by hardware upgrade issues, I'm not able to increase my
  ConsistencyLevel for now.

  Anyway, I ran a full repair on each node of the cluster followed by a
  flush. However, I'm still reading old data when performing queries.

  Well, it's known that I might read old data during normal operations, but
  shouldn't it be in sync after the full anti-entropy repair?
  What am I missing?

 Thanks in advance!





Re: heap usage

2013-02-15 Thread Bryan Talbot
Aren't bloom filters kept off heap in 1.2?
https://issues.apache.org/jira/browse/CASSANDRA-4865

Disabling bloom filters also disables tombstone removal, so don't
disable them if you delete anything.

https://issues.apache.org/jira/browse/CASSANDRA-5182

I believe that the index samples (by default every 128th entry) are still
kept in memory, so your JVM memory will scale with the number of rows
stored.  Additional memory is used for every keyspace and CF too, so if you
have thousands of CFs that could be an issue.
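Until then, the main knob for the row-scaling part is index_interval in
cassandra.yaml (a sketch; 128 is the default, and larger values trade a little
extra read I/O for less memory):

  # sample 1 of every 256 index entries instead of 1 in 128
  index_interval: 256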

-Bryan



On Fri, Feb 15, 2013 at 8:16 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 It is not going to be true for long that LCS does not require bloom
 filters.

 https://issues.apache.org/jira/browse/CASSANDRA-5029

 Apparently, without bloom filters there are issues.

 On Fri, Feb 15, 2013 at 7:29 AM, Blake Manders bl...@crosspixel.net
 wrote:
 
  You probably want to look at your bloom filters.  Be forewarned though,
  they're difficult to change; changes to bloom filter settings only apply
 to
  new SSTables, so they might not be noticeable until a few compactions
 have
  taken place.
 
  If that is your issue, and your usage model fits it, a good alternative
 to
  the slow propagation of higher miss rates is to switch to LCS (which
 doesn't
  use bloom filters), which won't require you to make the jump to 1.2.
 
 
  On Fri, Feb 15, 2013 at 4:06 AM, Reik Schatz reik.sch...@gmail.com
 wrote:
 
  Hi,
 
  recently we are hitting some OOM: Java heap space, so I was
 investigating
  how the heap is used in Cassandra 1.2+
 
  We use the calculated 4G heap. Our cluster is 6 nodes, around 750 GB
 data
  and a replication factor of 3. Row cache is disabled. All key cache and
  memtable settings are left at default.
 
  Is the primary key index kept in heap memory? We have a bunch of
 keyspaces
  and column families.
 
  Thanks,
  Rik
 
 
 
 
  --
 
  Blake Manders | CTO
 
  Cross Pixel, Inc. | 494 8th Ave, Penthouse | NYC 10001
 
  Website: crosspixel.net
  Twitter: twitter.com/CrossPix



Re: Deletion consistency

2013-02-15 Thread Bryan Talbot
With a RF and CL of one, there is no replication so there can be no issue
with distributed deletes.  Writes (and reads) can only go to the one host
that has the data and will be refused if that node is down.  I'd guess that
your app isn't deleting records when you think that it is, or that the
delete is failing but not being detected as failed.

-Bryan



On Fri, Feb 15, 2013 at 10:21 AM, Mike mthero...@yahoo.com wrote:

 If you increase the number of nodes to 3, with an RF of 3, then you should
 be able to read/delete utilizing a quorum consistency level, which I
 believe will help here.  Also, make sure the time of your servers are in
 sync, utilizing NTP, as drifting time between you client and server could
 cause updates to be mistakenly dropped for being old.

 Also, make sure you are running with a gc_grace period that is high
 enough.  The default is 10 days.

 Hope this helps,
 -Mike


 On 2/15/2013 1:13 PM, Víctor Hugo Oliveira Molinar wrote:

 hello everyone!

 I have a column family filled with event objects which need to be
 processed by query threads.
 Once each thread queries for those objects (spread among columns below a
 row), it performs a delete operation for each object in cassandra.
 It's done in order to ensure that these events wont be processed again.
 Some tests have shown me that it works, but sometimes I'm not getting
 those events deleted. I checked it through cassandra-cli, etc.

 So, reading it 
 (http://wiki.apache.org/cassandra/DistributedDeletes)
 I came to a conclusion that I may be reading old data.
 My cluster is currently configured as: 2 nodes, RF1, CL 1.
 In that case, what should I do?

 - Increase the consistency level for the write operations( in that case,
 the deletions ). In order to ensure that those deletions are stored in all
 nodes.
 or
 - Increase the consistency level for the read operations. In order to
 ensure that I'm reading only those yet processed events(deleted).

 ?

 -
 Thanks in advance






Re: Cluster not accepting insert while one node is down

2013-02-14 Thread Bryan Talbot
Generally data isn't written to whatever node the client connects to.  In
your case, a row is written to one of the nodes based on the hash of the
row key.  If that one replica node is down, it won't matter which
coordinator node you use to attempt a write with CL.ONE: the write will fail.

If you want the write to succeed, you could do any one of: write with
CL.ANY, increase RF to 2+, write using a row key that hashes to an UP node.

-Bryan
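
If the goal is just to keep writes available while one of the five nodes is
down, a sketch of the usual fix is to raise RF and then repair (cassandra-cli
syntax of the 1.x era shown here; adjust the names and the target RF to taste):

  $ cassandra-cli -h 10.60.15.66
  [default@unknown] use TestSpace;
  [default@TestSpace] update keyspace TestSpace with strategy_options = {datacenter1 : 2};

  $ nodetool -h 10.60.15.66 repair TestSpace    # stream the new replicas into place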



On Thu, Feb 14, 2013 at 2:06 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 I will let commiters or anyone that has knowledge on Cassandra internal
 answer this.

 From what I understand, you should be able to insert data on any up node
 with your configuration...

 Alain


 2013/2/14 Traian Fratean traian.frat...@gmail.com

 You're right as regarding data availability on that node. And my config,
 being the default one, is not suited for a cluster.
 What I don't get is that my 67 node was down and I was trying to insert
 in 66 node, as can be seen from the stacktrace. Long story short: when node
 67 was down I could not insert into any machine in the cluster. Not what I
 was expecting.

 Thank you for the reply!
 Traian.

 2013/2/14 Alain RODRIGUEZ arodr...@gmail.com

 Hi Traian,

 There is your problem. You are using RF=1, meaning that each node is
 responsible for its range, and nothing more. So when a node goes down, do
 the math, you just can't read 1/5 of your data.

 This is very good for performance since each node owns its own part of
 the data and any write or read needs to reach only one node, but it leaves
 you with a single point of failure, and avoiding a SPOF is a main point of
 using C*. So you have poor availability and poor consistency.

 A usual configuration with 5 nodes would be RF=3 and both CL (R/W) =
 QUORUM.

 This will replicate your data to 2 nodes + the natural endpoint (a total
 of 3/5 nodes owning any data) and any read or write would need to reach at
 least 2 nodes before being considered successful, ensuring strong
 consistency.

 This configuration allows you to shut down a node (crash or configuration
 update/rolling restart) without degrading the service (at least allowing
 you to reach any data) but at the cost of more data on each node.

 Alain


 2013/2/14 Traian Fratean traian.frat...@gmail.com

 I am using defaults for both RF and CL. As the keyspace was created
 using cassandra-cli the default RF should be 1 as I get it from below:

 [default@TestSpace] describe;
 Keyspace: TestSpace:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [datacenter1:1]

 As for the CL it the Astyanax default, which is 1 for both reads and
 writes.

 Traian.


 2013/2/13 Alain RODRIGUEZ arodr...@gmail.com

 We probably need more info like the RF of your cluster and CL of your
 reads and writes. Maybe could you also tell us if you use vnodes or not.

 I heard that Astyanax was not running very smoothly on 1.2.0, but a
 bit better on 1.2.1. Yet, Netflix didn't release a version of Astyanax for
 C*1.2.

 Alain


 2013/2/13 Traian Fratean traian.frat...@gmail.com

 Hi,

 I have a cluster of 5 nodes running Cassandra 1.2.0 . I have a Java
 client with Astyanax 1.56.21.
 When a node (10.60.15.67 - *different* from the one in the stacktrace
 below) went down I get TokenRandeOfflineException and no other data gets
 inserted into *any other* node from the cluster.

 Am I having a configuration issue or this is supposed to happen?


 com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor.trackError(CountingConnectionPoolMonitor.java:81)
 -
 com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException:
 TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160,
 latency=2057(2057), attempts=1]UnavailableException()
 com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException:
 TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160,
 latency=2057(2057), attempts=1]UnavailableException()
 at
 com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165)
  at
 com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
 at
 com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:27)
  at
 com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$1.execute(ThriftSyncConnectionFactoryImpl.java:140)
 at
 com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69)
  at
 com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:255)



 Thank you,
 Traian.









Re: Upgrade from 0.6.x to 1.2.x

2013-02-07 Thread Bryan Talbot
Wow, that's pretty ambitious, expecting an upgrade which skips 4 major
versions (0.7, 0.8, 1.0, 1.1) to work.

I think you're going to have to follow the upgrade path for each of those
intermediate steps and not upgrade in one big jump.

-Bryan
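
Roughly, the path looks like this (a sketch only -- NEWS.txt for each release
spells out the exact per-hop steps, e.g. whether to run scrub or
upgradesstables, so read it before every hop):

  0.6.x -> 0.7.x          # storage-conf.xml becomes cassandra.yaml; schema loaded once
  0.7.x -> 0.8.x / 1.0.x  # check NEWS.txt for whether the 0.8 stop is required
  1.0.x -> 1.1.x
  1.1.x -> 1.2.x

  # per hop and per node: flush/drain, stop, install the next major version,
  # merge the config, start, then rewrite sstables as that release recommends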



On Thu, Feb 7, 2013 at 3:41 AM, Sergey Leschenko sergle...@gmail.comwrote:

 Hi, all

 I'm trying to update our old version 0.6.5 to current 1.2.1
 All nodes has been drained and stopped. Proper cassandra.yaml created,
 schema file prepared.

 Trying to start version 1.2.1 on the one node  (full output attached to
 email):
 ...
 ERROR 11:12:44,530 Exception encountered during startup
 java.lang.NullPointerException
 at
 org.apache.cassandra.db.SystemTable.upgradeSystemData(SystemTable.java:161)
 at
 org.apache.cassandra.db.SystemTable.finishStartup(SystemTable.java:107)
 at
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:276)
 at
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
 at
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
 java.lang.NullPointerException
 at
 org.apache.cassandra.db.SystemTable.upgradeSystemData(SystemTable.java:161)
 at
 org.apache.cassandra.db.SystemTable.finishStartup(SystemTable.java:107)
 at
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:276)
 at
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
 at
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
 Exception encountered during startup: null

 On the next attempts daemon started, but still with AssertionErrors

 Question 1 - is it possible to start the new version on the first attempt?


 Then I loaded schema via cassandra-cli, and run nodetool scrub - which
 caused a big number of  warnings in log:
OutputHandler.java (line 52) Index file contained a different key
 or row size; using key from data file

 storage-conf.xml from 0.6.5 has column family defined as
    <ColumnFamily Name="Invoices" CompareWith="BytesType"/>
 for 1.2.1 I used
   create column family Invoices with column_type = 'Standard' and
 comparator = 'BytesType';

 Question 2 - how to get rid of these warnings? Are they connected to
 column family definition?

 Thanks

 --
 Sergey



Re: too many warnings of Heap is full

2013-01-30 Thread Bryan Talbot
My guess is that those one or two nodes with the gc pressure also have more
rows in your big CF.  More rows could be due to imbalanced distribution if
you're not using a random partitioner or from those nodes not yet removing
deleted rows which other nodes may have done.

JVM heap space is used for a few things which scale with key count
including:
- bloom filter (for C* < 1.2)
- index samples

Other space is used but can be more easily controlled by tuning for
- memtable
- compaction
- key cache
- row cache


So, if those nodes have more rows (check using nodetool ring or nodetool
cfstats) than the others you can try to:
- reduce the number of rows by adding nodes, run manual / tune compactions
to remove rows with expired tombstones, etc.
- increase bloom filter fp chance
- increase jvm heap size (don't go too big)
- disable key or row cache
- increase index sample interval

Not all of those things are generally good, especially taken to the extreme,
so don't go setting a 20 GB JVM heap without understanding
the consequences, for example.

-Bryan
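
A quick way to compare row counts and bloom filter footprints across the nodes
(field names are from 1.0/1.1-era nodetool output; hostnames as the OP refers
to them, so substitute your own):

  $ nodetool -h cassNode2 cfstats | grep -E 'Column Family:|Number of Keys|Bloom Filter Space Used'
  $ nodetool -h cassNode2 ring     # compare the Load column across the ring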


On Wed, Jan 30, 2013 at 3:47 AM, Guillermo Barbero 
guillermo.barb...@spotbros.com wrote:

 Hi,

   I'm seeing a weird behaviour in my cassandra cluster. Most of the
 warning messages are due to Heap is % full. According to this link
 (
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassndra-1-0-6-GC-query-tt7323457.html
 )
 there are two ways to reduce pressure:
 1. Decrease the cache sizes
 2. Increase the index interval size

 Most of the flushes are in two column families (users and messages), I
 guess that's because the most mutations are there.

 I still have not applied those changes to the production environment.
 Do you recommend any other measure? Should I set specific tuning for
 these two CFs? Should I check another metric?

 Additionally, the distribution of warning messages is not uniform
 along the cluster. Why could cassandra be doing this? What should I do
 to find out how to fix this?

 cassandra runs on a 6 node cluster of m1.xlarge machines (Amazon EC2)
 the java version is the following:
 java version 1.6.0_37
 Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
 Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

 The cassandra system.log is summarized here (number of messages, cassandra
 node, class that reports the message, first word of the message)
 2013-01-26
   5 cassNode0: GCInspector.java Heap
   5 cassNode0: StorageService.java Flushing
 232 cassNode2: GCInspector.java Heap
 232 cassNode2: StorageService.java Flushing
 104 cassNode3: GCInspector.java Heap
 104 cassNode3: StorageService.java Flushing
   3 cassNode4: GCInspector.java Heap
   3 cassNode4: StorageService.java Flushing
   3 cassNode5: GCInspector.java Heap
   3 cassNode5: StorageService.java Flushing

 2013-01-27
   2 cassNode0: GCInspector.java Heap
   2 cassNode0: StorageService.java Flushing
   3 cassNode1: GCInspector.java Heap
   3 cassNode1: StorageService.java Flushing
 189 cassNode2: GCInspector.java Heap
 189 cassNode2: StorageService.java Flushing
 104 cassNode3: GCInspector.java Heap
 104 cassNode3: StorageService.java Flushing
   1 cassNode4: GCInspector.java Heap
   1 cassNode4: StorageService.java Flushing
   1 cassNode5: GCInspector.java Heap
   1 cassNode5: StorageService.java Flushing

 2013-01-28
   2 cassNode0: GCInspector.java Heap
   2 cassNode0: StorageService.java Flushing
   1 cassNode1: GCInspector.java Heap
   1 cassNode1: StorageService.java Flushing
   1 cassNode2: AutoSavingCache.java Reducing
 343 cassNode2: GCInspector.java Heap
 342 cassNode2: StorageService.java Flushing
 181 cassNode3: GCInspector.java Heap
 181 cassNode3: StorageService.java Flushing
   4 cassNode4: GCInspector.java Heap
   4 cassNode4: StorageService.java Flushing
   3 cassNode5: GCInspector.java Heap
   3 cassNode5: StorageService.java Flushing

 2013-01-29
   2 cassNode0: GCInspector.java Heap
   2 cassNode0: StorageService.java Flushing
   3 cassNode1: GCInspector.java Heap
   3 cassNode1: StorageService.java Flushing
 156 cassNode2: GCInspector.java Heap
 156 cassNode2: StorageService.java Flushing
  71 cassNode3: GCInspector.java Heap
  71 cassNode3: StorageService.java Flushing
   2 cassNode4: GCInspector.java Heap
   2 cassNode4: StorageService.java Flushing
   2 cassNode5: GCInspector.java Heap
   1 cassNode5: Memtable.java setting
   2 cassNode5: StorageService.java Flushing

 --

 Guillermo Barbero - Backend Team

 Spotbros Technologies



Re: LCS not removing rows with all TTL expired columns

2013-01-22 Thread Bryan Talbot
It turns out that having gc_grace=0 isn't required to produce the problem.
 My colleague did a lot of digging into the compaction code and we think
he's found the issue.  It's detailed in
https://issues.apache.org/jira/browse/CASSANDRA-5182

Basically tombstones for a row will not be removed from an SSTable during
compaction if the row appears in other SSTables; however, the compaction
code checks the bloom filters to make this determination.  Since this data
is rarely read we had the bloom_filter_fp_ratio set to 1.0 which makes rows
seem to appear in every SSTable as far as compaction is concerned.

This caused our data to essentially never be removed when using either STCS
or LCS and will probably affect anyone else running 1.1 with high bloom
filter fp ratios.

Setting our fp ratio to 0.1, running upgradesstables and running the
application as it was before seems to have stabilized the load as desired
at the expense of additional jvm memory.

-Bryan
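
For reference, the workaround described above boils down to something like this
(cassandra-cli / nodetool sketch, using the keyspace and CF names from this
thread):

  [default@metrics] update column family request_summary with bloom_filter_fp_chance = 0.1;

  $ nodetool -h localhost upgradesstables metrics request_summary
  # rewrites the existing sstables so the new bloom filters take effect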


On Thu, Jan 17, 2013 at 6:50 PM, Bryan Talbot btal...@aeriagames.comwrote:

 Bleh, I rushed out the email before some meetings and I messed something
 up.  Working on reproducing now with better notes this time.

 -Bryan



 On Thu, Jan 17, 2013 at 4:45 PM, Derek Williams de...@fyrie.net wrote:

 When you ran this test, is that the exact schema you used? I'm not seeing
 where you are setting gc_grace to 0 (although I could just be blind, it
 happens).


 On Thu, Jan 17, 2013 at 5:01 PM, Bryan Talbot btal...@aeriagames.comwrote:

 I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7,
 1.1.8, a trivial schema, and a simple script that just inserts rows.  If
 the TTL is small enough so that all LCS data fits in generation 0 then the
 rows seem to be removed as TTLs expire, as desired.  However, if the
 insertion rate is high enough or the TTL long enough then the data keep
 accumulating for far longer than expected.

 Using 120 second TTL and a single threaded php insertion script my MBP
 with SSD retained almost all of the data.  120 seconds should accumulate
 5-10 MB of data.  I would expect that TTL rows to be removed eventually and
 for the cassandra load to level off at some reasonable value near 10 MB.
  After running for 2 hours and with a cassandra load of ~550 MB I stopped
 the test.

 The schema is

 create keyspace test
   with placement_strategy = 'SimpleStrategy'
   and strategy_options = {replication_factor : 1}
   and durable_writes = true;

 use test;

 create column family test
   with column_type = 'Standard'
   and comparator = 'UTF8Type'
   and default_validation_class = 'UTF8Type'
   and key_validation_class = 'TimeUUIDType'
   and compaction_strategy =
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
   and caching = 'NONE'
   and bloom_filter_fp_chance = 1.0
   and column_metadata = [
 {column_name : 'a',
 validation_class : LongType}];


 and the insert script is

 <?php

 require_once('phpcassa/1.0.a.5/autoload.php');

 use phpcassa\Connection\ConnectionPool;
 use phpcassa\ColumnFamily;
 use phpcassa\SystemManager;
 use phpcassa\UUID;

 // Connect to test keyspace and column family
 $sys = new SystemManager('127.0.0.1');

 // Start a connection pool, create our ColumnFamily instance
 $pool = new ConnectionPool('test', array('127.0.0.1'));
 $testCf = new ColumnFamily($pool, 'test');

 // Insert records
 while( 1 ) {
   $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
 }

 // Close our connections
 $pool->close();
 $sys->close();

 ?>


 -Bryan




 On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot 
 btal...@aeriagames.comwrote:

 We are using LCS and the particular row I've referenced has been
 involved in several compactions after all columns have TTL expired.  The
 most recent one was again this morning and the row is still there -- TTL
 expired for several days now with gc_grace=0 and several compactions later
 ...


 $ ./bin/nodetool -h localhost getsstables metrics request_summary
 459fb460-5ace-11e2-9b92-11d67b6163b4

 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db

 $ ls -alF
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -rw-rw-r-- 1 sandra sandra 5246509 Jan 17 06:54
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db


 $ ./bin/sstable2json
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 
 %x')
  {
 34353966623436302d356163652d313165322d396239322d313164363762363136336234:
 [[app_name,50f21d3d,1357785277207001,d],
 [client_ip,50f21d3d,1357785277207001,d],
 [client_req_id,50f21d3d,1357785277207001,d],
 [mysql_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_duration_us,50f21d3d,1357785277207001,d],
 [mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_success_call_cnt,50f21d3d,1357785277207001,d],
 [req_duration_us,50f21d3d

Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot

 On 17/01/2013, at 2:55 PM, Bryan Talbot btal...@aeriagames.com wrote:



 

 According to the timestamps (see original post) the SSTable was written
 (thus compacted) 3 days after all columns for that row had
 expired and 6 days after the row was created; yet all columns are still
 showing up in the SSTable.  Note that a get for that key shows no rows,
 so that part is working correctly, but the data is
 lugged around far longer than it should be -- maybe forever.

 -Bryan


 On Wed, Jan 16, 2013 at 5:44 PM, Andrey Ilinykh ailin...@gmail.com
 wrote:

 To get column removed you have to meet two requirements 

 1. column should be expired

 2. after that CF gets compacted


 I guess your expired columns are propagated to high tier CF, which gets
 compacted rarely.

 So, you have to wait when high tier CF gets compacted.  


 Andrey


 On Wed, Jan 16, 2013 at 11:39 AM, Bryan Talbot btal...@aeriagames.com
 wrote:

 On cassandra 1.1.5 with a write heavy workload, we're having problems
 getting rows to be compacted away (removed) even though all columns have
 expired TTL.  We've tried size tiered and now leveled and are seeing the
 same symptom: the data stays around essentially forever.  


 Currently we write all columns with a TTL of 72 hours (259200 seconds) and
 expect to add 10 GB of data to this CF per day per node.  Each node
 currently has 73 GB for the affected CF and shows no indications that old
 rows will be removed on their own.


 Why aren't rows being removed?  Below is some data from a sample row which
 should have been removed several days ago but is still around even though
 it has been involved in numerous compactions since being expired.


 $ ./bin/nodetool -h localhost getsstables metrics request_summary
 459fb460-5ace-11e2-9b92-11d67b6163b4


 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 


 $ ls -alF
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 

 -rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 


 $ ./bin/sstable2json
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 %x')
 

 {

 34353966623436302d356163652d313165322d396239322d313164363762363136336234:
 [[app_name,50f21d3d,1357785277207001,d],
 [client_ip,50f21d3d,1357785277207001,d],
 [client_req_id,50f21d3d,1357785277207001,d],
 [mysql_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_duration_us,50f21d3d,1357785277207001,d],
 [mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_success_call_cnt,50f21d3d,1357785277207001,d],
 [req_duration_us,50f21d3d,1357785277207001,d],
 [req_finish_time_us,50f21d3d,1357785277207001,d],
 [req_method,50f21d3d,1357785277207001,d],
 [req_service,50f21d3d,1357785277207001,d],
 [req_start_time_us,50f21d3d,1357785277207001,d],
 [success,50f21d3d,1357785277207001,d]]

 }


 Decoding the column timestamps shows that the columns were written at
 Thu, 10 Jan 2013 02:34:37 GMT and that their TTL expired at Sun, 13 Jan
 2013 02:34:37 GMT.  The date of the SSTable shows that it was generated on
 Jan 16 which is 3 days after all columns have TTL-ed out.


 The schema shows that gc_grace is set to 0 since this data is write-once,
 read-seldom and is never updated or deleted.


 create column family request_summary

   with column_type = 'Standard'

   and comparator = 'UTF8Type'

   and default_validation_class = 'UTF8Type'

   and key_validation_class = 'UTF8Type'

   and read_repair_chance = 0.1

   and dclocal_read_repair_chance = 0.0

   and gc_grace = 0

   and min_compaction_threshold = 4

   and max_compaction_threshold = 32

   and replicate_on_write = true

   and compaction_strategy =
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'

   and caching = 'NONE'

   and bloom_filter_fp_chance = 1.0

   and compression_options = {'chunk_length_kb' : '64',
 'sstable_compression' :
 'org.apache.cassandra.io.compress.SnappyCompressor'};


 Thanks in advance for help in understanding why rows such as this are not
 removed!


 -Bryan


Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot
I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7, 1.1.8,
a trivial schema, and a simple script that just inserts rows.  If the TTL
is small enough so that all LCS data fits in generation 0 then the rows
seem to be removed as TTLs expire, as desired.  However, if the insertion
rate is high enough or the TTL long enough then the data keep accumulating
for far longer than expected.

Using 120 second TTL and a single threaded php insertion script my MBP with
SSD retained almost all of the data.  120 seconds should accumulate 5-10 MB
of data.  I would expect that TTL rows to be removed eventually and for the
cassandra load to level off at some reasonable value near 10 MB.  After
running for 2 hours and with a cassandra load of ~550 MB I stopped the test.

The schema is

create keyspace test
  with placement_strategy = 'SimpleStrategy'
  and strategy_options = {replication_factor : 1}
  and durable_writes = true;

use test;

create column family test
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'TimeUUIDType'
  and compaction_strategy =
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and bloom_filter_fp_chance = 1.0
  and column_metadata = [
{column_name : 'a',
validation_class : LongType}];


and the insert script is

<?php

require_once('phpcassa/1.0.a.5/autoload.php');

use phpcassa\Connection\ConnectionPool;
use phpcassa\ColumnFamily;
use phpcassa\SystemManager;
use phpcassa\UUID;

// Connect to test keyspace and column family
$sys = new SystemManager('127.0.0.1');

// Start a connection pool, create our ColumnFamily instance
$pool = new ConnectionPool('test', array('127.0.0.1'));
$testCf = new ColumnFamily($pool, 'test');

// Insert records
while( 1 ) {
  $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
}

// Close our connections
$pool->close();
$sys->close();

?>


-Bryan




On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot btal...@aeriagames.comwrote:

 We are using LCS and the particular row I've referenced has been involved
 in several compactions after all columns have TTL expired.  The most recent
 one was again this morning and the row is still there -- TTL expired for
 several days now with gc_grace=0 and several compactions later ...


 $ ./bin/nodetool -h localhost getsstables metrics request_summary
 459fb460-5ace-11e2-9b92-11d67b6163b4

 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db

 $ ls -alF
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -rw-rw-r-- 1 sandra sandra 5246509 Jan 17 06:54
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db


 $ ./bin/sstable2json
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 %x')
 {
 34353966623436302d356163652d313165322d396239322d313164363762363136336234:
 [[app_name,50f21d3d,1357785277207001,d],
 [client_ip,50f21d3d,1357785277207001,d],
 [client_req_id,50f21d3d,1357785277207001,d],
 [mysql_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_duration_us,50f21d3d,1357785277207001,d],
 [mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_success_call_cnt,50f21d3d,1357785277207001,d],
 [req_duration_us,50f21d3d,1357785277207001,d],
 [req_finish_time_us,50f21d3d,1357785277207001,d],
 [req_method,50f21d3d,1357785277207001,d],
 [req_service,50f21d3d,1357785277207001,d],
 [req_start_time_us,50f21d3d,1357785277207001,d],
 [success,50f21d3d,1357785277207001,d]]
 }


 My experience with TTL columns so far has been pretty similar to Viktor's
 in that the only way to keep the row count under control is to force major
 compactions.  In real world use, STCS and LCS both leave TTL expired rows
 around forever as far as I can tell.  When testing with minimal data,
 removal of TTL expired rows seem to work as expected but in this case there
 seems to be some divergence from real life work and test samples.

 -Bryan




 On Thu, Jan 17, 2013 at 1:47 AM, Viktor Jevdokimov 
 viktor.jevdoki...@adform.com wrote:

  @Bryan,


 To keep data size as low as possible with TTL columns we still use STCS
 and nightly major compactions.


 Experience with LCS was not successful in our case, data size keeps too
 high along with amount of compactions.


 IMO, before 1.2, LCS was good for CFs without TTL or high delete rate. I
 have not tested 1.2 LCS behavior, we’re still on 1.0.x

Best regards / Pagarbiai
 *Viktor Jevdokimov*
 Senior Developer

 Email: viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063, Fax +370 5 261 0453
 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
 Follow us on Twitter: @adforminsider http://twitter.com/#!/adforminsider
 Take a ride with Adform's Rich Media Suite http://vimeo.com/adform

Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot
Bleh, I rushed out the email before some meetings and I messed something
up.  Working on reproducing now with better notes this time.

-Bryan



On Thu, Jan 17, 2013 at 4:45 PM, Derek Williams de...@fyrie.net wrote:

 When you ran this test, is that the exact schema you used? I'm not seeing
 where you are setting gc_grace to 0 (although I could just be blind, it
 happens).


 On Thu, Jan 17, 2013 at 5:01 PM, Bryan Talbot btal...@aeriagames.comwrote:

 I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7,
 1.1.8, a trivial schema, and a simple script that just inserts rows.  If
 the TTL is small enough so that all LCS data fits in generation 0 then the
 rows seem to be removed as TTLs expire, as desired.  However, if the
 insertion rate is high enough or the TTL long enough then the data keep
 accumulating for far longer than expected.

 Using 120 second TTL and a single threaded php insertion script my MBP
 with SSD retained almost all of the data.  120 seconds should accumulate
 5-10 MB of data.  I would expect that TTL rows to be removed eventually and
 for the cassandra load to level off at some reasonable value near 10 MB.
  After running for 2 hours and with a cassandra load of ~550 MB I stopped
 the test.

 The schema is

 create keyspace test
   with placement_strategy = 'SimpleStrategy'
   and strategy_options = {replication_factor : 1}
   and durable_writes = true;

 use test;

 create column family test
   with column_type = 'Standard'
   and comparator = 'UTF8Type'
   and default_validation_class = 'UTF8Type'
   and key_validation_class = 'TimeUUIDType'
   and compaction_strategy =
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
   and caching = 'NONE'
   and bloom_filter_fp_chance = 1.0
   and column_metadata = [
 {column_name : 'a',
 validation_class : LongType}];


 and the insert script is

 <?php

 require_once('phpcassa/1.0.a.5/autoload.php');

 use phpcassa\Connection\ConnectionPool;
 use phpcassa\ColumnFamily;
 use phpcassa\SystemManager;
 use phpcassa\UUID;

 // Connect to test keyspace and column family
 $sys = new SystemManager('127.0.0.1');

 // Start a connection pool, create our ColumnFamily instance
 $pool = new ConnectionPool('test', array('127.0.0.1'));
 $testCf = new ColumnFamily($pool, 'test');

 // Insert records
 while( 1 ) {
   $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
 }

 // Close our connections
 $pool->close();
 $sys->close();

 ?>


 -Bryan




 On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot btal...@aeriagames.comwrote:

 We are using LCS and the particular row I've referenced has been
 involved in several compactions after all columns have TTL expired.  The
 most recent one was again this morning and the row is still there -- TTL
 expired for several days now with gc_grace=0 and several compactions later
 ...


 $ ./bin/nodetool -h localhost getsstables metrics request_summary
 459fb460-5ace-11e2-9b92-11d67b6163b4

 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db

 $ ls -alF
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -rw-rw-r-- 1 sandra sandra 5246509 Jan 17 06:54
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db


 $ ./bin/sstable2json
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
 -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 %x')
  {
 34353966623436302d356163652d313165322d396239322d313164363762363136336234:
 [[app_name,50f21d3d,1357785277207001,d],
 [client_ip,50f21d3d,1357785277207001,d],
 [client_req_id,50f21d3d,1357785277207001,d],
 [mysql_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_duration_us,50f21d3d,1357785277207001,d],
 [mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_success_call_cnt,50f21d3d,1357785277207001,d],
 [req_duration_us,50f21d3d,1357785277207001,d],
 [req_finish_time_us,50f21d3d,1357785277207001,d],
 [req_method,50f21d3d,1357785277207001,d],
 [req_service,50f21d3d,1357785277207001,d],
 [req_start_time_us,50f21d3d,1357785277207001,d],
 [success,50f21d3d,1357785277207001,d]]
 }


 My experience with TTL columns so far has been pretty similar to
 Viktor's in that the only way to keep the row count under control is to
 force major compactions.  In real world use, STCS and LCS both leave TTL
 expired rows around forever as far as I can tell.  When testing with
 minimal data, removal of TTL expired rows seem to work as expected but in
 this case there seems to be some divergence from real life work and test
 samples.

 -Bryan




 On Thu, Jan 17, 2013 at 1:47 AM, Viktor Jevdokimov 
 viktor.jevdoki...@adform.com wrote:

  @Bryan,


 To keep data size as low as possible with TTL columns we still use STCS
 and nightly major compactions.


 Experience with LCS was not successful in our case, data size keeps too
 high along with amount of compactions

LCS not removing rows with all TTL expired columns

2013-01-16 Thread Bryan Talbot
On cassandra 1.1.5 with a write heavy workload, we're having problems
getting rows to be compacted away (removed) even though all columns have
expired TTL.  We've tried size tiered and now leveled and are seeing the
same symptom: the data stays around essentially forever.

Currently we write all columns with a TTL of 72 hours (259200 seconds) and
expect to add 10 GB of data to this CF per day per node.  Each node
currently has 73 GB for the affected CF and shows no indications that old
rows will be removed on their own.

Why aren't rows being removed?  Below is some data from a sample row which
should have been removed several days ago but is still around even though
it has been involved in numerous compactions since being expired.

$ ./bin/nodetool -h localhost getsstables metrics request_summary
459fb460-5ace-11e2-9b92-11d67b6163b4
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

$ ls -alF
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
-rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

$ ./bin/sstable2json
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
-k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 %x')
{
34353966623436302d356163652d313165322d396239322d313164363762363136336234:
[[app_name,50f21d3d,1357785277207001,d],
[client_ip,50f21d3d,1357785277207001,d],
[client_req_id,50f21d3d,1357785277207001,d],
[mysql_call_cnt,50f21d3d,1357785277207001,d],
[mysql_duration_us,50f21d3d,1357785277207001,d],
[mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
[mysql_success_call_cnt,50f21d3d,1357785277207001,d],
[req_duration_us,50f21d3d,1357785277207001,d],
[req_finish_time_us,50f21d3d,1357785277207001,d],
[req_method,50f21d3d,1357785277207001,d],
[req_service,50f21d3d,1357785277207001,d],
[req_start_time_us,50f21d3d,1357785277207001,d],
[success,50f21d3d,1357785277207001,d]]
}


Decoding the column timestamps shows that the columns were written at
Thu, 10 Jan 2013 02:34:37 GMT and that their TTL expired at Sun, 13 Jan
2013 02:34:37 GMT.  The date of the SSTable shows that it was generated on
Jan 16 which is 3 days after all columns have TTL-ed out.
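
(For anyone wanting to reproduce the decoding: the column timestamps above are
microseconds since the epoch, so on a GNU/Linux box something like this works.)

$ date -u -d @$(( 1357785277207001 / 1000000 ))
Thu Jan 10 02:34:37 UTC 2013                      # write time
$ date -u -d @$(( 1357785277207001 / 1000000 + 259200 ))
Sun Jan 13 02:34:37 UTC 2013                      # + 72 h TTL = expiry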


The schema shows that gc_grace is set to 0 since this data is write-once,
read-seldom and is never updated or deleted.

create column family request_summary
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 0
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and bloom_filter_fp_chance = 1.0
  and compression_options = {'chunk_length_kb' : '64',
'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};


Thanks in advance for help in understanding why rows such as this are not
removed!

-Bryan


Re: LCS not removing rows with all TTL expired columns

2013-01-16 Thread Bryan Talbot
According to the timestamps (see original post) the SSTable was written
(thus compacted) 3 days after all columns for that row had
expired and 6 days after the row was created; yet all columns are still
showing up in the SSTable.  Note that a get for that key shows no rows,
so that part is working correctly, but the data is
lugged around far longer than it should be -- maybe forever.


-Bryan


On Wed, Jan 16, 2013 at 5:44 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 To get column removed you have to meet two requirements
 1. column should be expired
 2. after that CF gets compacted

 I guess your expired columns are propagated to high tier CF, which gets
 compacted rarely.
 So, you have to wait when high tier CF gets compacted.

 Andrey



 On Wed, Jan 16, 2013 at 11:39 AM, Bryan Talbot btal...@aeriagames.comwrote:

 On cassandra 1.1.5 with a write heavy workload, we're having problems
 getting rows to be compacted away (removed) even though all columns have
 expired TTL.  We've tried size tiered and now leveled and are seeing the
 same symptom: the data stays around essentially forever.

 Currently we write all columns with a TTL of 72 hours (259200 seconds)
 and expect to add 10 GB of data to this CF per day per node.  Each node
 currently has 73 GB for the affected CF and shows no indications that old
 rows will be removed on their own.

 Why aren't rows being removed?  Below is some data from a sample row
 which should have been removed several days ago but is still around even
 though it has been involved in numerous compactions since being expired.

 $ ./bin/nodetool -h localhost getsstables metrics request_summary
 459fb460-5ace-11e2-9b92-11d67b6163b4

 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

 $ ls -alF
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 -rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

 $ ./bin/sstable2json
 /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
 -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 %x')
 {
 34353966623436302d356163652d313165322d396239322d313164363762363136336234:
 [[app_name,50f21d3d,1357785277207001,d],
 [client_ip,50f21d3d,1357785277207001,d],
 [client_req_id,50f21d3d,1357785277207001,d],
 [mysql_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_duration_us,50f21d3d,1357785277207001,d],
 [mysql_failure_call_cnt,50f21d3d,1357785277207001,d],
 [mysql_success_call_cnt,50f21d3d,1357785277207001,d],
 [req_duration_us,50f21d3d,1357785277207001,d],
 [req_finish_time_us,50f21d3d,1357785277207001,d],
 [req_method,50f21d3d,1357785277207001,d],
 [req_service,50f21d3d,1357785277207001,d],
 [req_start_time_us,50f21d3d,1357785277207001,d],
 [success,50f21d3d,1357785277207001,d]]
 }


 Decoding the column timestamps shows that the columns were written at
 Thu, 10 Jan 2013 02:34:37 GMT and that their TTL expired at Sun, 13 Jan
 2013 02:34:37 GMT.  The date of the SSTable shows that it was generated on
 Jan 16 which is 3 days after all columns have TTL-ed out.


 The schema shows that gc_grace is set to 0 since this data is write-once,
 read-seldom and is never updated or deleted.

 create column family request_summary
   with column_type = 'Standard'
   and comparator = 'UTF8Type'
   and default_validation_class = 'UTF8Type'
   and key_validation_class = 'UTF8Type'
   and read_repair_chance = 0.1
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = true
   and compaction_strategy =
 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
   and caching = 'NONE'
   and bloom_filter_fp_chance = 1.0
   and compression_options = {'chunk_length_kb' : '64',
 'sstable_compression' :
 'org.apache.cassandra.io.compress.SnappyCompressor'};


 Thanks in advance for help in understanding why rows such as this are not
 removed!

 -Bryan





Re: State of Cassandra and Java 7

2012-12-21 Thread Bryan Talbot
Brian, did any of your issues with java 7 result in corrupting data in
cassandra?

We just ran into an issue after upgrading a test cluster from Cassandra
1.1.5 and Oracle JDK 1.6.0_29-b11 to Cassandra 1.1.7 and 7u10.

What we saw is values in columns with validation
Class=org.apache.cassandra.db.marshal.LongType that were proper integers
becoming corrupted so that they ended up stored as strings.  I don't have
a reproducible test case yet but will work on making one over the holiday
if I can.

For example, a column with a long type that was originally written and
stored properly (say with value 1200) was somehow changed during cassandra
operations (compaction seems the only possibility) to be the value '1200'
with quotes.

The data was written using the phpcassa library and that application and
library haven't been changed.  This has only happened on our test cluster
which was upgraded and hasn't happened on our live cluster which was not
upgraded.  Many of our column families were affected and all affected
columns are Long (or bigint for cql3).

Errors when reading using CQL3 command client look like this:

Failed to decode value '1356441225' (for column 'expires') as bigint:
unpack requires a string argument of length 8

and when reading with cassandra-cli the error is

[default@cf] get
token['fbc1e9f7cc2c0c2fa186138ed28e5f691613409c0bcff648c651ab1f79f9600b'];
=> (column=client_id, value=8ec4c29de726ad4db3f89a44cb07909c04f90932d,
timestamp=1355836425784329, ttl=648000)
A long is exactly 8 bytes: 10




-Bryan





On Mon, Dec 17, 2012 at 7:33 AM, Brian Tarbox tar...@cabotresearch.comwrote:

 I was using jre-7u9-linux-x64  which was the latest at the time.

 I'll confess that I did not file any bugs...at the time the advice from
 both the Cassandra and Zookeeper lists was to stay away from Java 7 (and my
 boss had had enough of my reporting that *the problem was Java 7* for
 me to spend a lot more time getting the details).

 Brian


 On Sun, Dec 16, 2012 at 4:54 AM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Sat, Dec 15, 2012 at 7:12 PM, Michael Kjellman 
 mkjell...@barracuda.com wrote:

 What issues have you ran into? Actually curious because we push
 1.1.5-7 really hard and have no issues whatsoever.


 A related question is which which version of java 7 did you try? The
 first releases of java 7 were apparently famous for having many issues but
 it seems the more recent updates are much more stable.

 --
 Sylvain


 On Dec 15, 2012, at 7:51 AM, Brian Tarbox tar...@cabotresearch.com
 wrote:

 We've reverted all machines back to Java 6 after running into numerous
 Java 7 issues...some running Cassandra, some running Zookeeper, others just
 general problems.  I don't recall any other major language release being
 such a mess.


 On Fri, Dec 14, 2012 at 5:07 PM, Bill de hÓra b...@dehora.net wrote:

 At least that would be one way of defining officially supported.

 Not quite, because, Datastax is not Apache Cassandra.

 the only issue related to Java 7 that I know of is CASSANDRA-4958, but
 that's osx specific (I wouldn't advise using osx in production anyway) and
 it's not directly related to Cassandra anyway so you can easily use the
 beta version of snappy-java as a workaround if you want to. So that non
 blocking issue aside, and as far as we know, Cassandra supports Java 7. Is
 it rock-solid in production? Well, only repeated use in production can
 tell, and that's not really in the hand of the project.

 Exactly right. If enough people use Cassandra on Java7 and enough
 people file bugs about Java 7 and enough people work on bugs for Java 7
 then Cassandra will eventually work well enough on Java7.

 Bill

 On 14 Dec 2012, at 19:43, Drew Kutcharian d...@venarc.com wrote:

  In addition, the DataStax official documentation states: Versions
 earlier than 1.6.0_19 should not be used. Java 7 is not recommended.
 
  http://www.datastax.com/docs/1.1/install/install_rpm
 
 
 
  On Dec 14, 2012, at 9:42 AM, Aaron Turner synfina...@gmail.com
 wrote:
 
  Does Datastax (or any other company) support Cassandra under Java 7?
  Or will they tell you to downgrade when you have some problem,
 because
  they don't support C* running on 7?
 
  At least that would be one way of defining officially supported.
 
  On Fri, Dec 14, 2012 at 2:22 AM, Sylvain Lebresne 
 sylv...@datastax.com wrote:
  What kind of official statement do you want? As far as I can be
 considered
  an official voice of the project, my statement is: various people
 run in
  production with Java 7 and it seems to work.
 
  Or to answer the initial question, the only issue related to Java 7
 that I
  know of is CASSANDRA-4958, but that's osx specific (I wouldn't
 advise using
  osx in production anyway) and it's not directly related to
 Cassandra anyway
  so you can easily use the beta version of snappy-java as a
 workaround if you
  want to. So that non blocking issue aside, and as far as we know,
 Cassandra
  supports Java 7. Is it rock-solid 

Re: CQL timestamps and timezones

2012-12-07 Thread Bryan Talbot
With 1.1.5, the TS is displayed with the local timezone and seems correct.

cqlsh:bat create table test (id uuid primary key, ts timestamp );
cqlsh:bat insert into test (id,ts) values (
'89d09c88-40ac-11e2-a1e2-6067201fae78',  '2012-12-07T10:00:00-');
cqlsh:bat select * from test;
 id   | ts
--+--
 89d09c88-40ac-11e2-a1e2-6067201fae78 | 2012-12-07 02:00:00-0800

cqlsh:bat


-Bryan


On Fri, Dec 7, 2012 at 1:14 PM, B. Todd Burruss bto...@gmail.com wrote:

 trying to figure out if i'm doing something wrong or a bug.  i am
 creating a simple schema, inserting a timestamp using ISO8601 format,
 but when retrieving the timestamp, the timezone is displayed
 incorrectly.  i'm inserting using GMT, the result is shown with
 +, but the time is for my local timezone (-0800)

 tried with 1.1.6 (DSE 2.2.1), and 1.2.0-rc1-SNAPSHOT

 here's the trace:

 bin/cqlsh
 Connected to Test Cluster at localhost:9160.
 [cqlsh 2.3.0 | Cassandra 1.2.0-rc1-SNAPSHOT | CQL spec 3.0.0 | Thrift
 protocol 19.35.0]
 Use HELP for help.
 cqlsh CREATE KEYSPACE btoddb WITH replication =
 {'class':'SimpleStrategy', 'replication_factor':1};
 cqlsh
 cqlsh USE btoddb;
 cqlsh:btoddb CREATE TABLE test (
   ...   id uuid PRIMARY KEY,
   ...   ts TIMESTAMP
   ... );
 cqlsh:btoddb
 cqlsh:btoddb INSERT INTO test
   ...   (id, ts)
   ...   values (
   ... '89d09c88-40ac-11e2-a1e2-6067201fae78',
   ... '2012-12-07T10:00:00-'
   ...   );
 cqlsh:btoddb
 cqlsh:btoddb SELECT * FROM test;

  id   | ts
 --+--
  89d09c88-40ac-11e2-a1e2-6067201fae78 | 2012-12-07 02:00:00+

 cqlsh:btoddb



Re: need some help with row cache

2012-11-28 Thread Bryan Talbot
The row cache itself is global and the size is set with
row_cache_size_in_mb.  It must be enabled per CF using the proper
settings.  CQL3 isn't complete yet in C* 1.1 so if the cache settings
aren't shown there, then you'll probably need to use cassandra-cli.

-Bryan
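
A sketch of the cassandra-cli equivalent (the keyspace and CF names here are
placeholders, not ones from this thread):

  $ cassandra-cli -h localhost
  [default@unknown] use myks;
  [default@myks] show schema;      # the generated CREATE statements include the caching attribute
  [default@myks] update column family mycf with caching = 'rows_only';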


On Tue, Nov 27, 2012 at 10:41 PM, Wz1975 wz1...@yahoo.com wrote:
 Use cassandra-cli.


 Thanks.
 -Wei

 Sent from my Samsung smartphone on ATT


  Original message 
 Subject: Re: need some help with row cache
 From: Yiming Sun yiming@gmail.com
 To: user@cassandra.apache.org
 CC:


 Also, what command can I use to see the caching setting?  DESC TABLE
 cf doesn't list caching at all.  Thanks.

 -- Y.


 On Wed, Nov 28, 2012 at 12:15 AM, Yiming Sun yiming@gmail.com wrote:

 Hi Bryan,

 Thank you very much for this information.  So in other words, the settings
 such as row_cache_size_in_mb in YAML alone are not enough, and I must also
 specify the caching attribute on a per column family basis?

 -- Y.


 On Tue, Nov 27, 2012 at 11:57 PM, Bryan Talbot btal...@aeriagames.com
 wrote:

 On Tue, Nov 27, 2012 at 8:16 PM, Yiming Sun yiming@gmail.com wrote:
  Hello,
 
  but it is not clear to me where this setting belongs to, because even
  in the
  v1.1.6 conf/cassandra.yaml,  there is no such property, and apparently
  adding this property to the yaml causes a fatal configuration error
  upon
  server startup,
 

 It's a per column family setting that can be applied using the CLI or
 CQL.

 With CQL3 it would be

 ALTER TABLE cf WITH caching = 'rows_only';

 to enable the row cache but no key cache for that CF.

 -Bryan





Re: outOfMemory error

2012-11-28 Thread Bryan Talbot
Well, asking for 500MB of data at once for a server with such modest
specs is asking for trouble.  Here are my suggestions.

Disable the 1 GB row cache
Consider allocating that memory for the java heap Xms2500m Xmx2500m
Don't fetch all the columns at once -- page through them a slice at a time
Increase the memtable to more than 64 MB if you want to write data to
this cluster

-Bryan
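
Concretely, the settings named above live here (a sketch; on the Windows
distribution the JVM flags are set in bin/cassandra.bat rather than
conf/cassandra-env.sh, and the values below are just example numbers):

  # conf/cassandra.yaml
  row_cache_size_in_mb: 0            # drop the 1 GB row cache
  memtable_total_space_in_mb: 256    # example value; give writes more room than 64 MB

  # JVM options (cassandra.bat / cassandra-env.sh), per the suggestion above
  -Xms2500M -Xmx2500M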



On Wed, Nov 28, 2012 at 5:06 AM, Damien Lejeune d.leje...@pepite.be wrote:
 Hi all,

 I'm currently experiencing an outOfMemory problem with Cassandra-1.1.6 on
 Windows XP-Pro (32-bit). The server crashes when I try to query it with a
 relatively small amount of data (around 100 rows with 5 columns each: to
 be precise, on my configuration, querying 75 or more rows makes the server
 to crash).
 I tried with different library (Hector, JDBC, Thrift) and with the Cassandra
 stress tool. All lead to the same outOfMemory problem.

 My dataset is composed, for each row, of: 1 column in DateType, 4
 columns in DoubleType. I ran a query to fetch the entire dataset (around
 330MB for the raw data + around 200MB for the metadata) and got the log at
 the end of this message.

 I also checked the heap-dump with Mat which displays these top values:
 Class Name                        Objects     Shallow Heap
 java.nio.HeapByteBuffer           16,253,559  780,170,832
 bytes[]                           16,254,013  330,207,640  -- Data ?
 java.util.TreeMap$Entry            8,126,711  260,054,752
 org.apache.cassandra.db.Column     8,116,589  194,798,136  -- Metadata ?

 I tried to change the configuration in Cassandra for the values:
 - row_cache_size_in_mb: tried different value between [0,1000] MB
 - flush_largest_memtables_at: set to 0.1, but tried with 0.75
 - reduce_cache_sizes_at: tried 0.85, 0.6, 0.2 and 0.1
 - reduce_cache_capacity_to: tried 0.6 and 0.15
 - memtable_total_space_in_mb: 64 MB, but also tried to disable it (- 1/3 of
 the heap)
 - Xms1G
 - Xmx1500M
 with no real observable improvements regarding my problem.

 My Cassandra server and client both run on the same machine.

 Here are the characteristics of my system configuration:
 - Cassandra-1.1.6
 - java version 1.6.0_20
  Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
  Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
 - Windows XP-Pro 32 bits with service pack 3
 - CPU double-core, 32 bits @2.26GHz
 - 3.48 of RAM

 I'm aware that my system configuration is not an optimized environment to
 make Cassandra run efficiently, but I wonder if you guys know a
 workaround (or any idea on how) to fix this problem. Part of the answer is
 probably that I do not have enough RAM to run the process, but I also wonder
 if it is a 'normal' behaviour for Cassandra to handle this particular test
 case that way.

 Cheers,

 Damien

  Cassandra's LOG ---

 Starting Cassandra Server
  INFO 09:10:27,171 Logging initialized
  INFO 09:10:27,171 JVM vendor/version: Java HotSpot(TM) Client VM/1.6.0_18
  INFO 09:10:27,171 Heap size: 1072103424/1569521664
  INFO 09:10:27,171 Classpath:
 

Re: need some help with row cache

2012-11-27 Thread Bryan Talbot
On Tue, Nov 27, 2012 at 8:16 PM, Yiming Sun yiming@gmail.com wrote:
 Hello,

 but it is not clear to me where this setting belongs to, because even in the
 v1.1.6 conf/cassandra.yaml,  there is no such property, and apparently
 adding this property to the yaml causes a fatal configuration error upon
 server startup,


It's a per column family setting that can be applied using the CLI or CQL.

With CQL3 it would be

ALTER TABLE cf WITH caching = 'rows_only';

to enable the row cache but no key cache for that CF.

-Bryan


Re: Admin for cassandra?

2012-11-16 Thread Bryan Talbot
The https://github.com/sebgiroux/Cassandra-Cluster-Admin app does some
of what you're asking.  It allows basic browsing and some admin
functionality.  If you want to run actual CQL queries though, you
currently need to use another app for that (like cqlsh).

-Bryan


On Thu, Nov 15, 2012 at 11:30 PM, Timmy Turner timm.t...@gmail.com wrote:
 I think an eclipse plugin would be the wrong way to go here. Most people
 probably just want to browse through the columnfamilies and see whether
 their queries work out or not. This functionality is imho best implemented
 as some form of a light-weight editor, not a full blown IDE.

 I do have something of this kind scheduled as small part of a larger project
 (seeing as how there is currently no properly working tool that provides
 this functionality), but concrete results are probably still a few months
 out..


 2012/11/16 Edward Capriolo edlinuxg...@gmail.com

 We should build an eclipse plugin named Eclipsandra or something.

 On Thu, Nov 15, 2012 at 9:45 PM, Wz1975 wz1...@yahoo.com wrote:
  Cqlsh is probably the closest you will get. Or pay big bucks to hire
  someone
  to develop one for you:)
 
 
  Thanks.
  -Wei
 
  Sent from my Samsung smartphone on ATT
 
 
   Original message 
  Subject: Admin for cassandra?
  From: Kevin Burton rkevinbur...@charter.net
  To: user@cassandra.apache.org
  CC:
 
 
  Is there an IDE for a Cassandra database? Similar to the SQL Server
  Management Studio for SQL server. I mainly want to execute queries and
  see
  the results. Preferably that runs under a Windows OS.
 
 
 
  Thank you.
 
 




Re: How to upgrade a ring (0.8.9 nodes) to 1.1.5 with the minimal downtime?

2012-11-05 Thread Bryan Talbot
Do a rolling upgrade of the ring to 1.0.12 first and then upgrade to 1.1.x.
 After each rolling upgrade, you should probably do the recommended nodetool
upgradesstables, etc.  The datastax documentation about upgrading might be
helpful for you: http://www.datastax.com/docs/1.1/install/upgrading

-Bryan
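
Per node, each rolling pass looks roughly like this (a sketch; confirm the
details against NEWS.txt for 1.0.12 and then for 1.1.5):

  $ nodetool -h localhost drain        # flush memtables, stop accepting writes
  # stop cassandra, install the next version, merge cassandra.yaml changes
  # start cassandra and wait for the node to rejoin the ring
  $ nodetool -h localhost upgradesstables
  # move on to the next node; repeat the whole pass for the 1.1.5 step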


On Mon, Nov 5, 2012 at 10:55 AM, Yan Wu y...@prospricing.com wrote:

 Hello,

 I have a Cassandra ring with 4 nodes running 0.8.9 and would like to upgrade all
 nodes to 1.1.5.
 It would be great if the upgrade had no downtime, or only minimal downtime,
 for the ring.
 After I brought down one of the nodes and upgraded it to 1.1.5, when I
 tried to bring it up,
 the new 1.1.5 node looked good but the other three 0.8.9 nodes started
 throwing exceptions:
 ---
 Fatal exception in thread Thread[GossipStage:2,5,main]
 java.lang.UnsupportedOperationException: Not a time-based UUID
 at
 org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:92)
 at
 org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:75)
 at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:707)
 at
 org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:750)
 at
 org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:809)
 at
 org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 
 Then later
 
 ERROR 12:03:20,925 Fatal exception in thread Thread[HintedHandoff:1,1,main]
 java.lang.RuntimeException: java.lang.RuntimeException: Could not reach
 schema agreement with /xx.xx.xx.xx in 6ms
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.lang.RuntimeException: Could not reach schema agreement
 with /xx.xx.xx.xx in 6ms
 at
 org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293)
 at
 org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304)
 at
 org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89)
 at
 org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 ... 3 more
 

 Any suggestions?   Thanks in advance.

 Yan



Re: repair, compaction, and tombstone rows

2012-11-05 Thread Bryan Talbot
As the OP of this thread, I can say it's a big itch for my use case.  Repair
ends up streaming tens of gigabytes of data whose TTL has expired and which
has been compacted away on some nodes but not yet on others.  The wasted work
is not nice, plus it drives up the memory usage (for bloom filters, indexes,
etc.) of all nodes since there are many more rows to track than planned.
Disabling the periodic repair lowered the per-node load by 100GB, which was
all dead data in my case.

-Bryan


On Mon, Nov 5, 2012 at 5:12 PM, horschi hors...@gmail.com wrote:



 That's true, we could just create an already gcable tombstone. It's a bit
 of an abuse of the localDeletionTime but why not. Honestly a good part of
 the reason we haven't done anything yet is because we never really had
  anything for which tombstones of expired columns were a big pain point.
 Again, feel free to open a ticket (but what we should do is retrieve the
 ttl from the localExpirationTime when creating the tombstone, not using the
 creation time (partly because that creation time is a user provided
  timestamp so we can't use it, and because we must still keep tombstones if
  the ttl < gcGrace)).


 Created CASSANDRA-4917. I changed the example implementation to use
 (localExpirationTime-timeToLive) for the tombstone. I agree this is not the
 biggest itch to scratch. But it might save a few seeks here and there :-)


 Did you also have a look at DeletedColumn? It uses the updateDigest
  implementation from its parent class, which also applies the value to the
  digest. Unfortunately the value is the localDeletionTime, which is being
  generated on each node individually, right? (at RowMutation.delete)
  The resolution of the time is low, so there is a good chance the
  timestamps will match on all nodes, but that is nothing to rely on.


 cheers,
 Christian







Re: repair, compaction, and tombstone rows

2012-11-01 Thread Bryan Talbot
It seems like CASSANDRA-3442 might be an effective fix for this issue,
assuming that I'm reading it correctly.  It sounds like the intent is to
automatically compact SSTables when a certain percentage of the columns are
gcable because they have been deleted or their tombstones have expired.  Is my
understanding correct?

Would such tables be compacted individually (one-to-one) or are several
eligible tables selected and compacted using the STCS compaction threshold
bounds?

-Bryan


On Thu, Nov 1, 2012 at 9:43 AM, Rob Coli rc...@palominodb.com wrote:

 On Thu, Nov 1, 2012 at 1:43 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:
  on all your columns), you may want to force a compaction (using the
  JMX call forceUserDefinedCompaction()) of that sstable. The goal being
   to get rid of a maximum of outdated tombstones before running the
   repair (you could also alternatively run a major compaction prior to
   the repair, but major compactions have a lot of nasty effects so I
  wouldn't recommend that a priori).

 If sstablesplit (reverse compaction) existed, major compaction would
 be a simple solution to this case. You'd major compact and then split
 your One Giant SSTable With No Tombstones into a number of smaller
 ones. :)

 https://issues.apache.org/jira/browse/CASSANDRA-4766

 =Rob

 --
 =Robert Coli
 AIMGTALK - rc...@palominodb.com
 YAHOO - rcoli.palominob
 SKYPE - rcoli_palominodb



Re: Cassandra upgrade issues...

2012-11-01 Thread Bryan Talbot
Note that 1.0.7 came out before 1.1 and I know there were
some compatibility issues that were fixed in later 1.0.x releases which
could affect your upgrade.  I think it would be best to first upgrade to
the latest 1.0.x release, and then upgrade to 1.1.x from there.

-Bryan



On Thu, Nov 1, 2012 at 1:27 AM, Brian Fleming bigbrianflem...@gmail.comwrote:

 Hi Sylvain,

 Simple as that!!!  Using the 1.1.5 nodetool version works as expected.  My
 mistake.

 Many thanks,

 Brian




 On Thu, Nov 1, 2012 at 8:24 AM, Sylvain Lebresne sylv...@datastax.comwrote:

 The first thing I would check is if nodetool is using the right jar. I
 sounds a lot like if the server has been correctly updated but
 nodetool haven't and still use the old classes.
 Check the nodetool executable, it's a shell script, and try echoing
 the CLASSPATH in there and check it correctly point to what it should.

 --
 Sylvain

 On Thu, Nov 1, 2012 at 9:10 AM, Brian Fleming bigbrianflem...@gmail.com
 wrote:
  Hi,
 
 
 
  I was testing upgrading from Cassandra v.1.0.7 to v.1.1.5 yesterday on a
  single-node dev cluster with ~6.5GB of data and it went smoothly in that
 no
  errors were thrown, the data was migrated to the new directory
 structure, I
  can still read/write data as expected, etc.  However nodetool commands
 are
  behaving strangely – full details below.
 
 
 
  I couldn’t find anything relevant online relating to these exceptions –
 any
  help/pointers would be greatly appreciated.
 
 
 
  Thanks & Regards,
 
 
 
  Brian
 
 
 
 
 
 
 
 
 
  ‘nodetool cleanup’ runs successfully
 
 
 
  ‘nodetool info’ produces :
 
 
 
  Token: 82358484304664259547357526550084691083
 
  Gossip active: true
 
  Load : 7.69 GB
 
  Generation No: 1351697611
 
  Uptime (seconds) : 58387
 
  Heap Memory (MB) : 936.91 / 1928.00
 
  Exception in thread main java.lang.ClassCastException:
 java.lang.String
  cannot be cast to org.apache.cassandra.dht.Token
 
  at
  org.apache.cassandra.tools.NodeProbe.getEndpoint(NodeProbe.java:546)
 
  at
  org.apache.cassandra.tools.NodeProbe.getDataCenter(NodeProbe.java:559)
 
  at
 org.apache.cassandra.tools.NodeCmd.printInfo(NodeCmd.java:313)
 
  at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:651)
 
 
 
  ‘nodetool repair’ produces :
 
  Exception in thread main
 java.lang.reflect.UndeclaredThrowableException
 
  at $Proxy0.forceTableRepair(Unknown Source)
 
  at
 
 org.apache.cassandra.tools.NodeProbe.forceTableRepair(NodeProbe.java:203)
 
  at
  org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:880)
 
  at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:719)
 
  Caused by: javax.management.ReflectionException: Signature mismatch for
  operation forceTableRepair: (java.lang.String, [Ljava.lang.String;)
 should
  be (java.lang.String, boolean, [Ljava.lang.String;)
 
  at
  com.sun.jmx.mbeanserver.PerInterface.noSuchMethod(PerInterface.java:152)
 
  at
  com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:117)
 
  at
  com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 
  at
 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 
  at
  com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 
  at
 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
 
  at
 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 
  at
 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
 
  at
 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
 
  at
 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
  at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 
  at java.lang.reflect.Method.invoke(Method.java:597)
 
  at
  sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
 
  at sun.rmi.transport.Transport$1.run(Transport.java:159)
 
  at java.security.AccessController.doPrivileged(Native Method)
 
  at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
 
  at
  sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
 
  at
 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
 
  at
 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
 
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
  at
 
 

repair, compaction, and tombstone rows

2012-10-31 Thread Bryan Talbot
I've been experiencing a behavior that is undesirable and it seems like a
bug that causes a high amount of wasted work.

I have a CF where all columns have a TTL, are generally all inserted in a
very short period of time (less than a second) and are never over-written
or explicitly deleted.  Eventually one node will run a compaction and
remove rows containing only tombstones greater than gc_grace_seconds old
which is expected.

The problem comes up when a repair is run.  During the repair the other
nodes that haven't run a compaction and still have the tombstoned rows
fix the inconsistency and stream the rows (which contain only a tombstone
which is more than gc_grace_seconds old) back to the node which had
compacted that row away.  This ends up occurring over and over and uses a
lot of time, storage, and bandwidth to keep repairing rows that are
intentionally missing.

I think the issue stems from the behavior of compaction of TTL rows and
repair.  The compaction of TTL rows is a node-local event which will
eventually cause tombstoned rows to disappear from the one node doing the
compaction and then get repaired from replicas later.  I guess this could
happen for rows which are explicitly deleted as well.

Is this a feature or a bug?  How can I avoid repair of rows that were
correctly removed via compaction from one node but not from replicas just
because compactions run independently on each node?  Every repair ends up
streaming tens of gigabytes of missing rows to and from replicas.

Cassandra 1.1.5 with size tiered compaction strategy and RF=3

-Bryan


Re: constant CMS GC using CPU time

2012-10-25 Thread Bryan Talbot
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.comwrote:

  This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


 Yes, I think that was at least part of the issue.


 The rough numbers I've used to estimate working set are:

 * bloom filter size for 400M rows at 0.00074 fp without java fudge (they
 are just a big array) 714 MB
 * memtable size 1024 MB
 * index sampling:
 *  24 bytes + key (16 bytes for UUID) = 32 bytes
  * 400M / 128 default sampling = 3,125,000
 *  3,125,000 * 32 = 95 MB
  * java fudge X5 or X10 = 475MB to 950MB
 * ignoring row cache and key cache

 So the high side number is 2213 to 2,688. High because the fudge is a
 delicious sticky guess and the memtable space would rarely be full.

 On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some
 goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on
 the working set and how quickly stuff gets tenured which depends on the
 workload.


These values seem reasonable and in line with what I was seeing.  There are
other CF and apps sharing this cluster but this one was the largest.





 You can confirm these guesses somewhat manually by enabling all the GC
 logging in cassandra-env.sh. Restart the node and let it operate normally,
 probably best to keep repair off.



I was using jstat to monitor gc activity and some snippets from that are in
my original email in this thread.  The key behavior was that full gc was
running pretty often and never able to reclaim much (if any) space.





 There are a few things you could try:

 * increase the JVM heap by say 1Gb and see how it goes
 * increase bloom filter false positive,  try 0.1 first (see
 http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance
 )
 * increase index_interval sampling in yaml.
 * decreasing compaction_throughput and in_memory_compaction_limit can
 lesson the additional memory pressure compaction adds.
 * disable caches or ensure off heap caches are used.


I've done several of these already in addition to changing the app to
reduce the number of rows retained.  How does compaction_throughput relate
to memory usage?  I assumed that was more for IO tuning.  I noticed that
lowering concurrent_compactors to 4 (from default of 8) lowered the memory
used during compactions.  in_memory_compaction_limit_in_mb seems to only be
used for wide rows and this CF didn't have any wider
than in_memory_compaction_limit_in_mb.  My multithreaded_compaction is
still false.
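
For reference, the cassandra.yaml knobs involved are listed below (values here
are the 1.1 defaults except concurrent_compactors, which I lowered):

compaction_throughput_mb_per_sec: 16
concurrent_compactors: 4
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false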




 Watching the gc logs and the cassandra log is a great way to get a feel
 for what works in your situation. Also take note of any scheduled
 processing your app does which may impact things, and look for poorly
 performing queries.

 Finally this book is a good reference on Java GC
 http://amzn.com/0137142528

 For my understanding what was the average row size for the 400 million
 keys ?



The compacted row mean size for the CF is 8815 (as reported by cfstats) but
that comes out to be much larger than the real load per node I was seeing.
 Each node had about 200GB of data for the CF with 4 nodes in the cluster
and RF=3.  At the time, the TTL for all columns was 3 days and
gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and
set gc_grace_seconds to 0 so the number of rows and data dropped to a level
it can handle.
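
Both were simple changes: gc_grace is a per-CF setting (CLI syntax from
memory, 'cf' is a placeholder name):

update column family cf with gc_grace = 0;

while the TTL is just set on every write, e.g. INSERT ... USING TTL 3600 in
CQL.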


-Bryan


Re: constant CMS GC using CPU time

2012-10-24 Thread Bryan Talbot
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:

 On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com
 wrote:
  The nodes with the most data used the most memory.  All nodes are
 affected
  eventually not just one.  The GC was on-going even when the nodes were
 not
  compacting or running a heavy application load -- even when the main app
 was
  paused constant the GC continued.

 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


Yes, I think that was at least part of the issue.




 Do you have heap graphs which show a healthy sawtooth GC cycle which
 then more or less flatlines?



I didn't save any graphs but that is what they would look like.  I was
using jstat to monitor gc activity.

-Bryan


Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
These GC settings are the default (recommended?) settings from
cassandra-env.  I added the UseCompressedOops.

-Bryan


On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO w...@voodoolunchbox.comwrote:

  On 10/22/2012 09:05 PM, aaron morton wrote:

  # GC tuning options
 JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
 JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
 JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
 JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
 JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=1
  JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
 JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
 JVM_OPTS=$JVM_OPTS -XX:+UseCompressedOops

  You are too far behind the reference JVMs. Parallel GC is the preferred
 and highest performing form in the current Security Baseline version of the
 JVMs.






Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.comwrote:

 The GC was on-going even when the nodes were not compacting or running a
 heavy application load -- even when the main app was paused constant the GC
 continued.

 If you restart a node is the onset of GC activity correlated to some event?


Yes and no.  When the nodes were generally under the
.75 occupancy threshold a weekly repair -pr job would cause them to go
over the threshold and then stay there even after the repair had completed
and there were no ongoing compactions.  It acts as though some substantial
amount of memory used during repair was never dereferenced once the repair
was complete.

Once one CF in particular grew larger the constant GC would start up pretty
soon (less than 90 minutes) after a node restart even without a repair.






 As a test we dropped the largest CF and the memory
 usage immediately dropped to acceptable levels and the constant GC stopped.
  So it's definitely related to data load.  memtable size is 1 GB, row cache
 is disabled and key cache is small (default).

 How many keys did the CF have per node?
 I dismissed the memory used to  hold bloom filters and index sampling.
 That memory is not considered part of the memtable size, and will end up in
 the tenured heap. It is generally only a problem with very large key counts
 per node.


I've changed the app to retain less data for that CF but I think that it
was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
are write-once, never updated, and rarely read.  There are no secondary
indexes for this particular CF.




  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
 the default bloom_filter_fp_chance defaults to 0.0

 The default should be 0.000744.

 If the chance is zero or null this code should run when a new SSTable is
 written
   // paranoia -- we've had bugs in the thrift - avro - CfDef dance
 before, let's not let that break things
 logger.error(Bloom filter FP chance of zero isn't
 supposed to happen);

 Were the CF's migrated from an old version ?


Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
1.1.5 with a upgradesstables run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance settings
when they are at a default value.  JMX reports the actual fp rate and if a
specific rate is set for a CF that shows up in describe table but I
couldn't find out how to tell what the default was.  I didn't inspect the
source.



 Is there any way to predict how much memory the bloom filters will consume
 if the size of the row keys, number or rows is known, and fp chance is
 known?


 See o.a.c.utils.BloomFilter.getFilter() in the code
 This http://hur.st/bloomfilter appears to give similar results.




Ahh, very helpful.  This indicates that 714MB would be used for the bloom
filter for that one CF.
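
For anyone who wants to check the math, the standard bloom filter sizing
formula gives about the same number (a back-of-the-envelope figure that
ignores per-SSTable overhead):

bits = -n * ln(p) / (ln 2)^2
     = 400,000,000 * 7.20 / 0.48
     ~= 6.0e9 bits ~= 750 MB (~715 MiB)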

JMX / cfstats reports Bloom Filter Space Used but the MBean method name
(getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
on-disk and in-memory space used is similar then summing up all the Bloom
Filter Space Used says they're currently consuming 1-2 GB of the heap
which is substantial.

If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0?  It
just means more trips to SSTable indexes for a read, correct?  Trading RAM for
time (disk I/O).

-Bryan


Re: constant CMS GC using CPU time

2012-10-22 Thread Bryan Talbot
The memory usage was correlated with the size of the data set.  The nodes
were a bit unbalanced which is normal due to variations in compactions.
 The nodes with the most data used the most memory.  All nodes are affected
eventually not just one.  The GC was on-going even when the nodes were not
compacting or running a heavy application load -- even when the main app
was paused constant the GC continued.

As a test we dropped the largest CF and the memory
usage immediately dropped to acceptable levels and the constant GC stopped.
 So it's definitely related to data load.  memtable size is 1 GB, row cache
is disabled and key cache is small (default).

I believe one culprit turned out to be the bloom filters.  They were 2+ GB
(as reported by nodetool cfstats anyway).  It looks like
bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as
the minimum value.  Raising that to 0.20 for some write-mostly CFs reduced
the memory used by 1GB or so.
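
For anyone else trying this: the setting is per CF and only applies to newly
written SSTables, so existing files have to be rewritten before the memory
comes back.  Roughly (CLI syntax from memory, 'ks' and 'cf' are placeholders):

update column family cf with bloom_filter_fp_chance = 0.2;
$ nodetool upgradesstables ks cf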

Is there any way to predict how much memory the bloom filters will consume
if the size of the row keys, number or rows is known, and fp chance is
known?

-Bryan



On Mon, Oct 22, 2012 at 12:25 AM, aaron morton aa...@thelastpickle.comwrote:

 If you are using the default settings I would try to correlate the GC
 activity with some application activity before tweaking.

 If this is happening on one machine out of 4 ensure that client load is
 distributed evenly.

 See if the rise in GC activity is related to Compaction, repair or an
 increase in throughput. OpsCentre or some other monitoring can help with
 the last one. Your mention of TTL makes me think compaction may be doing a
 bit of work churning through rows.

 Some things I've done in the past before looking at heap settings:
 * reduce compaction_throughput to reduce the memory churn
 * reduce in_memory_compaction_limit
 * if needed reduce concurrent_compactors

 Currently it seems like the memory used scales with the amount of bytes
 stored and not with how busy the server actually is.  That's not such a
 good thing.

 The memtable_total_space_in_mb in yaml tells C* how much memory to devote
 to the memtables. That with the global row cache setting says how much
 memory will be used with regard to storing data and it will not increase
 in line with the static data load.

 Nowadays GC issues are typically due to more dynamic forces, like
 compaction, repair and throughput.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote:

 ok, let me try asking the question a different way ...

 How does cassandra use memory and how can I plan how much is needed?  I
 have a 1 GB memtable and 5 GB total heap and that's still not enough even
 though the number of concurrent connections and garbage generation rate is
 fairly low.

 If I were using mysql or oracle, I could compute how much memory could be
 used by N concurrent connections, how much is allocated for caching, temp
 spaces, etc.  How can I do this for cassandra?  Currently it seems like the
 memory used scales with the amount of bytes stored and not with how busy
 the server actually is.  That's not such a good thing.

 -Bryan



 On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.comwrote:

 In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
 (64-bit), the nodes are often getting stuck in state where CMS
 collections of the old space are constantly running.

 The JVM configuration is using the standard settings in cassandra-env --
 relevant settings are included below.  The max heap is currently set to 5
 GB with 800MB for new size.  I don't believe that the cluster is overly
 busy and seems to be performing well enough other than this issue.  When
 nodes get into this state they never seem to leave it (by freeing up old
 space memory) without restarting cassandra.  They typically enter this
 state while running nodetool repair -pr but once they start doing this,
 restarting them only fixes it for a couple of hours.

 Compactions are completing and are generally not queued up.  All CF are
 using STCS.  The busiest CF consumes about 100GB of space on disk, is write
 heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
 including those used for system keyspace and secondary indexes.  The number
 of SSTables per node currently varies from 185-212.

 Other than frequent log warnings about GCInspector  - Heap is 0.xxx
 full... and StorageService  - Flushing CFS(...) to relieve memory
 pressure there are no other log entries to indicate there is a problem.

 Does the memory needed vary depending on the amount of data stored?  If
 so, how can I predict how much jvm space is needed?  I don't want to make
 the heap too large as that's bad too.  Maybe there's a memory leak related
 to compaction that doesn't allow meta-data to be purged?


 -Bryan


 12 GB of RAM

Re: constant CMS GC using CPU time

2012-10-19 Thread Bryan Talbot
ok, let me try asking the question a different way ...

How does cassandra use memory and how can I plan how much is needed?  I
have a 1 GB memtable and 5 GB total heap and that's still not enough even
though the number of concurrent connections and garbage generation rate is
fairly low.

If I were using mysql or oracle, I could compute how much memory could be
used by N concurrent connections, how much is allocated for caching, temp
spaces, etc.  How can I do this for cassandra?  Currently it seems like the
memory used scales with the amount of bytes stored and not with how busy
the server actually is.  That's not such a good thing.

-Bryan



On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.comwrote:

 In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
 (64-bit), the nodes are often getting stuck in state where CMS
 collections of the old space are constantly running.

 The JVM configuration is using the standard settings in cassandra-env --
 relevant settings are included below.  The max heap is currently set to 5
 GB with 800MB for new size.  I don't believe that the cluster is overly
 busy and seems to be performing well enough other than this issue.  When
 nodes get into this state they never seem to leave it (by freeing up old
 space memory) without restarting cassandra.  They typically enter this
 state while running nodetool repair -pr but once they start doing this,
 restarting them only fixes it for a couple of hours.

 Compactions are completing and are generally not queued up.  All CF are
 using STCS.  The busiest CF consumes about 100GB of space on disk, is write
 heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
 including those used for system keyspace and secondary indexes.  The number
 of SSTables per node currently varies from 185-212.

 Other than frequent log warnings about GCInspector  - Heap is 0.xxx
 full... and StorageService  - Flushing CFS(...) to relieve memory
 pressure there are no other log entries to indicate there is a problem.

 Does the memory needed vary depending on the amount of data stored?  If
 so, how can I predict how much jvm space is needed?  I don't want to make
 the heap too large as that's bad too.  Maybe there's a memory leak related
 to compaction that doesn't allow meta-data to be purged?


 -Bryan


 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
 cache.
 $ free -m
  total   used   free sharedbuffers cached
 Mem: 12001  11870131  0  4   5778
 -/+ buffers/cache:   6087   5914
 Swap:0  0  0


 jvm settings in cassandra-env
 MAX_HEAP_SIZE=5G
 HEAP_NEWSIZE=800M

 # GC tuning options
 JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
 JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
 JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
 JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
 JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=1
 JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
 JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
 JVM_OPTS=$JVM_OPTS -XX:+UseCompressedOops


 jstat shows about 12 full collections per minute with old heap usage
 constantly over 75% so CMS is always over the
 CMSInitiatingOccupancyFraction threshold.

 $ jstat -gcutil -t 22917 5000 4
 Timestamp S0 S1 E  O  P YGC YGCTFGC
  FGCT GCT
132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
 3078.941 3585.829
132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
 3079.220 3586.107
132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
 3079.583 3586.515
132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
 3079.853 3586.785


 Other hosts not currently experiencing the high CPU load have a heap less
 than .75 full.

 $ jstat -gcutil -t 6063 5000 4
 Timestamp S0 S1 E  O  P YGC YGCTFGC
  FGCT GCT
520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588






constant CMS GC using CPU time

2012-10-18 Thread Bryan Talbot
In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
(64-bit), the nodes are often getting stuck in a state where CMS
collections of the old space are constantly running.

The JVM configuration is using the standard settings in cassandra-env --
relevant settings are included below.  The max heap is currently set to 5
GB with 800MB for new size.  I don't believe that the cluster is overly
busy and seems to be performing well enough other than this issue.  When
nodes get into this state they never seem to leave it (by freeing up old
space memory) without restarting cassandra.  They typically enter this
state while running nodetool repair -pr but once they start doing this,
restarting them only fixes it for a couple of hours.

Compactions are completing and are generally not queued up.  All CF are
using STCS.  The busiest CF consumes about 100GB of space on disk, is write
heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
including those used for system keyspace and secondary indexes.  The number
of SSTables per node currently varies from 185-212.

Other than frequent log warnings about GCInspector  - Heap is 0.xxx full...
and StorageService  - Flushing CFS(...) to relieve memory pressure there
are no other log entries to indicate there is a problem.

Does the memory needed vary depending on the amount of data stored?  If so,
how can I predict how much jvm space is needed?  I don't want to make the
heap too large as that's bad too.  Maybe there's a memory leak related to
compaction that doesn't allow meta-data to be purged?


-Bryan


12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
cache.
$ free -m
 total   used   free sharedbuffers cached
Mem: 12001  11870131  0  4   5778
-/+ buffers/cache:   6087   5914
Swap:0  0  0


jvm settings in cassandra-env
MAX_HEAP_SIZE=5G
HEAP_NEWSIZE=800M

# GC tuning options
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=1
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseCompressedOops


jstat shows about 12 full collections per minute with old heap usage
constantly over 75% so CMS is always over the
CMSInitiatingOccupancyFraction threshold.

$ jstat -gcutil -t 22917 5000 4
Timestamp S0 S1 E  O  P YGC YGCTFGC
 FGCT GCT
   132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
3078.941 3585.829
   132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
3079.220 3586.107
   132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
3079.583 3586.515
   132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
3079.853 3586.785


Other hosts not currently experiencing the high CPU load have a heap less
than .75 full.

$ jstat -gcutil -t 6063 5000 4
Timestamp S0 S1 E  O  P YGC YGCTFGC
 FGCT GCT
   520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
2130.779 3819.588
   520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
2130.779 3819.588
   520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
2130.779 3819.588
   520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
2130.779 3819.588


Re: hadoop consistency level

2012-10-18 Thread Bryan Talbot
I believe that reading with CL.ONE will still cause read repair to be run
(in the background) 'read_repair_chance' of the time.
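
If that background repair isn't wanted for a Hadoop-only CF, it can be turned
down per CF (CLI syntax from memory, 'cf' is a placeholder):

update column family cf with read_repair_chance = 0;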

-Bryan


On Thu, Oct 18, 2012 at 1:52 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman
 mkjell...@barracuda.com wrote:
  Not sure I understand your question (if there is one..)
 
  You are more than welcome to do CL ONE and assuming you have hadoop nodes
  in the right places on your ring things could work out very nicely. If
 you
  need to guarantee that you have all the data in your job then you'll need
  to use QUORUM.
 
  If you don't specify a CL in your job config it will default to ONE (at
  least that's what my read of the ConfigHelper source for 1.1.6 shows)
 
 I have two questions.
 1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
 it correct?
 2. With CL QUORUM cassandra reads data from all replicas. In this case
 Hadoop doesn't give me any  benefits. Application running outside the
 cluster has the same performance. Is it correct?

 Thank you,
   Andrey



Re: MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards

2012-10-08 Thread Bryan Talbot
I'm attempting to plot how busy the node is doing compactions but there
seems to only be a few metrics reported that might be suitable:
CompletedTasks, PendingTasks, TotalBytesCompacted,
TotalCompactionsCompleted.

It's not clear to me what the difference between CompletedTasks and
TotalCompactionsCompleted is, but I am plotting TotalCompactionsCompleted /
sec as one metric; however, this rate is nearly always less than 1 and
doesn't capture how many resources are used doing the compaction.  A
compaction of the 4 smallest SSTables counts the same as a compaction of the 4
largest SSTables but the cost is hugely different.  Thus, I'm also plotting
TotalBytesCompacted / sec.
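
The rate is just the delta between consecutive samples divided by the polling
interval:

bytes_per_sec = (TotalBytesCompacted[t] - TotalBytesCompacted[t - 60s]) / 60

which is why a counter that moves backwards shows up as a bogus negative rate.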

Since the TotalBytesCompacted value sometimes moves backwards I'm not
confident that it's reporting what it is meant to report.  The code and
comments indicate that it should only be incremented by the final size of
the newly created SSTable or by the bytes-compacted-so-far for a larger
compaction, so I don't see why it should be reasonable for it to sometimes
decrease.

How should the impact of compaction be measured if not by bytes compacted?

-Bryan


On Sun, Oct 7, 2012 at 7:39 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 I have not looked at this JMX object in a while, however the
 compaction manager can support multiple threads. Also it moves from
 0-filesize each time it has to compact a set of files.

 That is more useful for showing current progress rather than lifetime
 history.



 On Fri, Oct 5, 2012 at 7:27 PM, Bryan Talbot btal...@aeriagames.com
 wrote:
  I've recently added compaction rate (in bytes / second) to my monitors
 for
  cassandra and am seeing some odd values.  I wasn't expecting the values
 for
  TotalBytesCompacted to sometimes decrease from one reading to the next.
  It
  seems that the value should be monotonically increasing while a server is
  running -- obviously it would start again at 0 when the server is
 restarted
  or if the counter rolls over (unlikely for a 64 bit long).
 
  Below are two samples taken 60 seconds apart: the value decreased by
  2,954,369,012 between the two readings.
 
  reported_metric=[timestamp:1349476449, status:200,
  request:[mbean:org.apache.cassandra.db:type=CompactionManager,
  attribute:TotalBytesCompacted, type:read], value:7548675470069]
 
  previous_metric=[timestamp:1349476389, status:200,
  request:[mbean:org.apache.cassandra.db:type=CompactionManager,
  attribute:TotalBytesCompacted, type:read], value:7551629839081]
 
 
  I briefly looked at the code for CompactionManager and a few related
 classes
  and don't see anyplace that is performing subtraction explicitly;
 however,
  there are many additions of signed long values that are not validated and
  could conceivably contain a negative value thus causing the
  totalBytesCompacted to decrease.  It's interesting to note that the all
 of
  the differences I've seen so far are more than the overflow value of a
  signed 32 bit value.  The OS (CentOS 5.7) and sun java vm (1.6.0_29) are
  both 64 bit.  JNA is enabled.
 
  Is this expected and normal?  If so, what is the correct interpretation
 of
  this metric?  I'm seeing the negatives values a few times per hour when
  reading it once every 60 seconds.
 
  -Bryan
 






Re: what's the most 1.1 stable version?

2012-10-05 Thread Bryan Talbot
We've been using 1.1.5 for a few weeks now and it's been stable for our
uses.  Also, make sure you upgrade to a more recent version of the 1.0 branch
before going to 1.1.  Version 1.0.7 was released before 1.1 and there were
upgrade-path fixes applied to 1.0 after that.  Our upgrade path was 1.0.9
-> 1.0.11 -> 1.1.5, which worked well.

-Bryan


On Fri, Oct 5, 2012 at 8:01 AM, Andrey Ilinykh ailin...@gmail.com wrote:

 In 1.1.5 file descriptor leak was fixed. In my case it was critical.
 Nodes went down every several days. But not everyone had this problem.

 Thank you,
   Andrey

 On Fri, Oct 5, 2012 at 7:42 AM, Alexandru Sicoe adsi...@gmail.com wrote:
  Hello,
   We are planning to upgrade from version 1.0.7 to the 1.1 branch. Which
 is
  the stable version that people are using? I see the latest release is
 1.1.5
  but maybe it's not fully wise to use this. Is 1.1.4 the one to use?
 
  Cheers,
  Alex






MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards

2012-10-05 Thread Bryan Talbot
I've recently added compaction rate (in bytes / second) to my monitors for
cassandra and am seeing some odd values.  I wasn't expecting the values for
TotalBytesCompacted to sometimes decrease from one reading to the next.  It
seems that the value should be monotonically increasing while a server is
running -- obviously it would start again at 0 when the server is restarted
or if the counter rolls over (unlikely for a 64 bit long).

Below are two samples taken 60 seconds apart: the value decreased by
2,954,369,012 between the two readings.

reported_metric=[timestamp:1349476449, status:200,
request:[mbean:org.apache.cassandra.db:type=CompactionManager,
attribute:TotalBytesCompacted, type:read], value:7548675470069]

previous_metric=[timestamp:1349476389, status:200,
request:[mbean:org.apache.cassandra.db:type=CompactionManager,
attribute:TotalBytesCompacted, type:read], value:7551629839081]


I briefly looked at the code for CompactionManager and a few related
classes and don't see anyplace that is performing subtraction explicitly;
however, there are many additions of signed long values that are not
validated and could conceivably contain a negative value thus causing the
totalBytesCompacted to decrease.  It's interesting to note that the all of
the differences I've seen so far are more than the overflow value of a
signed 32 bit value.  The OS (CentOS 5.7) and sun java vm (1.6.0_29) are
both 64 bit.  JNA is enabled.

Is this expected and normal?  If so, what is the correct interpretation of
this metric?  I'm seeing negative values a few times per hour when
reading it once every 60 seconds.

-Bryan


is Not a time-based UUID serious?

2012-09-12 Thread Bryan Talbot
I'm testing upgrading a multi-node cluster from 1.0.9 to 1.1.5 and ran into
the error message described here:
https://issues.apache.org/jira/browse/CASSANDRA-4195

What I can't tell is if this is a serious issue or if it can be safely
ignored.

If it is a serious issue, shouldn't the migration guides for 1.1.x require
that upgrades cannot be rolling or that all nodes must be running 1.0.11 or
greater first?


2012-09-11 17:12:46,299 [GossipStage:1] ERROR
org.apache.cassandra.service.AbstractCassandraDaemon  - Fatal exception in
thread Thread[GossipStage:1,5,main]
java.lang.UnsupportedOperationException: Not a time-based UUID
at java.util.UUID.timestamp(UUID.java:308)
at
org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
at
org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:99)
at
org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:83)
at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:806)
at
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:849)
at
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:908)
at
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


-Bryan


Re: is Not a time-based UUID serious?

2012-09-12 Thread Bryan Talbot
To answer my own question: yes, the error is fatal.  This also means that
upgrades to 1.1.x from 1.0.x MUST use 1.0.11 or greater, it seems, in order
to be successful.

My test upgrade from 1.0.9 to 1.1.5 left the cluster in a state that wasn't
able to come to a schema agreement and blocked schema changes.

-Bryan


On Wed, Sep 12, 2012 at 2:42 PM, Bryan Talbot btal...@aeriagames.comwrote:

 I'm testing upgrading a multi-node cluster from 1.0.9 to 1.1.5 and ran
 into the error message described here:
 https://issues.apache.org/jira/browse/CASSANDRA-4195

 What I can't tell is if this is a serious issue or if it can be safely
 ignored.

 If it is a serious issue, shouldn't the migration guides for 1.1.x require
 that upgrades cannot be rolling or that all nodes must be running 1.0.11 or
 greater first?


 2012-09-11 17:12:46,299 [GossipStage:1] ERROR
 org.apache.cassandra.service.AbstractCassandraDaemon  - Fatal exception in
 thread Thread[GossipStage:1,5,main]
 java.lang.UnsupportedOperationException: Not a time-based UUID
 at java.util.UUID.timestamp(UUID.java:308)
 at
 org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
 at
 org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:99)
 at
 org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:83)
 at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:806)
 at
 org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:849)
 at
 org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:908)
 at
 org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)


 -Bryan