Major compaction does not seem to free much disk space if wide rows are used.

2013-05-16 Thread Boris Yen
Hi All,

Sorry for the wide distribution.

Our Cassandra is running on 1.0.10. Recently, we have been facing a weird
situation. We have a column family containing wide rows (each row might
have a few million columns). We delete columns on a daily basis and we also
run a major compaction on the column family every day to free up disk space
(gc_grace is set to 600 seconds).

However, every time we run the major compaction, only 1 or 2GB disk space
is freed. We tried to delete most of the data before running compaction,
however, the result is pretty much the same.

So, we tried to check the source code. It seems that column tombstones can
only be purged when the row key is not present in other sstables. I know a
major compaction should include all sstables; however, in our use case,
columns get inserted rapidly. This makes Cassandra flush the memtables to
disk and create new sstables. The newly created sstables will have the same
keys as the sstables that are being compacted (the compaction takes 2 or 3
hours to finish). My question: could these newly created sstables be the
reason most of the column tombstones are not being purged?

p.s. We also did some other tests. We inserted data to the same CF with the
same wide-row pattern and deleted most of the data. This time we stopped
all the writes to cassandra and did the compaction. The disk usage
decreased dramatically.
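(For reference, a minimal sketch of the flush-then-compact sequence being
discussed, with hypothetical keyspace/column family names; flushing first
makes sure the deleted columns are on disk before the major compaction runs.)

  # hypothetical names: keyspace "MyKS", column family "WideCF"
  nodetool -h 127.0.0.1 flush MyKS WideCF     # push memtables to sstables first
  nodetool -h 127.0.0.1 compact MyKS WideCF   # major compaction of that CF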

Any suggestions, or is this a known issue?

Thanks and Regards,
Boris


Re: Decommissioned node starts to appear from one node (1.0.11)

2013-05-16 Thread Roshan
I found this bug, and it seems it is fixed. But in my situation I can still
see the decommissioned node in the LoadMap attribute in the JMX console.

Might this be the reason why Hector says there are not enough replicas?

Experts, any thoughts?

Thanks.





Re: Decommissioned node starts to appear from one node (1.0.11)

2013-05-16 Thread Alain RODRIGUEZ
I'm not sure I understand you correctly, but if you are dealing with ghost
nodes that you want to remove, I have never seen a node that could resist an
unsafeAssassinateEndpoint.

http://grokbase.com/t/cassandra/user/12b9eaaqq4/remove-crashed-node
http://grokbase.com/t/cassandra/user/133nmsm3hd/removing-old-nodes
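(For reference, a minimal sketch of invoking that operation via jmxterm,
assuming the ghost node's IP is 10.0.0.1 and JMX is listening on the default
port 7199; adjust the names to your environment.)

  # connect jmxterm to a live node, then call the Gossiper MBean
  java -jar jmxterm-uber.jar -l localhost:7199
  > bean org.apache.cassandra.net:type=Gossiper
  > run unsafeAssassinateEndpoint 10.0.0.1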

I hope this helps. I have no clue why this is happening; I am not one of
those experts you asked for ;-).

Alain







Re: (unofficial) Community Poll for Production Operators : Repair

2013-05-16 Thread Alain RODRIGUEZ
@Rob: Thanks for the feedback.

I still have one weird, unexplained behavior around repair. Are counters
supposed to be repaired too? While reading at CL.ONE I can get different
values depending on which node answers, even after a read repair or a full
repair. Shouldn't a repair fix these discrepancies?

The only way I have found to always get the same count is to read the data at
CL.QUORUM, but this is a workaround, since the data itself remains wrong on
some nodes.

Any clue on this?

Alain
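(For reference, a minimal sketch of the kind of per-column-family repair
being discussed, with hypothetical keyspace/CF names; running it on each node
in turn with -pr avoids repairing each range more than once.)

  nodetool repair -pr MyKS my_counter_cf   # repeat on every node in the ring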

2013/5/15 Edward Capriolo edlinuxg...@gmail.com

 http://basho.com/introducing-riak-1-3/

 Introduced Active Anti-Entropy. Riak now has active anti-entropy. In
 distributed systems, inconsistencies can arise between replicas due to
 failure modes, concurrent updates, and physical data loss or corruption.
 Pre-1.3 Riak already had several features for repairing this “entropy”, but
 they all required some form of user intervention. Riak 1.3 introduces
 automatic, self-healing properties that repair entropy on an ongoing basis.


 On Wed, May 15, 2013 at 5:32 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ arodr...@gmail.com
 wrote:
 Rob, I was wondering something. Are you a committer working on improving
 the repair or something similar?

 I am not a committer [1], but I have an active interest in potential
 improvements to the best practices for repair. The specific change
 that I am considering is a modification to the default
 gc_grace_seconds value, which seems picked out of a hat at 10 days. My
 view is that the current implementation of repair has such negative
 performance consequences that I do not believe that holding onto
 tombstones for longer than 10 days could possibly be as bad as the
 fixed cost of running repair once every 10 days. I believe that this
 value is too low for a default (it also does not map cleanly to the
 work week!) and likely should be increased to 14, 21 or 28 days.
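(For illustration, a sketch of what raising the default would look like on a
per-table basis in CQL3, using a hypothetical table name; 28 days =
28 * 86400 = 2419200 seconds.)

  ALTER TABLE my_table WITH gc_grace_seconds = 2419200;  -- 28 days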

 Anyway, if a committer (or any other expert) could give us some feedback on
 our comments (are we doing well or not, whether the things we observe are
 normal or unexplained, what is going to be improved about repair in the
 future...)

 1) you are doing things according to best practice
 2) unfortunately your experience with significantly degraded
 performance, including a blocked go-live due to repair bloat is pretty
 typical
 3) the things you are experiencing are part of the current
 implementation of repair and are also typical, however I do not
 believe they are fully explained [2]
 4) as has been mentioned further down thread, there are discussions
 regarding (and some already committed) improvements to both the
 current repair paradigm and an evolution to a new paradigm

 Thanks to all for the responses so far, please keep them coming! :D

 =Rob
 [1] hence the (unofficial) tag for this thread. I do have minor
 patches accepted to the codebase, but always merged by an actual
 committer. :)
 [2] driftx@#cassandra feels that these things are explained/understood
 by core team, and points to
 https://issues.apache.org/jira/browse/CASSANDRA-5280 as a useful
 approach to minimize same.





vnodes ready for production ?

2013-05-16 Thread Alain RODRIGUEZ
Hi,

Vnodes are a big improvement to Cassandra for us, specifically because our
load fluctuates from week to week, and it is quite annoying to add nodes for
a week or two, move tokens, and then have to remove them and move tokens
again. Even better, if we could automate some of the up-scaling with AWS
alarms, it would be awesome.

We don't use vnodes yet because OpsCenter did not support the feature and
because we need a reliable production setup. Now OpsCenter handles vnodes.

Are the vnodes feature and the tokens-to-vnodes transition safe enough to go
live with vnodes?

What would be the transition process ?

Does someone auto-scale his Cassandra cluster ?

Any advice about vnodes ?
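(For reference, enabling vnodes on a fresh 1.2 cluster is a cassandra.yaml
setting; a minimal sketch, not advice on migrating an existing cluster.)

  # cassandra.yaml
  num_tokens: 256        # enable vnodes; leave initial_token unset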


best practices on EC2 question

2013-05-16 Thread Brian Tarbox
From this list and the NYC* conference it seems that the consensus
configuration of C* on EC2 is to put the data on an ephemeral drive and then
periodically back that drive up to S3, relying on C*'s inherent fault
tolerance to deal with any data loss.

Fine, and we're doing this... but we find that transfer rates from S3 back
to a rebooted server instance are *very* slow... like 15 MB/second, or
roughly a minute per gigabyte.  Calling EC2 support resulted in them saying
sorry, that's how it is.

I'm wondering whether anyone a) has found a faster way to transfer to S3, or
b) skips backups altogether except for huge outages and just lets rebooted
server instances come up empty and repopulate via C*?

An alternative that we had explored for a while was to do a two stage
backup:
1) copy a C* snapshot from the ephemeral drive to an EBS drive
2) do an EBS snapshot to S3.

The idea being that EBS is quite reliable, S3 is still the emergency backup
and copying back from EBS to ephemeral is likely much faster than the 15
MB/sec we get from S3.

Thoughts?

Brian


SSTable size versus read performance

2013-05-16 Thread Keith Wright
Hi all,

I currently have 2 clusters, one running on 1.1.10 using CQL2 and one 
running on 1.2.4 using CQL3 and Vnodes.   The machines in the 1.2.4 cluster are 
expected to have better IO performance as we are going from 1 SSD data disk per 
node in the 1.1 cluster to 3 SSD data disks per node in the 1.2 cluster with 
higher end drives (commit logs are on their own disk shared with the OS).  I am 
doing some stress testing on the 1.2 cluster and have found that although the 
reads / sec as seen from iostat are approximately the same (3K / sec) in both 
clusters, the MB/s read in the new cluster is MUCH higher (7 MB/s in 1.1 as 
compared to 30-50 MB/s in 1.2).  As a result, I am seeing excessive iowait in 
the 1.2 cluster causing high average read times of 30 ms under the same load 
(1.1 cluster sees around 5 ms).  They are both using Leveled compaction but one 
thing I did change in the new cluster was to increase the sstable size from the 
OOTB setting to 32 MB.  Note that my reads are by definition highly random as 
we are running memcached in front for various reasons.  Does cassandra need to 
read the entire SSTable when fetching a row or only the relevant chunk (I have 
the OOTB chunk size and BF settings)?  I just decreased the sstable size to 5 
MB and am waiting for compactions to complete to see if that makes a difference.

Thanks!

Relevant table definition if helpful (note that I also changed to the LZ4 
compressor expecting better read performance, and I decreased the crc check 
chance again to minimize read latency):

CREATE TABLE global_user (
user_id BIGINT,
app_id INT,
type TEXT,
name TEXT,
last TIMESTAMP,
paid BOOLEAN,
values map<TIMESTAMP,FLOAT>,
sku_time map<TEXT,TIMESTAMP>,
extra_param map<TEXT,TEXT>,
PRIMARY KEY (user_id, app_id, type, name)
) with 
compression={'crc_check_chance':0.1,'sstable_compression':'LZ4Compressor'} and
compaction={'class':'LeveledCompactionStrategy'} and
compaction_strategy_options = {'sstable_size_in_mb':5} and
gc_grace_seconds = 86400;


Re: SSTable size versus read performance

2013-05-16 Thread Edward Capriolo
I am not sure whether the new default is to use compression, but I do not
believe compression is a good default. I find compression is better for
larger column families that are sparsely read. For high-throughput CFs I
feel that decompressing larger blocks hurts performance more than
compression adds.





Re: SSTable size versus read performance

2013-05-16 Thread Keith Wright
The biggest reason I'm using compression here is that my data lends itself well 
to it due to the composite columns.  My current compression ratio is 30.5%.  
Not sure it matters, but my BF false positive ratio is 0.048.




Re: SSTable size versus read performance

2013-05-16 Thread Edward Capriolo
When you use compression you should play with your block size. I believe
the default may be 32K, but I had more success with 8K: nearly the same
compression ratio, less young-gen memory pressure.
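(For illustration, a sketch of setting an 8K chunk size on the 1.2/CQL3 table
from this thread; existing sstables are only rewritten as they get compacted,
or via nodetool upgradesstables/scrub.)

  ALTER TABLE global_user WITH
    compression = {'sstable_compression': 'LZ4Compressor',
                   'chunk_length_kb': '8',
                   'crc_check_chance': '0.1'};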







Re: SSTable size versus read performance

2013-05-16 Thread Keith Wright
Does Cassandra need to load the entire SSTable into memory to uncompress it, or 
does it only load the relevant block?  I ask because if it's the latter, that 
would not explain why I'm seeing so much higher read MB/s in the 1.2 cluster, as 
the block sizes are the same in both.





Re: (unofficial) Community Poll for Production Operators : Repair

2013-05-16 Thread Janne Jalkanen

Might you be experiencing this? 
https://issues.apache.org/jira/browse/CASSANDRA-4417

/Janne




Re: best practices on EC2 question

2013-05-16 Thread Janne Jalkanen
On May 16, 2013, at 17:05 , Brian Tarbox tar...@cabotresearch.com wrote:

 An alternative that we had explored for a while was to do a two stage backup:
 1) copy a C* snapshot from the ephemeral drive to an EBS drive
 2) do an EBS snapshot to S3.
 
 The idea being that EBS is quite reliable, S3 is still the emergency backup 
 and copying back from EBS to ephemeral is likely much faster than the 15 
 MB/sec we get from S3.

Yup, this is what we do.  We use rsync with --bwlimit=4000 to copy the 
snapshots from the ephemeral drive to EBS; this is intentionally very low so that 
the backup process does not eat our I/O.  This is on m1.xlarge instances; YMMV, 
so measure :).  The EBS drives are then snapshotted with ec2-consistent-snapshot 
and old snapshots are expired using ec2-expire-snapshots (I believe these scripts 
are from Alestic).
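(For reference, a minimal sketch of that flow with hypothetical paths; the
snapshot/expiry flags are omitted since they depend on the Alestic tool
versions in use.)

  # throttle the copy of the latest C* snapshot from the ephemeral disk to EBS
  rsync -a --bwlimit=4000 /mnt/cassandra/data/MyKS/snapshots/ /ebs/backups/MyKS/
  # then snapshot the EBS volume (e.g. ec2-consistent-snapshot) and expire
  # old snapshots (e.g. ec2-expire-snapshots) on a cron schedule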

/Janne



Re: (unofficial) Community Poll for Production Operators : Repair

2013-05-16 Thread Alain RODRIGUEZ
I indeed had some of those in the past. But my point is not so much to
understand how I can get different counts depending on the node (I consider
this a weakness of counters and I am aware of it); what puzzles me is why
those inconsistent, distinct counters never converge, even after a repair.
Your last comment on that JIRA summarizes our problem quite well.

I hope that the committers will figure something out.


2013/5/16 Janne Jalkanen janne.jalka...@ecyrd.com


 Might you be experiencing this?
 https://issues.apache.org/jira/browse/CASSANDRA-4417

 /Janne








Re: Major compaction does not seem to free much disk space if wide rows are used.

2013-05-16 Thread Louvet, Jacques
Boris,

We hit exactly the same issue, and you are correct: the newly created SSTables 
are the reason most of the column tombstones are not being purged.

There is an improvement in the 1.2 train where both the minimum and maximum 
timestamps for a row are now stored and used during compaction to determine 
whether that portion of the row can be purged.
However, this only appears to help partially; major compaction still has the 
other restriction, where all the files containing the deleted rows must be part 
of the compaction for the row to be purged.

We have switched to column deletes rather than row deletes wherever practical. A 
little more work in the app, but a big improvement in reads due to much more 
efficient compaction.
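(For illustration, a hedged sketch of the two delete forms being contrasted,
in cassandra-cli syntax for 1.0.x, with hypothetical CF/row/column names.)

  del WideCF['row_key']['col_name'];   # column-level delete (column tombstone)
  del WideCF['row_key'];               # row-level delete (row tombstone)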

Regards,
Jacques



Re: SSTable size versus read performance

2013-05-16 Thread Igor
My 5 cents: I'd check blockdev --getra for data drives - too high values 
for readahead (default to 256 for debian) can hurt read performance.
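(For reference, a minimal sketch of checking and lowering read-ahead; /dev/sdb
is a hypothetical data device, and the value is in 512-byte sectors.)

  blockdev --getra /dev/sdb        # current read-ahead, in 512-byte sectors
  blockdev --setra 32 /dev/sdb     # 32 sectors = 16 KB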






Re: SSTable size versus read performance

2013-05-16 Thread Keith Wright
We actually have it set to 512.  I have tried decreasing my SSTable size to 5 
MB and changing the chunk size to 8 kb (and ran an sstableupgrade to ensure they 
took effect) but am still seeing similar performance.  Is anyone running LZ4 
compression in production?  I'm thinking of reverting back to Snappy to see 
if that makes a difference.

I appreciate all of the help!




Re: SSTable size versus read performance

2013-05-16 Thread Bryan Talbot
512 sectors for read-ahead.  Are your new fancy SSD drives using large
sectors?  If your read-ahead is really reading 512 x 4KB per random IO,
then that 2 MB per read seems like a lot of extra overhead.

-Bryan








Re: SSTable size versus read performance

2013-05-16 Thread Edward Capriolo
I was going to say something similar. I feel like the SSD drives read much
more than the standard drives; read-ahead / large sectors could, and probably
does, explain it.






Re: SSTable size versus read performance

2013-05-16 Thread Igor
Just in case it is useful to somebody, here is my checklist for better read 
performance from SSDs (a sketch of the corresponding commands follows the list):

1. limit read-ahead to 16 or 32
2. enable 'trickle_fsync' (available starting from Cassandra 1.1.x)
3. use the 'deadline' io-scheduler (much more important for rotational
   drives than for SSDs)
4. format the data partition starting on a 2048-sector boundary
5. use ext4 with noatime,nodiratime,discard mount options
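(A minimal sketch of items 1-3 and 5, with a hypothetical device /dev/sdb and
mount point /var/lib/cassandra; the cassandra.yaml keys are the standard
trickle_fsync settings.)

  blockdev --setra 32 /dev/sdb                      # 1: limit read-ahead
  echo deadline > /sys/block/sdb/queue/scheduler    # 3: deadline scheduler
  # 5: /etc/fstab entry for the data partition
  #   /dev/sdb1  /var/lib/cassandra  ext4  noatime,nodiratime,discard  0 2
  # 2: in cassandra.yaml
  #   trickle_fsync: true
  #   trickle_fsync_interval_in_kb: 10240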







Re: Major compaction does not seem to free much disk space if wide rows are used.

2013-05-16 Thread Edward Capriolo
This makes sense. Unless you are running a major compaction, a delete can
only be purged if the bloom filters confirm the row is not in the sstables
outside that compaction. If your rows are wide, the odds are that they are in
most or all sstables, so finally removing them is tricky.





Re: SSTable size versus read performance

2013-05-16 Thread Keith Wright
Thank you for that.  I did not have trickle_fsync enabled and will give it a 
try.  I just noticed that when running a describe on my table, I do not see the 
sstable size parameter (compaction_strategy_options = {'sstable_size_in_mb':5}) 
included.  Is that expected?  Does it mean it's using the defaults?

Assuming none of the tuning here makes a noticeable difference, my next step is 
to try switching from LZ4 to Snappy.  Any opinions on that?

Thanks!

CREATE TABLE global_user (
  user_id bigint,
  app_id int,
  type text,
  name text,
  extra_param map<text, text>,
  last timestamp,
  paid boolean,
  sku_time map<text, timestamp>,
  values map<timestamp, float>,
  PRIMARY KEY (user_id, app_id, type, name)
) WITH
  bloom_filter_fp_chance=0.10 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=86400 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'LeveledCompactionStrategy'} AND
  compression={'chunk_length_kb': '8', 'crc_check_chance': '0.1', 
'sstable_compression': 'LZ4Compressor'};
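(A hedged note: in 1.2 CQL3, sstable_size_in_mb is specified inside the
compaction map rather than via a separate compaction_strategy_options clause,
which would explain why describe does not show it; a sketch, worth verifying
against the 1.2 documentation.)

  ALTER TABLE global_user WITH
    compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 5};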






Re: SSTable size versus read performance

2013-05-16 Thread Edward Capriolo
LZ4 is supposed to achieve similar compression while using fewer resources
than Snappy. It is easy to test: just change it, then run a 'nodetool
rebuild'. Not sure when LZ4 was introduced, but given that it is new to
Cassandra there may not be many large deployments running it yet.
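(For illustration, a sketch of switching this table back to Snappy; note that
existing sstables keep their old compression until they are rewritten, e.g.
by compaction or nodetool upgradesstables.)

  ALTER TABLE global_user WITH
    compression = {'sstable_compression': 'SnappyCompressor',
                   'chunk_length_kb': '8',
                   'crc_check_chance': '0.1'};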




Re: Upgrade 1.1.10 - 1.2.4

2013-05-16 Thread Everton Lima
But the problem is that I would like to use Cassandra embedded. Is this not
possible any more?


2013/5/15 Edward Capriolo edlinuxg...@gmail.com


 You are doing something wrong. What I was suggesting is only a hack for
 unit tests. You're not supposed to interact with CassandraServer directly
 like that as a client. Download Hector and use the correct client libraries.

 On Wed, May 15, 2013 at 5:13 PM, Everton Lima peitin.inu...@gmail.comwrote:

 But using this code:

 ThriftSessionManager.instance.setCurrentSocket(new
 InetSocketAddress(9160));

 Will I need to execute this line every time I need to do something in
 Cassandra, like update a column family?

 Thanks for reply.


 2013/5/15 Edward Capriolo edlinuxg...@gmail.com

 If you are using hector it can setup the embedded server properly.

 When using the server directly inside cassandra I have run into a
 similar problem..


 https://github.com/edwardcapriolo/cassandra/blob/range-tombstone-thrift/test/unit/org/apache/cassandra/thrift/EndToEndTest.java

 @BeforeClass
 public static void setup() throws IOException, InvalidRequestException,
 TException{
 Schema.instance.clear(); // Schema are now written on disk and will be
 reloaded
 new EmbeddedCassandraService().start();
 ThriftSessionManager.instance.setCurrentSocket(new
 InetSocketAddress(9160));
 server = new CassandraServer();
 server.set_keyspace("Keyspace1");
 }



 On Wed, May 15, 2013 at 4:24 PM, Everton Lima 
 peitin.inu...@gmail.comwrote:

 Hello, can someone help me to use the CassandraServer object in
 version 1.2.4?
 I was using this in version 1.1.10 and it worked, but something was
 happening that I could not solve (sometimes my CPU went up to 100% and
 stayed there forever), so I decided to do the upgrade.

 I start Cassandra with EmbeddedCassandraService.
 The actual error is:
 when the code calls

 public ThriftClientState currentSession()
 {
 SocketAddress socket = remoteSocket.get();
 assert socket != null;

 ThriftClientState cState = activeSocketSessions.get(socket);
 if (cState == null)
 {
 cState = new ThriftClientState();
 activeSocketSessions.put(socket, cState);
 }
 return cState;
 }

 the variable socket is null. This method is invoked via:

 CassandraServer cs = new CassandraServer();
 cs.describe_keyspace()

 --
 Everton Lima Aleixo
 Bacharel em Ciência da Computação pela UFG
 Mestrando em Ciência da Computação pela UFG
 Programador no LUPA





 --
 Everton Lima Aleixo
 Bacharel em Ciência da Computação pela UFG
 Mestrando em Ciência da Computação pela UFG
 Programador no LUPA





-- 
Everton Lima Aleixo
Bacharel em Ciência da Computação pela UFG
Mestrando em Ciência da Computação pela UFG
Programador no LUPA


Re: Exception when running YCSB and Cassandra

2013-05-16 Thread aaron morton
Your nodes are overloaded. 

I'd recommend using m1.xlarge instead. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 1:59 PM, Rodrigo Felix rodrigofelixdealme...@gmail.com 
wrote:

 Hi,
 
   I'm executing a workload on YCSB (50% read, 50% update) and after a few 
 minutes I get the following exception:
 
 TimedOutException()
   at 
 org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7174)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:540)
   at 
 org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:512)
   at com.yahoo.ycsb.db.CassandraClient10.read(CassandraClient10.java:259)
   at com.yahoo.ycsb.DBWrapper.read(DBWrapper.java:84)
   at 
 com.yahoo.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:469)
   at 
 com.yahoo.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:425)
   at com.yahoo.ycsb.ClientThread.run(ClientThread.java:105) 
 
 I have 2 seeds on Amazon EC2 (large instances) and, depending on demand, 
  I add (or remove) large instances.
 Any suggestions on how to solve this problem or tune Cassandra?
 Further info about the Cassandra installation follows. 
Thanks in advance.
 
 INFO 00:54:05,591 JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
 VM/1.7.0_07
  INFO 00:54:05,592 Heap size: 1931476992/1931476992
 INFO 00:54:07,447 Cassandra version: 1.1.5
  INFO 00:54:07,448 Thrift API version: 19.32.0
 
 Att.
 
 Rodrigo Felix de Almeida
 LSBD - Universidade Federal do Ceará
 Project Manager
 MBA, CSM, CSPO, SCJP



Re:

2013-05-16 Thread aaron morton
Try the IRC room for the java driver or submit a ticket on the JIRA system, see 
the links here https://github.com/datastax/java-driver


Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 5:50 PM, bjbylh bjb...@me.com wrote:

 
  Hello all:
  I use the DataStax java-driver to connect to C*. When the program calls 
  cluster.shutdown(), it prints out:
  java.lang.NoSuchMethodError:org.jboss.netty.channelFactory.shutdown()V.
  but I do not know why...
  C* is 1.2.4, java-driver is 1.0.0.
  Thank you.
 
 Sent from Samsung Mobile



Re: how to access data only on specific node

2013-05-16 Thread aaron morton
Are you using a multi-get or a range slice? 

Read Repair does not run for range slice queries. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 6:51 PM, Sergey Naumov sknau...@gmail.com wrote:

 see that RR works, but sometimes number of records have been read degrades. 
 RR is enabled on a random 10% of requests, see the read_repair_chance setting 
 for the CF. 
 
  OK, but I forgot to mention the main thing - each node in my config is a 
  standalone datacenter and the distribution is DC1:1, DC2:1, DC3:1. So when I try 
  to read 1000 records with consistency ONE multiple times while connected to 
  a node that has just been turned on, I get the following counts of records read 
  (approximately): 120 220 310 390  950 960 965 !! 955 !! 970 ... If all the 
  other nodes contain 1000 records and read repair has already delivered 965 
  records to the local DC (and so the local node), why do I sometimes see the 
  total number of records read go down?
 
 
 
 2013/5/15 aaron morton aa...@thelastpickle.com
 see that RR works, but sometimes number of records have been read degrades. 
 RR is enabled on a random 10% of requests, see the read_repair_chance setting 
 for the CF. 
 
  If so, then the question is: how to perform local reads to examine content 
 of specific node?
 You can check which nodes are replicas for a key using nodetool getendpoints
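
  For example (keyspace, column family, and key are placeholders):

      nodetool -h 127.0.0.1 getendpoints MyKeyspace MyCF mykey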
 
 If you want to read all the rows for a particular row you need to use a range 
 scan and limit it by the token ranges assigned to the node. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 14/05/2013, at 10:29 PM, Sergey Naumov sknau...@gmail.com wrote:
 
 Hello.
 
  I'm playing with a demo Cassandra cluster and decided to test read repair + 
  hinted handoff. 
  
  One node of the cluster was put down deliberately, and on the other nodes I 
  inserted some records (say 1000). HH is off on all nodes.
  Then I turned the node back on, connected to it with cql (locally, so to 
  localhost) and performed 1000 reads by row key (with consistency ONE). I see 
  that RR works, but sometimes the number of records read degrades. Is 
  it because consistency ONE and local reads are not the same thing? If so, 
  then the question is: how do I perform local reads to examine the content of 
  a specific node?
 
 Thanks in advance,
 Sergey Naumov.
 
 



Re: The action of the file system at drop column family execution

2013-05-16 Thread aaron morton
 When drop column family is executed, regardless of whether a snapshot was 
 generated, the $KS/$CF/ directory certainly remains. 
I don't think there is any code there to delete the empty directories. We only 
care about the files in there. 

Cheers


-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 7:41 PM, hiroshi.kise...@hitachi.com wrote:

 
 Dear Aaron Morton
 
 I'm Hiroshi. Thank you for the reply.
  Regarding the $KS/$CF/snapshots directory, namely
  under C:\var\lib\cassandra\data\MyKeyspace\testcf1:
 
 dir command execution:
 2013/05/09  14:04DIR  .
 2013/05/09  14:04DIR  ..
0 File(s)  0  bytes
2 Dir(s)   139,587,530,752 bytes free
 
  No snapshot was generated. 
  It is a repeated question (I am sorry). 
  
  When drop column family is executed, regardless of whether a snapshot was 
  generated, the $KS/$CF/ directory certainly remains. 
  
  Is that the expected behavior? 
 --
 Hiroshi Kise
 
 
 
 
 
 
 
 Date: Wed, 15 May 2013 05:32:50 +0900, aaron morton aa...@thelastpickle.com 
 wrote;
 --- Begin of replied message --
 
  * Although the directory (column family testcf1) remains, is that acceptable 
  at the file-system level? 
 A snapshot is taken when a truncate or drop command is run. You should see a 
 $KS/$CF/snapshots directory. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 15/05/2013, at 12:45 AM, hiroshi.kise...@hitachi.com wrote:
 
 
 Hi everyone.
 Although it may be a stupid question, please give me instruction.
 
 [1.Question]
 A column family is deleted (drop column family testcf1;).
  The result (dir command output) is:
 DIR   testcf1
   0 File(s)   0 bytes  
   3 Dir(s)  139,587,596,288 bytes free
 
  * Is this the correct behavior for Cassandra?
  * Although the directory (column family testcf1) remains, is that acceptable 
  at the file-system level? 
 
 --
 [2. Pre-processing]
 ( Cassandra-CLI was used. )
  First, a keyspace is created (create keyspace MyKeyspace;), 
  then the keyspace is selected (use MyKeyspace;). 
  
  And the column family is created (create column family testcf1;). 
 --
 [3.Environment] 
 Cassandra 1.2.4
 OS: Windows 7
 
 
 Thank you for your consideration. 
 --
 Hiroshi Kise
 
 
  End of replied message ---



Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used

2013-05-16 Thread aaron morton
You should configure the seeds as recommended regardless of the snitch used. 

You need to update the yaml file to start using the GossipingPropertyFileSnitch, 
but after that it reads the cassandra-rackdc.properties file to get information 
about the local node. It uses the information in gossip to get information 
about the other nodes in the cluster. 

If there is no info in gossip about a remote node, because say it has not been 
upgraded, it will fall back to using cassandra-topology.properties. 
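
As an illustration (the values are placeholders), a node using it would have something like this:

    # cassandra.yaml
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties (describes only the local node)
    dc=DC2
    rack=RAC1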

Hope that helps. 
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 8:10 PM, Sergey Naumov sknau...@gmail.com wrote:

 As far as I understand, GossipingPropertyFileSnitch is supposed to provide more 
  flexibility in node addition/removal. But what about the addition of a DC? In 
  the DataStax documentation 
  (http://www.datastax.com/docs/1.2/operations/add_replace_nodes#add-dc) it is 
  said that cassandra-topology.properties can be updated without a restart for 
  PropertyFileSnitch. But here 
  (http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) it is 
  said that you MUST include at least one node from EACH data center. It is a 
  best practice to have more than one seed node per data center, and the seed 
  list should be the same for each node. At first glance it seems that 
  PropertyFileSnitch will get the necessary info from 
  cassandra-topology.properties, but for GossipingPropertyFileSnitch a 
  modification of cassandra.yaml and a restart of all nodes in all DCs will be 
  required. Could somebody clarify this topic?
 
 Thanks in advance,
 Sergey Naumov.



Re: Multiple cursors

2013-05-16 Thread aaron morton
We don't have cursors in the RDBMS sense of things.

If you are using thrift the recommendation is to use connection pooling and 
re-use connections for different requests. Note that you can not multiplex 
queries over the same thrift connection, you must wait for the response before 
issuing another request. The native binary transport allows multiplexing 
though. 

In general you should use one of the pre-built client libraries, as they will 
take care of connection pooling etc. for you: 
https://wiki.apache.org/cassandra/ClientOptions
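
For illustration, configuring a pooled connection with Hector might look roughly like this (host list and pool size are placeholders; other client libraries have equivalent settings):

    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class PooledConnectionExample
    {
        public static void main(String[] args)
        {
            // Hector keeps a pool of Thrift connections per host; each request borrows one
            CassandraHostConfigurator hostConfig = new CassandraHostConfigurator("host1:9160,host2:9160");
            hostConfig.setMaxActive(20); // max pooled connections per host (placeholder value)

            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", hostConfig);
            Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

            // ... issue queries against 'keyspace' from your request handlers ...

            HFactory.shutdownCluster(cluster);
        }
    }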

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 16/05/2013, at 9:03 AM, Sam Mandes eng.salaman...@gmail.com wrote:

 Hello All,
 
 Is using multiple cursors simultaneously on the same C* connection a good 
 practice?
 
  I have an internal API for a project running Thrift, and I then need to query 
  something from C*. I do not want to create a new connection for every API 
  request. So, when my service initially starts, I open a connection to C*, and 
  with every request I create a new cursor.
 
 Thanks a lot



Re: C++ Thrift client

2013-05-16 Thread aaron morton
(Assuming you have enabled tcp_nodelay on the client socket)

Check the server side latency, using nodetool cfstats or nodetool cfhistograms. 

Check the logs for messages from the GCInspector about ParNew pauses.
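
For example, to compare the client-side latency with what the server reports (keyspace and column family names are placeholders):

    nodetool -h 127.0.0.1 cfstats
    nodetool -h 127.0.0.1 cfhistograms Keyspace1 Standard1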

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 16/05/2013, at 12:58 PM, Bill Hastings bllhasti...@gmail.com wrote:

 Hi All
 
  I am doing very small inserts into Cassandra, in the range of say 64
  bytes. I use a C++ Thrift client and seem to consistently get latencies
  anywhere between 35-45 ms. Could someone please advise as to what
  might be happening?
 
 thanks



Re:

2013-05-16 Thread Dave Brosius

what version of netty is on your classpath?
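
If the project is built with Maven, one quick way to check is:

    mvn dependency:tree | grep -i netty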

On 05/16/2013 07:33 PM, aaron morton wrote:
Try the IRC room for the java driver or submit a ticket on the JIRA 
system, see the links here https://github.com/datastax/java-driver



Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/05/2013, at 5:50 PM, bjbylh bjb...@me.com wrote:




Hello all:
I use the DataStax java-driver to connect to C*. When the program calls 
cluster.shutdown(), it prints out:
java.lang.NoSuchMethodError:org.jboss.netty.channelFactory.shutdown()V.

but I do not know why...
C* is 1.2.4, java-driver is 1.0.0.
Thank you.

Sent from Samsung Mobile






Re: Decommission nodes starts to appear from one node (1.0.11)

2013-05-16 Thread Roshan
Thanks. This counts as expert advice to me. 





Re: pycassa failures in large batch cycling

2013-05-16 Thread John R. Frank

On Tue, 14 May 2013, aaron morton wrote:


  After several cycles, pycassa starts getting connection failures.

Do you have the error stack? Are they TimedOutExceptions, socket timeouts, 
or something else?



I figured out the problem here and made this ticket in jira:

   https://issues.apache.org/jira/browse/CASSANDRA-5575


Summary: the Thrift interfaces to Cassandra are simply not able to load 
large batches without putting the client into an infinite retry loop.


Seems that the only robust solutions involve either features added to 
Thrift and all Cassandra clients, or a new interface mechanism.
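
Until something like that exists, the usual client-side mitigation is to keep 
individual batches small. A rough sketch with Hector (column family name and 
chunk size are placeholders; this is not the pycassa code from the ticket):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class ChunkedBatchExample
    {
        // Flush the mutator every 'chunkSize' insertions instead of sending one huge batch
        public static void writeInChunks(Keyspace keyspace, String rowKey, int columns, int chunkSize)
        {
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
            for (int i = 0; i < columns; i++)
            {
                mutator.addInsertion(rowKey, "Standard1",
                        HFactory.createStringColumn("col" + i, "value" + i));
                if ((i + 1) % chunkSize == 0)
                    mutator.execute(); // send this chunk; the mutator can then be reused
            }
            mutator.execute(); // send any remainder
        }
    }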


jrf


Re: Upgrade 1.1.10 - 1.2.4

2013-05-16 Thread Edward Capriolo
Please give an example of the code you are trying to execute.


 On Thu, May 16, 2013 at 6:26 PM, Everton Lima peitin.inu...@gmail.com wrote:

  But the problem is that I would like to use Cassandra embedded. Is this
  not possible any more?




Announcing Mutagen

2013-05-16 Thread Todd Fast
Mutagen Cassandra is a framework providing schema versioning and mutation
for Apache Cassandra. It is similar to Flyway for SQL databases.

https://github.com/toddfast/mutagen-cassandra

Mutagen is a lightweight framework for applying versioned changes (known as
mutations) to a resource, in this case a Cassandra schema. Mutagen takes
into account the resource's existing state and only applies changes that
haven't yet been applied.

Schema mutation with Mutagen helps you make manageable changes to the
schema of live Cassandra instances as you update your client software, and
is especially useful when used across development, test, staging, and
production environments to automatically keep schemas updated.

This is a minimal but functional initial release, and I appreciate bug
reports, suggestions and pull requests.

Best,
Todd