Major compaction does not seem to free much disk space when wide rows are used.
Hi All, Sorry for the wide distribution. Our Cassandra is running on 1.0.10. Recently we have been facing a weird situation. We have a column family containing wide rows (each row might have a few million columns). We delete columns on a daily basis, and we also run a major compaction on the CF every day to free up disk space (gc_grace is set to 600 seconds). However, every time we run the major compaction, only 1 or 2 GB of disk space is freed. We tried deleting most of the data before running the compaction, but the result was pretty much the same.

So we checked the source code. It seems that column tombstones can only be purged when the row key is not present in any other sstable. I know a major compaction should include all sstables; however, in our use case columns are inserted rapidly, which makes Cassandra flush memtables to disk and create new sstables. These newly created sstables will contain the same keys as the sstables being compacted (the compaction takes 2 or 3 hours to finish). My question is: could these newly created sstables be the reason most of the column tombstones are not being purged?

p.s. We also ran another test: we inserted data into the same CF with the same wide-row pattern and deleted most of it, but this time we stopped all writes to Cassandra before running the compaction. The disk usage decreased dramatically.

Any suggestions, or is this a known issue? Thanks and Regards, Boris
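For reference, a minimal sketch of the daily routine described above on a 1.0.x cluster; the keyspace and column family names (MyKeyspace/WideCF) are hypothetical, and gc_grace is set through cassandra-cli since this predates CQL3 ALTER:

    # set a short gc_grace on the wide-row CF (statements piped into cassandra-cli)
    echo "use MyKeyspace; update column family WideCF with gc_grace = 600;" | cassandra-cli -h localhost
    # force the daily major compaction on just that CF
    nodetool -h localhost compact MyKeyspace WideCF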
Re: Decommissioned node starts to appear from one node (1.0.11)
I found this bug; it seems it is fixed. But in my situation I can still see the decommissioned node in the JMX console LoadMap attribute. Might this be the reason why Hector says there are not enough replicas? Experts, any thoughts? Thanks.
Re: Decommissioned node starts to appear from one node (1.0.11)
Not sure I understand you correctly, but if you are dealing with ghost nodes that you want to remove, I have never seen a node that could resist an unsafeAssassinateEndpoint. http://grokbase.com/t/cassandra/user/12b9eaaqq4/remove-crashed-node http://grokbase.com/t/cassandra/user/133nmsm3hd/removing-old-nodes I hope this helps; I have no clue why this is happening, as I am not one of those experts you asked for ;-). Alain
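For anyone who has not used it: unsafeAssassinateEndpoint is an operation on the Gossiper MBean, so in this era it has to be invoked over JMX rather than through nodetool. A sketch using the jmxterm CLI; the jar name, ghost-node address, and default JMX port 7199 are assumptions:

    # drive jmxterm from stdin; replace 10.0.0.5 with the ghost node's IP
    java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7199 <<'EOF'
    bean org.apache.cassandra.gms:type=Gossiper
    run unsafeAssassinateEndpoint 10.0.0.5
    EOF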
Re: (unofficial) Community Poll for Production Operators : Repair
@Rob: Thanks for the feedback. Yet I still have a weird, unexplained behavior around repair. Are counters supposed to be repaired too? I mean, while reading at CL.ONE I can get different values depending on which node answers, even after a read repair or a full repair. Shouldn't a repair fix these discrepancies? The only way I found to always get the same count is to read at CL.QUORUM, but this is a workaround, since the data itself remains wrong on some nodes. Any clue? Alain

2013/5/15 Edward Capriolo edlinuxg...@gmail.com: http://basho.com/introducing-riak-1-3/ introduced Active Anti-Entropy. "Riak now has active anti-entropy. In distributed systems, inconsistencies can arise between replicas due to failure modes, concurrent updates, and physical data loss or corruption. Pre-1.3 Riak already had several features for repairing this entropy, but they all required some form of user intervention. Riak 1.3 introduces automatic, self-healing properties that repair entropy on an ongoing basis."

On Wed, May 15, 2013 at 5:32 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: "Rob, I was wondering something. Are you a committer working on improving repair or something similar?" I am not a committer [1], but I have an active interest in potential improvements to the best practices for repair. The specific change I am considering is a modification to the default gc_grace_seconds value, which seems picked out of a hat at 10 days. My view is that the current implementation of repair has such negative performance consequences that I do not believe holding onto tombstones for longer than 10 days could possibly be as bad as the fixed cost of running repair once every 10 days. I believe this value is too low for a default (it also does not map cleanly to the work week!) and should likely be increased to 14, 21, or 28 days. "Anyway, if a committer (or any other expert) could give us some feedback on our comments (are we doing well or not, whether the things we observe are normal or unexplained, what is going to be improved about repair in the future...)" 1) You are doing things according to best practice. 2) Unfortunately your experience with significantly degraded performance, including a blocked go-live due to repair bloat, is pretty typical. 3) The things you are experiencing are part of the current implementation of repair and are also typical; however, I do not believe they are fully explained [2]. 4) As has been mentioned further down the thread, there are discussions regarding (and some already committed) improvements to both the current repair paradigm and an evolution to a new paradigm. Thanks to all for the responses so far, please keep them coming! :D =Rob [1] Hence the (unofficial) tag for this thread. I do have minor patches accepted to the codebase, but always merged by an actual committer. :) [2] driftx@#cassandra feels that these things are explained/understood by the core team, and points to https://issues.apache.org/jira/browse/CASSANDRA-5280 as a useful approach to minimize same.
vnodes ready for production?
Hi, Vnodes are a big improvement to Cassandra for us, specifically because we have a fluctuating load on our cluster depending on the week, and it is quite annoying to add nodes for a week or two, move tokens, and then have to remove them and move tokens again. It would be even better if we could automate some of the up-scaling with AWS alarms. We don't use vnodes yet because OpsCenter did not support the feature and because we need a reliable production. Now OpsCenter handles vnodes. Are the vnodes feature and the tokens -> vnodes transition safe enough to go live with vnodes? What would the transition process be? Does anyone auto-scale their Cassandra cluster? Any advice about vnodes?
best practices on EC2 question
From this list and the NYC* conference, it seems the consensus configuration for C* on EC2 is to put the data on an ephemeral drive and then periodically back the drive up to S3, relying on C*'s inherent fault tolerance to deal with any data loss. Fine, and we're doing this, but we find that transfer rates from S3 back to a rebooted server instance are very slow: about 15 MB/second, or roughly a minute per gigabyte. Calling EC2 support resulted in them saying "sorry, that's how it is." I'm wondering a) whether anyone has found a faster way to transfer from S3, or b) whether people skip backups altogether except for huge outages and just let rebooted server instances come up empty and repopulate via C*?

An alternative we explored for a while was a two-stage backup: 1) copy a C* snapshot from the ephemeral drive to an EBS drive, 2) do an EBS snapshot to S3. The idea being that EBS is quite reliable, S3 is still the emergency backup, and copying back from EBS to ephemeral is likely much faster than the 15 MB/sec we get from S3. Thoughts? Brian
SSTable size versus read performance
Hi all, I currently have 2 clusters: one running 1.1.10 using CQL2, and one running 1.2.4 using CQL3 and vnodes. The machines in the 1.2.4 cluster are expected to have better IO performance, as we are going from 1 SSD data disk per node in the 1.1 cluster to 3 higher-end SSD data disks per node in the 1.2 cluster (commit logs are on their own disk shared with the OS). I am doing some stress testing on the 1.2 cluster and have found that although the reads/sec as seen from iostat are approximately the same (3K/sec) in both clusters, the MB/s read in the new cluster is MUCH higher (7 MB/s in 1.1 as compared to 30-50 MB/s in 1.2). As a result, I am seeing excessive iowait in the 1.2 cluster, causing high average read times of 30 ms under the same load (the 1.1 cluster sees around 5 ms). They are both using leveled compaction, but one thing I did change in the new cluster was to increase the sstable size from the OOTB setting to 32 MB. Note that my reads are by definition highly random, as we are running memcached in front for various reasons. Does Cassandra need to read the entire SSTable when fetching a row, or only the relevant chunk (I have the OOTB chunk size and BF settings)? I just decreased the sstable size to 5 MB and am waiting for compactions to complete to see if that makes a difference. Thanks!

Relevant table definition if helpful (note that I also changed to the LZ4 compressor expecting better read performance, and I decreased the crc check chance to minimize read latency):

CREATE TABLE global_user (
    user_id BIGINT,
    app_id INT,
    type TEXT,
    name TEXT,
    last TIMESTAMP,
    paid BOOLEAN,
    values map<TIMESTAMP,FLOAT>,
    sku_time map<TEXT,TIMESTAMP>,
    extra_param map<TEXT,TEXT>,
    PRIMARY KEY (user_id, app_id, type, name)
) WITH compression = {'crc_check_chance': 0.1, 'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}
  AND compaction_strategy_options = {'sstable_size_in_mb': 5}
  AND gc_grace_seconds = 86400;
Re: SSTable size versus read performance
I am not sure if the new default is to use compression, but I do not believe compression is a good default. I find compression is better for larger column families that are sparsely read. For high-throughput CFs I feel that decompressing larger blocks hurts performance more than compression adds.
Re: SSTable size versus read performance
The biggest reason I'm using compression here is that my data lends itself well to it due to the composite columns. My current compression ratio is 30.5%. Not sure it matters, but my BF false positive ratio is 0.048.
Re: SSTable size versus read performance
When you use compression you should play with your block size. I believe the default may be 32K, but I had more success with 8K: nearly the same compression ratio, less young-gen memory pressure.
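For anyone wanting to try this, a sketch of changing the block (chunk) size on an existing CQL3 table; the keyspace name is an assumption, and already-written sstables keep their old chunk size until they are rewritten:

    cqlsh <<'EOF'
    ALTER TABLE myks.global_user WITH compression =
        {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': '8'};
    EOF
    # rewrite existing sstables so the new chunk size takes effect
    # (upgradesstables may skip current-version files; scrub also rewrites)
    nodetool upgradesstables myks global_user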
Re: SSTable size versus read performance
Does Cassandra need to load the entire SSTable into memory to uncompress it, or does it only load the relevant block? I ask because if it's the latter, that would not explain why I'm seeing so much higher read MB/s in the 1.2 cluster, as the block sizes are the same in both.
Re: (unofficial) Community Poll for Production Operators : Repair
Might you be experiencing this? https://issues.apache.org/jira/browse/CASSANDRA-4417 /Janne
Re: best practices on EC2 question
Yup, this is what we do. We use rsync with --bwlimit=4000 to copy the snapshots from the ephemeral drive to EBS; this is intentionally very low so that the backup process does not eat our I/O. This is on m1.xlarge instances; YMMV, so measure :). EBS drives are then snapshot with ec2-consistent-snapshot, and old snapshots are expired using ec2-expire-snapshots (I believe these scripts are from Alestic). /Janne
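Roughly, the moving parts described above look like the following; the paths, keyspace, bandwidth cap, and EBS volume id are all assumptions, and ec2-consistent-snapshot is the Alestic tool mentioned:

    nodetool -h localhost snapshot MyKeyspace
    # low bwlimit so the copy does not eat I/O needed by Cassandra
    rsync -a --bwlimit=4000 /var/lib/cassandra/data/MyKeyspace/ /mnt/ebs-backup/MyKeyspace/
    # then snapshot the EBS volume to S3
    ec2-consistent-snapshot --description "cassandra-$(date +%F)" vol-0123abcd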
Re: (unofficial) Community Poll for Production Operators : Repair
I indeed had some of those in the past. But my point is not so much to understand how I can get different counts depending on the node (I consider this a weakness of counters and I am aware of it); my question is more about why those inconsistent, distinct counters never converge, even after a repair. Your last comment on that JIRA summarizes our problem quite well. I hope the committers will figure something out.
Re: Major compaction does not seem to free much disk space when wide rows are used.
Boris, We hit exactly the same issue, and you are correct: the newly created SSTables are the reason most of the column tombstones are not being purged. There is an improvement in the 1.2 train where both the minimum and maximum timestamps for a row are now stored and used during compaction to determine whether that portion of the row can be purged. However, this only appears to help so much; the other restriction on major compaction, that all the files encompassing the deleted rows must be part of the compaction for the row to be purged, still remains. We have switched to column deletes rather than row deletes wherever practical. A little more work in the app, but a big improvement in reads due to much more efficient compaction. Regards, Jacques
Re: SSTable size versus read performance
My 5 cents: I'd check blockdev --getra for the data drives; too-high values for read-ahead (defaults to 256 on Debian) can hurt read performance.
Re: SSTable size versus read performance
We actually have it set to 512. I have tried decreasing my SSTable size to 5 MB and changing the chunk size to 8 KB (and ran upgradesstables to ensure they took effect), but am still seeing similar performance. Is anyone running LZ4 compression in production? I'm thinking of reverting back to Snappy to see if that makes a difference. I appreciate all of the help!
Re: SSTable size versus read performance
512 sectors for read-ahead. Are your new fancy SSD drives using large sectors? If your read-ahead is really reading 512 x 4 KB per random IO, then that 2 MB per read seems like a lot of extra overhead. -Bryan
Re: SSTable size versus read performance
I was going to say something similar: I feel like the SSD drives read much more than the standard drives. Read-ahead/large sectors could, and probably do, explain it.
Re: SSTable size versus read performance
Just in case it is useful to somebody, here is my checklist for better read performance from SSDs:

1. limit read-ahead to 16 or 32
2. enable 'trickle_fsync' (available starting from Cassandra 1.1.x)
3. use the 'deadline' io-scheduler (much more important for rotational drives than for SSDs)
4. format the data partition starting on a 2048-sector boundary
5. use ext4 with the noatime,nodiratime,discard mount options
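Applied as commands, items 1, 3, and 5 of that checklist might look like the following (the device name and data mount point are assumptions; item 2, trickle_fsync, is a cassandra.yaml setting, not a shell command):

    blockdev --setra 32 /dev/sdb                                    # 1. cap read-ahead at 32 sectors (16 KB)
    echo deadline > /sys/block/sdb/queue/scheduler                  # 3. deadline io-scheduler
    mount -o remount,noatime,nodiratime,discard /var/lib/cassandra  # 5. ext4 mount options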
Re: Major compaction does not seem to free much disk space when wide rows are used.
This makes sense. Unless you are running a major compaction, a delete can only happen if the bloom filters confirm the row is not in the sstables not being compacted. If your rows are wide, the odds are that they are in most or all sstables, so finally removing them is tricky.
Re: SSTable size versus read performance
Thank you for that. I did not have trickle_fsync enabled and will give it a try. I just noticed that when running a describe on my table, I do not see the sstable size parameter (compaction_strategy_options = {'sstable_size_in_mb': 5}) included. Is that expected? Does it mean it's using the defaults? Assuming none of the tuning here makes a noticeable difference, my next step is to try switching from LZ4 to Snappy. Any opinions on that? Thanks!

CREATE TABLE global_user (
    user_id bigint,
    app_id int,
    type text,
    name text,
    extra_param map<text, text>,
    last timestamp,
    paid boolean,
    sku_time map<text, timestamp>,
    values map<timestamp, float>,
    PRIMARY KEY (user_id, app_id, type, name)
) WITH bloom_filter_fp_chance=0.10 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.00 AND
    gc_grace_seconds=86400 AND
    read_repair_chance=0.10 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'class': 'LeveledCompactionStrategy'} AND
    compression={'chunk_length_kb': '8', 'crc_check_chance': '0.1', 'sstable_compression': 'LZ4Compressor'};
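One way to check what the server actually stored for the compaction options is to query the schema tables directly; a sketch, with the keyspace name as an assumption:

    cqlsh <<'EOF'
    SELECT compaction_strategy_class, compaction_strategy_options
    FROM system.schema_columnfamilies
    WHERE keyspace_name = 'myks' AND columnfamily_name = 'global_user';
    EOF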
Re: SSTable size versus read performance
LZ4 is supposed to achieve similar compression while using fewer resources than Snappy. It is easy to test: just change it and then run a 'nodetool rebuild'. Not sure when LZ4 was introduced, but given that it is new to Cassandra there may not be many large deployments running it yet.
Re: Upgrade 1.1.10 -> 1.2.4
But the problem is that I would like to use Cassandra embedded. Is this not possible any more?

2013/5/15 Edward Capriolo edlinuxg...@gmail.com: You are doing something wrong. What I was suggesting is only a hack for unit tests. You're not supposed to interact with CassandraServer directly like that as a client. Download Hector and use the correct client libraries.

On Wed, May 15, 2013 at 5:13 PM, Everton Lima peitin.inu...@gmail.com wrote: But using this code: ThriftSessionManager.instance.setCurrentSocket(new InetSocketAddress(9160)); will I need to execute this line every time I need to do something in Cassandra, like update a column family? Thanks for the reply.

2013/5/15 Edward Capriolo edlinuxg...@gmail.com: If you are using Hector, it can set up the embedded server properly. When using the server directly inside Cassandra I have run into a similar problem: https://github.com/edwardcapriolo/cassandra/blob/range-tombstone-thrift/test/unit/org/apache/cassandra/thrift/EndToEndTest.java

@BeforeClass
public static void setup() throws IOException, InvalidRequestException, TException {
    Schema.instance.clear(); // Schema are now written on disk and will be reloaded
    new EmbeddedCassandraService().start();
    ThriftSessionManager.instance.setCurrentSocket(new InetSocketAddress(9160));
    server = new CassandraServer();
    server.set_keyspace("Keyspace1");
}

On Wed, May 15, 2013 at 4:24 PM, Everton Lima peitin.inu...@gmail.com wrote: Hello, can someone help me use the CassandraServer object in version 1.2.4? I was using it in version 1.1.10 and that worked, but something was happening that I could not solve (sometimes my CPU went up to 100% and stayed there forever), so I decided to do the upgrade. I start Cassandra with the embedded Cassandra service. The actual error: when the code calls

public ThriftClientState currentSession() {
    SocketAddress socket = remoteSocket.get();
    assert socket != null;
    ThriftClientState cState = activeSocketSessions.get(socket);
    if (cState == null) {
        cState = new ThriftClientState();
        activeSocketSessions.put(socket, cState);
    }
    return cState;
}

the variable socket is null. This method is called with:

CassandraServer cs = new CassandraServer();
cs.describe_keyspace()

-- Everton Lima Aleixo, BSc in Computer Science (UFG), MSc student in Computer Science (UFG), programmer at LUPA
Re: Exception when running YCSB and Cassandra
Your nodes are overloaded. I'd recommend using m1.xlarge instead. Cheers - Aaron Morton, Freelance Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 15/05/2013, at 1:59 PM, Rodrigo Felix rodrigofelixdealme...@gmail.com wrote: Hi, I'm executing a workload on YCSB (50% read, 50% update) and after a few minutes I get the following exception:

TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7174)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:540)
    at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:512)
    at com.yahoo.ycsb.db.CassandraClient10.read(CassandraClient10.java:259)
    at com.yahoo.ycsb.DBWrapper.read(DBWrapper.java:84)
    at com.yahoo.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:469)
    at com.yahoo.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:425)
    at com.yahoo.ycsb.ClientThread.run(ClientThread.java:105)

I have 2 seeds on Amazon EC2 (large instances) and, depending on demand, I add (or remove) new large instances. Any suggestion to solve this problem or to tune Cassandra? Further info about the Cassandra install follows. Thanks in advance.

INFO 00:54:05,591 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_07
INFO 00:54:05,592 Heap size: 1931476992/1931476992
INFO 00:54:07,447 Cassandra version: 1.1.5
INFO 00:54:07,448 Thrift API version: 19.32.0

Att. Rodrigo Felix de Almeida, LSBD - Universidade Federal do Ceará, Project Manager, MBA, CSM, CSPO, SCJP
Re:
Try the IRC room for the java-driver or submit a ticket in the JIRA system; see the links here: https://github.com/datastax/java-driver Cheers - Aaron Morton, Freelance Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 15/05/2013, at 5:50 PM, bjbylh bjb...@me.com wrote: Hello all: I use the DataStax java-driver to connect to C*. When the program calls cluster.shutdown(), it prints out java.lang.NoSuchMethodError: org.jboss.netty.channelFactory.shutdown()V, but I do not know why... C* is 1.2.4, java-driver is 1.0.0. Thank you.
Re: how to access data only on specific node
Are you using a multiget or a range slice? Read repair does not run for range slice queries. Cheers - Aaron Morton, Freelance Cassandra Consultant, New Zealand, @aaronmorton, http://www.thelastpickle.com

On 15/05/2013, at 6:51 PM, Sergey Naumov sknau...@gmail.com wrote: "I see that RR works, but sometimes the number of records that have been read degrades." "RR is enabled on a random 10% of requests, see the read_repair_chance setting for the CF." OK, but I forgot to mention the main thing: each node in my config is a standalone datacenter and the distribution is DC1:1, DC2:1, DC3:1. So when I try to read 1000 records with consistency ONE multiple times while connected to a node that has just been turned on, I get the following counts of records read (approximately): 120, 220, 310, 390, 950, 960, 965 !!, 955 !!, 970, ... If all other nodes contain 1000 records and read repair has already delivered 965 records to the local DC (and so to the local node), why do I sometimes see the total records read degrade?

2013/5/15 aaron morton aa...@thelastpickle.com: "I see that RR works, but sometimes the number of records that have been read degrades." RR is enabled on a random 10% of requests; see the read_repair_chance setting for the CF. "If so, then the question is: how do I perform local reads to examine the content of a specific node?" You can check which nodes are replicas for a key using nodetool getendpoints. If you want to read all the rows for a particular node you need to use a range scan and limit it by the token ranges assigned to the node. Cheers - Aaron Morton

On 14/05/2013, at 10:29 PM, Sergey Naumov sknau...@gmail.com wrote: Hello. I am playing with a demo Cassandra cluster and decided to test read repair + hinted handoff. One node of the cluster was put down deliberately, and on the other nodes I inserted some records (say 1000). HH is off on all nodes. Then I turned the node on, connected to it with cql (locally, so to localhost) and performed 1000 reads by row key (with consistency ONE). I see that RR works, but sometimes the number of records that have been read degrades. Is it because consistency ONE and local reads are not the same thing? If so, then the question is: how do I perform local reads to examine the content of a specific node? Thanks in advance, Sergey Naumov.
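For example, to see which nodes are replicas for a given key (the keyspace, CF, and key names here are placeholders):

    nodetool -h localhost getendpoints MyKeyspace MyCF somekey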
Re: The action of the file system at drop column family execution
"When drop column family is executed, the $KS/$CF/ directory remains, regardless of whether a snapshot is generated." I don't think there is any code there to delete the empty directories. We only care about the files in there. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 15/05/2013, at 7:41 PM, hiroshi.kise...@hitachi.com wrote: Dear Aaron Morton, I'm Hiroshi. Thank you for the reply. Regarding the $KS/$CF/snapshots directory, namely under C:\var\lib\cassandra\data\MyKeyspace\testcf1, dir command execution shows:

    2013/05/09 14:04 <DIR> .
    2013/05/09 14:04 <DIR> ..
    0 File(s) 0 bytes
    2 Dir(s) 139,587,530,752 bytes free

so no snapshot was generated. To repeat my question (I am sorry): when drop column family is executed, the $KS/$CF/ directory remains, regardless of whether a snapshot was generated. Is this the intended behavior? -- Hiroshi Kise

Date: Wed, 15 May 2013 05:32:50 +0900, aaron morton aa...@thelastpickle.com wrote: --- Begin of replied message --- "Although the directory (column family: testcf1) remains, is that acceptable on the file system?" A snapshot is taken when a truncate or drop command is run. You should see a $KS/$CF/snapshots directory. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 15/05/2013, at 12:45 AM, hiroshi.kise...@hitachi.com wrote: Hi everyone. Although it may be a stupid question, please give me instruction. [1. Question] A column family was deleted (drop column family testcf1;). The result (dir command execution):

    <DIR> testcf1
    0 File(s) 0 bytes
    3 Dir(s) 139,587,596,288 bytes free

* Is this the correct behavior for Cassandra?
* Although the directory (column family: testcf1) remains, is that acceptable on the file system?

[2. Pre-processing] (Cassandra-CLI was used.) First, a keyspace was created (create keyspace MyKeyspace;), the keyspace was selected (use MyKeyspace;), and the column family was created (create column family testcf1;). [3. Environment] Cassandra 1.2.4, OS: Windows 7. Thank you for your consideration. -- Hiroshi Kise --- End of replied message ---
Re: How to add new DC to cluster when GossipingPropertyFileSnitch is used
You should configure the seeds as recommended regardless of the snitch used. You need to update the yaml file to start using the GossipingPropertyFileSnitch, but after that it reads the cassandra-rackdc.properties file to get information about the local node. It uses the information in gossip to get information about the other nodes in the cluster. If there is no info in gossip about a remote node, because say it has not been upgraded, it will fall back to using cassandra-topology.properties. Hope that helps. - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 15/05/2013, at 8:10 PM, Sergey Naumov sknau...@gmail.com wrote: As far as I understand, GossipingPropertyFileSnitch is supposed to provide more flexibility in node addition/removal. But what about the addition of a DC? In the datastax documentation (http://www.datastax.com/docs/1.2/operations/add_replace_nodes#add-dc) it is said that cassandra-topology.properties can be updated without a restart for PropertyFileSnitch. But here (http://www.datastax.com/docs/1.0/initialize/cluster_init_multi_dc) it is said that you MUST include at least one node from EACH data center, that it is a best practice to have more than one seed node per data center, and that the seed list should be the same for each node. At first glance it seems that PropertyFileSnitch will get the necessary info from cassandra-topology.properties, but for GossipingPropertyFileSnitch a modification of cassandra.yaml and a restart of all nodes in all DCs will be required. Could somebody clarify this topic? Thanks in advance, Sergey Naumov.
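For reference, under GossipingPropertyFileSnitch the cassandra-rackdc.properties file describes only the local node (the values below are placeholders), and each node gossips its own entry to the rest of the cluster:

    # cassandra-rackdc.properties - describes this node only
    dc=DC2
    rack=RAC1

So adding a DC does not require editing a topology file on every existing node; but switching an existing cluster from another snitch to GossipingPropertyFileSnitch does require the cassandra.yaml change and a restart, as the thread notes.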
Re: Multiple cursors
We don't have cursors in the RDBMS sense of things. If you are using thrift the recommendation is to use connection pooling and re-use connections for different requests. Note that you cannot multiplex queries over the same thrift connection; you must wait for the response before issuing another request. The native binary transport does allow multiplexing though. In general you should use one of the pre-built client libraries, as they will take care of connection pooling etc. for you: https://wiki.apache.org/cassandra/ClientOptions Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 16/05/2013, at 9:03 AM, Sam Mandes eng.salaman...@gmail.com wrote: Hello All, Is using multiple cursors simultaneously on the same C* connection a good practice? I have an internal API for a project running on Thrift, and I need to query something from C*. I don't want to create a new connection for every API request, so when my service starts I open a connection to C* and with every request I create a new cursor. Thanks a lot
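A minimal pooling sketch with the hector client (cluster name, host, and keyspace are placeholders): the pool is created once at startup, and every request borrows a pooled connection instead of opening its own:

    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class PooledClient {
        // Created once; hector maintains a thrift connection pool internally
        private static final Cluster CLUSTER =
                HFactory.getOrCreateCluster("TestCluster", "127.0.0.1:9160");
        private static final Keyspace KEYSPACE =
                HFactory.createKeyspace("MyKeyspace", CLUSTER);

        public static Keyspace keyspace() {
            // Handed to queries on each request; no new connection is opened
            return KEYSPACE;
        }
    }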
Re: C++ Thrift client
(Assuming you have enabled tcp_nodelay on the client socket.) Check the server-side latency using nodetool cfstats or nodetool cfhistograms. Check the logs for messages from the GCInspector about ParNew pauses. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 16/05/2013, at 12:58 PM, Bill Hastings bllhasti...@gmail.com wrote: Hi All, I am doing very small inserts into Cassandra, in the range of say 64 bytes. I use a C++ Thrift client and consistently see latencies anywhere between 35-45 ms. Could someone please advise as to what might be happening? Thanks
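The server-side checks Aaron mentions look like this (keyspace and column family names are placeholders):

    nodetool -h 127.0.0.1 cfstats
    nodetool -h 127.0.0.1 cfhistograms MyKeyspace MyCF

If the write latency reported there is well under the 35-45 ms seen at the client, the gap is most likely in the network path; for tiny payloads like 64 bytes, Nagle's algorithm batching small packets is a classic culprit, which is why tcp_nodelay on the client socket matters.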
Re:
what version of netty is on your classpath? On 05/16/2013 07:33 PM, aaron morton wrote: Try the IRC room for the java driver or submit a ticket on the JIRA system, see the links here https://github.com/datastax/java-driver Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 15/05/2013, at 5:50 PM, bjbylh bjb...@me.com wrote: hello all: I use the datastax java-driver to connect to C*. When the program calls cluster.shutdown(), it prints out: java.lang.NoSuchMethodError: org.jboss.netty.channelFactory.shutdown()V, but I do not know why... C* is 1.2.4, java-driver is 1.0.0. Thank you. Sent from Samsung Mobile
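Assuming the project builds with Maven, a quick way to answer that question (netty 3.x was published under two groupIds over time, so checking both is a safe bet):

    mvn dependency:tree -Dincludes=org.jboss.netty
    mvn dependency:tree -Dincludes=io.netty

Two different netty versions on the classpath, one missing the method the driver expects, would produce exactly this kind of NoSuchMethodError.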
Re: Decommission nodes starts to appear from one node (1.0.11)
Thanks. This is the kind of expert advice I was looking for. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Decommission-nodes-starts-to-appear-from-one-node-1-0-11-tp7587842p7587876.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: pycassa failures in large batch cycling
On Tue, 14 May 2013, aaron morton wrote: "After several cycles, pycassa starts getting connection failures." Do you have the error stack? Are they TimedOutExceptions, socket timeouts, or something else?

I figured out the problem here and made this ticket in jira: https://issues.apache.org/jira/browse/CASSANDRA-5575 Summary: the Thrift interfaces to Cassandra are simply not able to load large batches without putting the client into an infinite retry loop. It seems that the only robust solutions involve either features added to Thrift and all Cassandra clients, or a new interface mechanism. jrf
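The pycassa specifics are in the ticket, but the usual workaround pattern, sketched here in Java with the hector API (the column family name and the 500-column cap are illustrative assumptions, not values from the thread), is to flush mutations in bounded chunks so no single Thrift batch_mutate call grows unbounded:

    import java.util.Map;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class ChunkedWriter {
        private static final int BATCH_SIZE = 500; // illustrative cap

        // Write all columns for one row, flushing every BATCH_SIZE inserts
        public static void write(Keyspace ks, String rowKey, Map<String, String> columns) {
            Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());
            int pending = 0;
            for (Map.Entry<String, String> e : columns.entrySet()) {
                mutator.addInsertion(rowKey, "MyCF",
                        HFactory.createStringColumn(e.getKey(), e.getValue()));
                if (++pending >= BATCH_SIZE) {
                    mutator.execute(); // one bounded batch_mutate call
                    pending = 0;
                }
            }
            if (pending > 0) {
                mutator.execute(); // flush the remainder
            }
        }
    }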
Re: Upgrade 1.1.10 - 1.2.4
Please give an example of the code you are trying to execute. On Thu, May 16, 2013 at 6:26 PM, Everton Lima peitin.inu...@gmail.com wrote: But the problem is that I would like to use Cassandra embedded. Is this not possible any more? 2013/5/15 Edward Capriolo edlinuxg...@gmail.com You are doing something wrong. What I was suggesting is only a hack for unit tests. You're not supposed to interact with CassandraServer directly like that as a client. Download hector and use the correct client libraries. On Wed, May 15, 2013 at 5:13 PM, Everton Lima peitin.inu...@gmail.com wrote: But using this code:

    ThriftSessionManager.instance.setCurrentSocket(new InetSocketAddress(9160));

will I need to execute this line every time that I need to do something in Cassandra, like update a column family? Thanks for the reply. 2013/5/15 Edward Capriolo edlinuxg...@gmail.com If you are using hector it can set up the embedded server properly. When using the server directly inside Cassandra I have run into a similar problem: https://github.com/edwardcapriolo/cassandra/blob/range-tombstone-thrift/test/unit/org/apache/cassandra/thrift/EndToEndTest.java

    @BeforeClass
    public static void setup() throws IOException, InvalidRequestException, TException {
        Schema.instance.clear(); // Schema are now written on disk and will be reloaded
        new EmbeddedCassandraService().start();
        ThriftSessionManager.instance.setCurrentSocket(new InetSocketAddress(9160));
        server = new CassandraServer();
        server.set_keyspace("Keyspace1");
    }

On Wed, May 15, 2013 at 4:24 PM, Everton Lima peitin.inu...@gmail.com wrote: Hello, can someone help me use the object CassandraServer() in version 1.2.4? I was using it in version 1.1.10 and that worked, but something was happening that I could not solve (sometimes my CPU went up to 100% and stayed there forever), so I decided to upgrade. I start Cassandra with EmbeddedCassandraService. The actual error: the variable socket is null when the code calls

    public ThriftClientState currentSession() {
        SocketAddress socket = remoteSocket.get();
        assert socket != null;
        ThriftClientState cState = activeSocketSessions.get(socket);
        if (cState == null) {
            cState = new ThriftClientState();
            activeSocketSessions.put(socket, cState);
        }
        return cState;
    }

This method is reached via:

    CassandraServer cs = new CassandraServer();
    cs.describe_keyspace()

-- Everton Lima Aleixo Bachelor in Computer Science, UFG Master's student in Computer Science, UFG Programmer at LUPA
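Pulling the thread together, a minimal self-contained sketch of the pattern Edward describes, for 1.2.x (the "system" keyspace is a placeholder; the per-thread setCurrentSocket call is the key point, since currentSession() above reads the socket from what appears to be a thread-local):

    import java.net.InetSocketAddress;
    import org.apache.cassandra.service.EmbeddedCassandraService;
    import org.apache.cassandra.thrift.CassandraServer;
    import org.apache.cassandra.thrift.ThriftSessionManager;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            // Start an in-process Cassandra (1.2.x)
            new EmbeddedCassandraService().start();
            // Register a (fake) client socket for this thread; without it,
            // currentSession() sees a null remoteSocket and the assert fires
            ThriftSessionManager.instance.setCurrentSocket(new InetSocketAddress(9160));
            CassandraServer server = new CassandraServer();
            server.set_keyspace("system");
        }
    }

So the answer to Everton's question is yes in spirit: each thread that talks to CassandraServer directly must register a socket first, which is part of why this is a unit-test hack rather than a supported client path.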
Announcing Mutagen
Mutagen Cassandra is a framework providing schema versioning and mutation for Apache Cassandra. It is similar to Flyway for SQL databases. https://github.com/toddfast/mutagen-cassandra Mutagen is a lightweight framework for applying versioned changes (known as mutations) to a resource, in this case a Cassandra schema. Mutagen takes into account the resource's existing state and only applies changes that haven't yet been applied. Schema mutation with Mutagen helps you make manageable changes to the schema of live Cassandra instances as you update your client software, and is especially useful when used across development, test, staging, and production environments to automatically keep schemas updated. This is a minimal but functional initial release, and I appreciate bug reports, suggestions and pull requests. Best, Todd
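A hypothetical usage sketch (the class and method names below are illustrative guesses, not confirmed against the project; see the GitHub README for the real API): versioned mutation scripts live on the classpath, and at startup the application asks Mutagen to apply whichever ones the live schema has not yet seen:

    // Hypothetical API - check the mutagen-cassandra README for actual names
    CassandraMutagen mutagen = new CassandraMutagenImpl();
    mutagen.initialize("com/example/schema/mutations"); // classpath folder of versioned scripts
    mutagen.mutate(keyspace); // applies only not-yet-applied mutations, in version order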