Re: Nodetool ring and Replicas after 1.2 upgrade
Thanks Jason. No errors in the log. The nodes also have a consistent schema for the keyspace (although this was a problem during the upgrade, which we resolved using the procedure specified here: https://wiki.apache.org/cassandra/FAQ#schema_disagreement).

-Mike

From: Jason Wee peich...@gmail.com
To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com
Sent: Tuesday, June 16, 2015 12:07 AM
Subject: Re: Nodetool ring and Replicas after 1.2 upgrade

Maybe check the system.log to see if there is any exception and/or error? Check as well whether the nodes have a consistent schema for the keyspace?

hth
jason
Re: Nodetool ring and Replicas after 1.2 upgrade
After looking at the Cassandra code a little, I believe this is not really an issue. After the upgrade to 1.2, we still see the behavior described in this bug I filed: https://issues.apache.org/jira/browse/CASSANDRA-5264

The Replicas figure is calculated by adding up the effective ownership of all the nodes and chopping off the remainder. So if your effective ownership is 299.99%, it appears the code will report the number of replicas as 2. This might become a reliable 3 once I finish running repairs after the upgrade.

Thanks for your time,
-Mike

From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com
Sent: Tuesday, June 16, 2015 4:43 PM
Subject: Re: Nodetool ring and Replicas after 1.2 upgrade

Hi Michael,

I can barely access the internet right now and was not able to check outputs on my computer, but the first thing that comes to mind is that since 1.2.x (and vnodes) I tend to use nodetool status instead. What is the nodetool status output? Also, did you try specifying the keyspace? Since RF is a per-keyspace value, maybe that would help. Other than that, I don't have any idea. I don't remember anything similar, but it was a while ago.

I have to ask... why stay so far behind the current stable / production-ready version?

C*heers,
Alain
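The effective-ownership arithmetic described above can be sketched in a few lines. This is a hypothetical reconstruction of the rounding behavior, not the actual nodetool source:

```python
# Hypothetical sketch of the nodetool "Replicas" rounding described above:
# sum each node's effective ownership percentage and truncate the total
# down to a whole number of replicas (truncation, not rounding).
def replicas_reported(effective_ownership_percents):
    total = sum(effective_ownership_percents)  # e.g. 299.99 for RF=3
    return int(total / 100)

# Six nodes at exactly 50% each -> 300% total -> reported as 3 replicas
print(replicas_reported([50.0] * 6))
# Tiny ownership drift -> 299.99% total -> reported as 2 replicas
print(replicas_reported([50.0] * 5 + [49.99]))
```

This would explain the field flapping between 2 and 3: a fraction of a percent of ownership drift is enough to change the truncated total.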
Nodetool ring and Replicas after 1.2 upgrade
Hello,

We (finally) have just upgraded from Cassandra 1.1 to Cassandra 1.2.19. Everything appears to be up and running normally; however, we have noticed unusual output from nodetool ring. There is a new (to us) Replicas field in the nodetool output, and this field, seemingly at random, changes from 2 to 3 and back to 2.

We are using the byte-ordered partitioner (we hash our own keys) and have a replication factor of 3. We are also on AWS and use the Ec2Snitch in a single datacenter. Other calls appear normal: nodetool getendpoints returns the proper endpoints when querying various keys, and nodetool ring and status report that all nodes are healthy.

Anyone have any hints on what may be happening, or whether this is a problem we should be concerned with?

Thanks,
-Mike
Re: VPC AWS
Hello Alain,

We switched from EC2 to VPC a couple of years ago. The process for us was long, slow, and multi-step for our (at the time) 6-node cluster. In our case, we didn't need to consider multi-DC. However, in our infrastructure we were rapidly running out of IP addresses and wished to move to VPC to give us a nearly inexhaustible supply. In addition, AWS VPC gives us an additional layer of security for our Cassandra cluster.

To do this, we set up our VPC to have both private and public subnets. Public subnets were accessible to the Internet (when instances were assigned a public IP), while private subnets were not (although instances on a private subnet could access the Internet via a NAT instance). We wanted Cassandra to be on the private subnet.

However, this introduced a complication: EC2 instances would not be able to communicate directly with our VPC instances on a private subnet. So, to achieve this while still having an operating Cassandra DB without downtime, we essentially had to stage Cassandra instances on our public subnet, assigning IPs and reconfiguring nodes until we had a mixed EC2 / VPC-public-subnet cluster, then start moving systems to the private subnet, continuing the process until all instances were on a private subnet. During the process we carefully orchestrated configuration like broadcast addresses and seeds to make sure the cluster continued to function properly and all nodes could communicate with each other. We also had to carefully orchestrate the assignment of AWS security groups to make sure everyone could talk to each other during this process.

Also keep in mind that the use of public IPs for communication will add to your AWS costs. During our transition we had to do this for a short time while EC2 instances were communicating with VPC instances, but we were able to switch to 100% internal IPs when we completed (you will still get inter-availability-zone charges regardless).

This process was complex enough that I wrote a detailed series of steps for each node in our cluster.

-Mike

From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org
Sent: Thursday, June 5, 2014 8:12 AM
Subject: VPC AWS

Hi guys,

We are going to move from a cluster made of simple Amazon EC2 servers to a VPC cluster. We are using Cassandra 1.2.11 and I have some questions regarding this switch and the Cassandra configuration inside a VPC. I actually found no documentation on this topic, but I am quite sure that some people are already using VPC. If you can point me to any documentation regarding VPC / Cassandra, it would be very nice of you.

We have only one DC for now, but we need to remain multi-DC compatible, since we will add a DC very soon. I would also like to know whether I should keep using EC2MultiRegionSnitch or change the snitch to something else. What about broadcast/listen IPs and seeds? We currently use public IPs for the broadcast address and for seeds, and private ones for the listen address. Machines inside the VPC will only have private IPs, AFAIK. Should I keep using a broadcast address? Is there any other consequence of switching to a VPC?

Sorry if the topic was already discussed; I was unable to find any useful information...
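The broadcast/listen distinction in this thread can be sketched as a small piece of per-node bookkeeping. The cassandra.yaml keys (listen_address, broadcast_address) are real; the addresses and the templating function are made up for illustration, not a tested migration tool:

```python
# Hedged sketch of the per-node address bookkeeping for a mixed EC2/VPC
# ring. While EC2-classic and VPC instances must reach each other over
# public IPs, the public address is broadcast; once every node sits on the
# private subnet, broadcasting the private IP avoids public-traffic charges.
def node_settings(private_ip, public_ip, mixed_ring):
    return {
        "listen_address": private_ip,                 # interface we bind to
        "broadcast_address": public_ip if mixed_ring  # what peers are told
                             else private_ip,
    }

print(node_settings("10.0.1.5", "54.0.0.7", mixed_ring=True))
print(node_settings("10.0.1.5", "54.0.0.7", mixed_ring=False))
```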
Re: VPC AWS
We personally use the EC2Snitch; however, we don't have the multi-region requirements you do.

-Mike

From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org
Sent: Thursday, June 5, 2014 9:14 AM
Subject: Re: VPC AWS

I think you can define a VPC subnet to be public (to have public + private IPs) or private only. Any insight regarding snitches? What snitch do you guys use?

2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com:

I don't think traffic will flow between classic EC2 and VPC directly. There is some kind of gateway bridge instance that sits between, acting as a NAT. I would think that would cause new challenges for:
- transitions
- clients

Sorry this response isn't heavy on content! I'm curious how this thread goes...

Will

--
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com
Re: VPC AWS
The implementation of moving from EC2 to a VPC was a bit of a juggling act. Our motivation was twofold:

1) We were running out of static IP addresses, and it was becoming increasingly difficult in EC2 to design around limiting the number of static IP addresses to the number of public IP addresses EC2 allowed.
2) VPC affords us an additional level of security that was desirable.

However, we needed to consider the following limitations:

1) By default, you have a limited number of available public IPs for both EC2 and VPC.
2) AWS security groups need to be configured to allow traffic for Cassandra to/from instances in EC2 and the VPC.

You are correct at the high level that the migration goes from EC2 -> public VPC (VPC with an Internet gateway) -> private VPC (VPC with a NAT). The first phase was moving instances to the public VPC, setting broadcast addresses and seeds to the public IPs we had available. Basically:

1) Take down a node, taking a snapshot for a backup.
2) Restore the node in the public VPC, assigning it to the correct security group and manually setting the seeds to other available nodes.
3) Verify the cluster can communicate.
4) Repeat.

Realize that the NAT instance on the private subnet will also require a public IP. What got really interesting is that near the end of the process we ran out of available IPs, requiring us to switch the final node that was on EC2 directly to the private VPC (taking down two nodes at once, which our setup allowed given we had 6 nodes with an RF of 3).

What we did, and highly suggest for the switch, is to write down every step that has to happen on every node during the switch. In our case, many of the moved nodes required slightly different configurations for items like the seeds.

It's been a couple of years, so my memory on this may be a little fuzzy :)

-Mike

From: Aiman Parvaiz ai...@shift.com
To: user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com
Sent: Thursday, June 5, 2014 12:55 PM
Subject: Re: VPC AWS

Michael,

Thanks for the response. I am about to head into something very similar, if not exactly the same, and I envision things happening along the same lines as you mentioned. I would be grateful if you could throw some more light on how you went about switching Cassandra nodes from the public subnet to the private one without any downtime. I have not started on this project yet; I am still in my research phase. I plan to have an EC2 + public-VPC cluster and then decommission the EC2 nodes to have everything in the public subnet; the next step would be to move it all to the private subnet.

Thanks
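The "write down every step for every node" advice above can be turned into a tiny script that renders a per-node checklist, since each node's seeds differ slightly. Everything here is illustrative (node names, wording of the steps); it is a note-taking aid, not a tested migration tool:

```python
# Sketch of a per-node migration runbook generator, following the four
# numbered steps above. Only nodetool snapshot/ring are real commands;
# the rest are human-readable steps, not shell commands.
def runbook(node, seeds):
    return [
        f"on {node}: nodetool snapshot   (backup before taking the node down)",
        f"stop cassandra on {node}",
        f"restore {node} in the public VPC, assign the correct security group",
        f"set seeds to {', '.join(seeds)} in cassandra.yaml",
        f"start cassandra and verify the ring with: nodetool ring",
    ]

for step in runbook("node1", ["node2", "node3"]):
    print(step)
```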
Re: VPC AWS
You can have a ring spread across EC2 and the public subnet of a VPC; that is how we did our migration. In our case, we simply replaced each existing EC2 node with a new instance in the public VPC, restored from a backup taken right before the switch.

-Mike

From: Aiman Parvaiz ai...@shift.com
To: Michael Theroux mthero...@yahoo.com
Cc: user@cassandra.apache.org
Sent: Thursday, June 5, 2014 2:39 PM
Subject: Re: VPC AWS

Thanks for this info, Michael. As far as restoring nodes in the public VPC is concerned, I was thinking (and I might be wrong here) that if we can have a ring spread across EC2 and the public subnet of a VPC, then I can simply decommission nodes in EC2 as I gradually introduce new nodes in the public subnet of the VPC. I would end up with a ring in the public subnet and could then migrate from public to private in a similar way. If anyone has any experience/suggestions with this, please share; I would really appreciate it.

Aiman
Re: cassandra backup
Hi Marcelo,

Cassandra provides an eventually consistent model for backups. You can do staggered backups of data, with the idea that if you restore a node and then run a repair, your data will once again be consistent. Cassandra will not automatically copy the data to other nodes (other than via hinted handoff); you should manually run repair after restoring a node.

You should take snapshots when doing a backup, as a snapshot keeps the data you are backing up consistent with a single point in time; otherwise compaction could add/delete files out from under you mid-backup, or worse, I imagine, you could attempt to access an SSTable mid-write. Snapshots work using hard links and don't take additional storage to perform. In our process we create the snapshot, perform the backup, and then clear the snapshot.

One thing to keep in mind in your S3 cost analysis: even though storage is cheap, reads/writes to S3 are not (especially writes). If you are using LeveledCompaction, or otherwise have a ton of SSTables, some people have encountered increased costs moving the data to S3. Ourselves, we maintain backup EBS volumes that we regularly snapshot/rsync data to. So far this has worked very well for us.

-Mike

On Friday, December 6, 2013 8:14 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hello everyone,

I am trying to create backups of my data on AWS. My goal is to store the backups on S3 or Glacier, as it's cheap to store this kind of data. So, if I have a cluster with N nodes, I would like to copy data from all N nodes to S3 and be able to restore it later. I know Priam does that (we were using it), but I am using the latest Cassandra version and we plan to use DSE some time, so I am not sure Priam fits this case.

I took a look at the docs: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html

And I am trying to understand whether it's really necessary to take a snapshot to create my backup. Suppose I do a flush and copy the sstables from each node, one by one, to S3. Not all at the same time, but one by one. When I try to restore my backup, data from node 1 will be older than data from node 2. Will this cause problems? AFAIK, if I am using a replication factor of 2, for instance, and Cassandra sees data on node X only, it will automatically copy it to the other nodes, right? Is there any chance of Cassandra nodes becoming corrupt somehow if I do my backups this way?

Best regards,
Marcelo Valle.
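The restore-then-repair behavior discussed in this thread can be shown with a toy model. Timestamps and values are made up; real repair compares Merkle trees and streams differing ranges, but the per-cell reconciliation is last-write-wins on the write timestamp, which is why staggered backups converge after a repair:

```python
# Toy model of "eventually consistent backups": snapshots of different
# nodes taken at different times restore a mixed-age ring, and repair
# reconciles replicas by keeping the newest timestamped value per key.
def repair(replicas):
    # each replica: {key: (timestamp, value)}; merge is last-write-wins
    merged = {}
    for rep in replicas:
        for key, (ts, val) in rep.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, val)
    return merged

node1 = {"row": (100, "old")}   # this node's backup was taken earlier
node2 = {"row": (200, "new")}   # this node's backup was taken later
print(repair([node1, node2]))   # the newer write wins on both replicas
```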
Re: Compaction issues
One more note: when we did this conversion, we were on Cassandra 1.1.x. You didn't mention which version of Cassandra you are running.

Thanks,
-Mike

On Oct 23, 2013, at 10:05 AM, Michael Theroux wrote:

When we made a similar move, for an unknown reason (I didn't hear any feedback from the list when I asked why this might be), compaction didn't start after we moved from SizeTiered to leveled compaction until I ran nodetool compact keyspace column-family-converted-to-lcs. The thread is here: http://www.mail-archive.com/user@cassandra.apache.org/msg27726.html

I've also seen other individuals on this list state that those pending compaction stats didn't move unless the node was restarted, with compaction starting to run several minutes after restart.

Thanks,
-Mike

On Oct 23, 2013, at 9:14 AM, Russ Garrett wrote:

Hi,

We have a cluster which we've recently moved to LeveledCompactionStrategy. We were experiencing some disk space issues, so we added two additional nodes temporarily to aid compaction. Once the compaction had completed on all nodes, we decommissioned the two temporary nodes.

All nodes now have a high number of pending tasks which isn't dropping; it remains approximately static. There are constantly compaction tasks running, but when they complete, the pending-tasks number doesn't drop. We've set the compaction rate limit to 0 and increased the number of compactor threads until I/O utilisation is at maximum, but neither of these has helped. Any suggestions?

Cheers,

--
Russ Garrett
r...@garrett.co.uk
Re: DELETE does not delete :)
A couple of questions:

1) How did you determine that the record is deleted on only one node? Are you looking for tombstones, or for the original entry that was inserted? Note that when an item is deleted, the original entry can still be in an SSTable somewhere, and the tombstone can be in another SSTable, until those tables are compacted together.

2) When you did the global daily check, are you sure you are not getting range ghosts? I assume they are still possible on 2.0 (http://www.datastax.com/docs/0.7/getting_started/using_cli, search for "range ghosts").

Thanks,
-Mike

On Thursday, October 17, 2013 6:36 AM, Alexander Shutyaev shuty...@gmail.com wrote:

Hi Daniel, Nate. Thanks for your answers. We have gc_grace_seconds=864000 (which is the default, I believe). We've also checked the clocks; they are synchronized.

2013/10/16 Nate McCall n...@thelastpickle.com

This is almost a guaranteed sign that the clocks are off in your cluster. If you run the select query a couple of times in a row right after deletion, do you see the data appear again?

On Wed, Oct 16, 2013 at 12:12 AM, Alexander Shutyaev shuty...@gmail.com wrote:

Hi all,

Unfortunately, we still have a problem. I've modified my code so that it explicitly sets the consistency level to QUORUM for each query. However, we found a few cases where the record is deleted on only 1 node of 3. In these cases the delete query executed OK, and the select query that we ran right after the delete returned 0 rows. Later, when we ran a global daily check, the select returned 1 row. How can that be? What could we be missing?

2013/10/7 Jon Haddad j...@jonhaddad.com

I haven't used VMWare, but it seems odd that it would lock up the ntp port. Try ps aux | grep ntp to see whether ntpd is already running.

On Oct 7, 2013, at 12:23 AM, Alexander Shutyaev shuty...@gmail.com wrote:

Hi Michał, I didn't notice your message at first... Well, this seems like a real cause candidate. I'll add an explicit consistency level QUORUM and see if that helps. Thanks

2013/10/7 Alexander Shutyaev shuty...@gmail.com

Hi Nick,

Thanks for the note! We have our Cassandra instances installed on virtual hosts in VMWare, and clock synchronization is handled by the latter, so I can't use ntpdate (it says that the NTP socket is in use). Is there any way to check whether the clocks are really synchronized? My best attempt was using three shell windows with the commands already typed, requiring only clicking on each window and hitting enter. The results varied by 100-200 msec, which I guess is just about the time I need to click and press enter :)

Thanks in advance,
Alexander

2013/10/7 Nikolay Mihaylov n...@nmmm.nu

Hi, my two cents: before doing anything else, make sure the clocks are synchronized to the millisecond. ntp will do so.

Nick.

On Mon, Oct 7, 2013 at 9:02 AM, Alexander Shutyaev shuty...@gmail.com wrote:

Hi all,

We have encountered the following problem with Cassandra:

* We use Cassandra v2.0.0 from the Datastax community repo.
* We have 3 nodes in a cluster, all of them seed providers.
* We have a single keyspace with replication factor = 3:

CREATE KEYSPACE bof WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '3' };

* We use the Datastax Java CQL Driver v1.0.3 in our application.
* We have not modified any consistency settings in our app, so I assume we have the default QUORUM (2 out of 3 in our case) consistency for reads and writes.
* We have 400+ tables, which can be divided into two groups (main and uids). All tables in a group have the same definition; they vary only by name. The sample definitions are:

CREATE TABLE bookingfile (
  key text,
  entity_created timestamp,
  entity_createdby text,
  entity_entitytype text,
  entity_modified timestamp,
  entity_modifiedby text,
  entity_status text,
  entity_uid text,
  entity_updatepolicy text,
  version_created timestamp,
  version_createdby text,
  version_data blob,
  version_dataformat text,
  version_datasource text,
  version_modified timestamp,
  version_modifiedby text,
  version_uid text,
  version_versionnotes text,
  version_versionnumber int,
  versionscount int,
  PRIMARY KEY (key)
) WITH bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

CREATE TABLE bookingfile_uids (
  date text,
  timeanduid text,
  deleted boolean,
  PRIMARY KEY (date, timeanduid)
) WITH bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
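Nate's clock-skew diagnosis can be illustrated with Cassandra's per-cell conflict resolution. This is a simplified sketch (real resolution also has tie-breaking rules, with a tombstone winning an exact timestamp tie), but it shows how a delete issued from a node whose clock lags the original writer's clock is silently lost:

```python
# Last-write-wins resolution between a written value and a tombstone for
# the same cell. Timestamps are microseconds-since-epoch in Cassandra;
# here they are just made-up integers.
def resolve(write_ts, delete_ts):
    # The tombstone only masks the value if it is at least as new.
    return "deleted" if delete_ts >= write_ts else "value survives"

print(resolve(write_ts=1000, delete_ts=1500))  # clocks in sync: deleted
print(resolve(write_ts=1000, delete_ts=900))   # deleting node's clock lags:
                                               # the "deleted" row reappears
```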
Reverse compaction on 1.1.11?
Hello,

Quick question: is there a tool that performs sstablesplit (reverse compaction) against 1.1.11 SSTables? I seem to recall a separate utility somewhere, but I'm having difficulty locating it.

Thanks,
-Mike
Issue with leveled compaction and data migration
Hello,

We've been undergoing a migration on Cassandra 1.1.9 where we are combining two column families. We are incrementally moving data from one column family into another, where the columns in a row of the source column family are appended to the columns in a row of the target column family. Both column families use leveled compaction, and both have over 100 million rows.

However, the bloom filters on the target column family have grown dramatically (though less than double) after converting less than 1/4 of the data. I assume this is because new changes are not being compacted with older changes, although I thought leveled compaction would mitigate this for me. Any advice on what we can do to control our bloom filter growth during this migration?

Appreciate the help,

Thanks,
-Mike
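A back-of-envelope calculation shows why bloom filter memory tracks row (key) count. A standard Bloom filter needs about -n·ln(p)/ln(2)² bits for n keys at false-positive rate p; Cassandra's implementation differs in detail, but the roughly linear growth in n is the point, so a migration that temporarily duplicates keys across SSTables inflates the filters until compaction merges them:

```python
import math

# Standard Bloom filter sizing: bits needed for n keys at false-positive
# probability p. Doubling the key count roughly doubles filter memory.
def bloom_bits(n_keys, fp_chance):
    return -n_keys * math.log(fp_chance) / (math.log(2) ** 2)

# 100 million rows at the fp_chance of 0.01 seen in the schemas earlier
# in this digest works out to roughly a hundred megabytes of filter.
mb = bloom_bits(100_000_000, 0.01) / 8 / 1024 / 1024
print(f"~{mb:.0f} MB of bloom filter for 100M keys at fp=0.01")
```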
Temporarily slow nodes on Cassandra
Hello, We are experiencing an issue where nodes are temporarily slow due to I/O contention, anywhere from 10 minutes to 2 hours. I don't believe this slowdown is Cassandra related, but rather due to factors outside of Cassandra. We run Cassandra 1.1.9 on a 12 node cluster, with a replication factor of 3, and all queries use LOCAL_QUORUM consistency. Our problem is (other than the contention issue, which we are working on) that when this one node slows down, the whole system's performance appears to slow down. Is there a way in Cassandra to accommodate or mitigate slower nodes? Shutting down the node in question during the period of contention does resolve the performance problem, but is there anything in Cassandra that can assist this situation while we resolve the hardware problem? Thanks, -Mike
TTL, Tombstones, and gc_grace
Hello, Quick question on Cassandra, TTLs, tombstones, and GC grace. If we have a column family whose only mechanism of deleting columns is TTLs, is repair really necessary to make tombstones consistent, and would it therefore be safe to set the gc grace period of the column family to a very low value? I ask because of this blog post based on Cassandra 0.7: http://www.datastax.com/dev/blog/whats-new-cassandra-07-expiring-columns. "The first time the expired column is compacted, it is transformed into a tombstone. This transformation frees some disk space: the size of the value of the expired column. From that moment on, the column is a normal tombstone and follows the tombstone rules: it will be totally removed by compaction (including minor ones in most cases since Cassandra 0.6.6) after GCGraceSeconds." Since these tombstones are not written using a replicated write, but are instead produced during compaction, theoretically it shouldn't be possible to lose a tombstone? Or is this blog post inaccurate for later versions of cassandra? We are using cassandra 1.1.11. Thanks, -Mike
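For what it's worth, the timeline being described can be sketched as below. This is my own illustration of the arithmetic, not Cassandra source: a TTL'd column starts behaving as a tombstone at write time plus TTL, and compaction may purge it entirely gc_grace_seconds after that.

```python
# Illustration of the TTL -> tombstone -> purge timeline (not Cassandra code).
def tombstone_time(write_ts: int, ttl: int) -> int:
    """A TTL'd column expires (starts acting as a tombstone) at write_ts + ttl."""
    return write_ts + ttl

def purgeable_time(write_ts: int, ttl: int, gc_grace: int) -> int:
    """Compaction may drop the expired column entirely gc_grace seconds later."""
    return tombstone_time(write_ts, ttl) + gc_grace

# Example: 1-day TTL with the default 10-day gc_grace.
write_ts = 1_000_000
print(purgeable_time(write_ts, ttl=86_400, gc_grace=864_000))  # 1950400
```

With a very low gc_grace, the window between expiry and purge shrinks accordingly, which is exactly the trade-off the question is probing.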
Re: Deletion use more space.
The only time information is removed from the filesystem is during compaction. Compaction can remove tombstones after gc_grace_seconds, which could result in reanimation of deleted data if the tombstone was never properly replicated to other replicas. Repair will make sure tombstones are consistent amongst replicas. However, tombstones cannot be removed if the data the tombstone is deleting is in another SSTable and has not yet been removed. Hope this helps, -Mike On Jul 16, 2013, at 10:04 AM, Andrew Bialecki wrote: I don't think setting gc_grace_seconds to an hour is going to do what you'd expect. After gc_grace_seconds, if you haven't run a repair within that hour, the data you deleted will seem to have been undeleted. Someone correct me if I'm wrong, but in order to completely delete data and regain the space it takes up, you need to delete it, which creates tombstones, and then run a repair on that column family within gc_grace_seconds. After that the data is actually gone and the space reclaimed. On Tue, Jul 16, 2013 at 6:20 AM, 杨辉强 huiqiangy...@yunrang.com wrote: Thank you! It should be update column family ScheduleInfoCF with gc_grace = 3600; Faint. - Original Message - From: 杨辉强 huiqiangy...@yunrang.com To: user@cassandra.apache.org Sent: Tuesday, July 16, 2013, 6:15:12 PM Subject: Re: Deletion use more space. Hi, I use the following cmd to update gc_grace_seconds. It reports an error! Why? [default@WebSearch] update column family ScheduleInfoCF with gc_grace_seconds = 3600; java.lang.IllegalArgumentException: No enum const class org.apache.cassandra.cli.CliClient$ColumnFamilyArgument.GC_GRACE_SECONDS - Original Message - From: Michał Michalski mich...@opera.com To: user@cassandra.apache.org Sent: Tuesday, July 16, 2013, 5:51:49 PM Subject: Re: Deletion use more space. Deletion is not really removing data; it's adding tombstones (markers) of deletion. 
They'll later be merged with existing data during compaction and, in the end (see: gc_grace_seconds), removed, but until then they'll take some space. http://wiki.apache.org/cassandra/DistributedDeletes M. On 16.07.2013 11:46, 杨辉强 wrote: Hi, all: I use cassandra 1.2.4 and I have a 4 node ring using the byte order partitioner. I had inserted about 200G of data into the ring over the previous days. Today I wrote a program to scan the ring and delete the items as they are scanned. To my surprise, cassandra used more disk space. Can anybody tell me why? Thanks.
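A toy model of why the delete pass grew disk usage (my own sketch with made-up sizes, not Cassandra internals): each delete appends a tombstone marker alongside the existing data, so usage rises until a compaction running after gc_grace_seconds can merge both away.

```python
# Toy model: deletes add tombstone bytes; compaction past gc_grace reclaims
# both the tombstones and the data they shadow. Sizes are illustrative only.
TOMBSTONE_SIZE = 20  # assumed per-marker overhead, bytes

def space_after_delete(data_bytes: int, n_deleted: int) -> int:
    # Deleting does not remove data; it appends one tombstone per deleted item.
    return data_bytes + n_deleted * TOMBSTONE_SIZE

def space_after_compaction(data_bytes: int, deleted_bytes: int, n_deleted: int,
                           delete_ts: int, now: int, gc_grace: int) -> int:
    # Compaction drops a tombstone (and the data it shadows) only once
    # gc_grace seconds have elapsed since the delete.
    if now - delete_ts >= gc_grace:
        return data_bytes - deleted_bytes
    return data_bytes + n_deleted * TOMBSTONE_SIZE

print(space_after_delete(200, 10))  # 400 -- more space right after deleting
```

The 200G-then-grows behavior in the question matches the first function; the reclaim only happens on the second path, after gc_grace_seconds.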
Re: Alternate major compaction
Information is only deleted from Cassandra during a compaction. Using SizeTieredCompaction, compaction only occurs when a number of similarly sized sstables are combined into a new sstable. When you perform a major compaction, all sstables are combined into one, very large, sstable. As a result, any tombstoned data in that large sstable will only be removed once a number of very large sstables exist. This means tombstoned data may be trapped in that sstable for a very long time (or indefinitely, depending on your use case). -Mike On Jul 11, 2013, at 9:31 AM, Brian Tarbox wrote: Perhaps I should already know this, but why is running a major compaction considered so bad? We're running 1.1.6. Thanks. On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato ts...@cloudian.com wrote: Hi, I think it is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause; there are more. For example, I see two typical scenarios: 1. backup use case 2. active wide row In case 1, say a piece of data is removed a year later. This means the tombstone on the row is one year away from the original row. To remove an expired row entirely, a compaction set has to include all the row fragments. So when are the original, one-year-old row and the tombstoned row included in the same compaction set? It is likely to take one year. In case 2, such an active wide row exists in most sstable files, and it typically contains many expired columns. But none of them would be removed entirely, because a compaction set practically never includes all the row fragments. Btw, there is a very convenient MBean API available: CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. To that end, I wrote a tool that checks garbage and prints out some useful information for finding such an optimal set. Here's a simple log output. 
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
=== ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES ===
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
--- TOTAL, 40, 40 ===
REMAINNING_SSTABLE_FILES means any other sstable files that contain the respective row. So the following is an optimal set.
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
=== ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES ===
hello5/100.txt.1373502926003, 223, 0, YES, YES
--- TOTAL, 223, 0 ===
This tool relies on SSTableReader and an aggregation iterator just as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Ah, note that it is based on 1.0.7, so I will need to check and update it for newer versions. Thanks, Takenori On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com wrote: Hi, About a year ago we did a major compaction in our cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTieredCompaction strategy, and we've not yet evaluated LeveledCompaction, because it has its downsides and we've had no time to test all of them in our environment). I was trying to find a way out of this situation (that is, something like a major compaction that writes small sstables, not huge ones as major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states). 
Then I tried deleting all data in a node and then bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned from deleted records and updates. But the deleted node just copied the
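The size-tiered behavior described in this thread (one giant post-major-compaction sstable that never finds similarly sized peers, trapping its tombstones) can be sketched roughly as below. The bucketing rule and thresholds are simplified assumptions for illustration, not the exact Cassandra algorithm:

```python
# Simplified size-tiered bucketing: sstables within 50% of a bucket's average
# size share a bucket; a bucket is eligible for compaction once it holds
# min_threshold tables. Sizes are arbitrary units.
def buckets(sizes, ratio=0.5, min_threshold=4):
    groups = []
    for size in sorted(sizes):
        for g in groups:
            avg = sum(g) / len(g)
            if avg * (1 - ratio) <= size <= avg * (1 + ratio):
                g.append(size)
                break
        else:
            groups.append([size])
    # Only buckets with enough members ever get compacted.
    return [g for g in groups if len(g) >= min_threshold]

# One huge sstable from a major compaction never gets 3 similarly sized peers,
# so it is never revisited and its tombstones are never purged:
print(buckets([10, 11, 12, 13, 5000]))  # [[10, 11, 12, 13]]
```

This is why the thread keeps circling back to forceUserDefinedCompaction: it lets you hand-pick the sstable set instead of waiting for a bucket that may never form.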
Repair of tombstones
There has been a lot of discussion on the list recently concerning issues with repair, runtime, etc. We recently have had issues with this cassandra bug: https://issues.apache.org/jira/browse/CASSANDRA-4905 Basically, if you do regular staggered repairs, and you have tombstones that can be gc_graced, those tombstones may never be cleaned up if those tombstones don't get compacted away before the next repair. This is because these tombstones are essentially recopied to other nodes during the next repair. This has been fixed in 1.2, however, we aren't ready to make the jump to 1.2 yet. Is there a reason why this hasn't been back-ported to 1.1? Is it a risky change? Although not a silver bullet, it seems it may help a lot of people with repair issues (certainly seems it would help us), -Mike
CQL Clarification
Hello, Just wondering if I can get a quick clarification on some simple CQL. We utilize Thrift CQL queries to access our cassandra setup. As clarified in a previous question I had, when using CQL and Thrift, timestamps on the cassandra column data are assigned by the server, not the client, unless AND TIMESTAMP is utilized in the query, for example: http://www.datastax.com/docs/1.0/references/cql/UPDATE According to the Datastax documentation, this timestamp should be: "Values serialized with the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT." However, my testing showed that updates didn't work when I used a timestamp of this format. Looking at the Cassandra code, it appears that cassandra will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not specified, which would be the number of microseconds since the standard base time. In my test environment, setting the timestamp to the current time in milliseconds * 1000 seems to work. It seems that if you have an older installation without TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp should be multiplied by 1000. Just making sure I'm reading everything properly... improperly setting the timestamp could cause us some serious damage. Thanks, -Mike
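A sketch of the arithmetic being described (my reading of the thread, worth verifying against your Cassandra version): the server-assigned default is microseconds since the epoch, so a client supplying its own TIMESTAMP must multiply its millisecond clock by 1,000 or its writes will lose to any server-assigned timestamp.

```python
import time

def cql_timestamp_micros(millis: int) -> int:
    # Matches System.currentTimeMillis() * 1000: microseconds since the epoch.
    return millis * 1000

# A client that sends raw milliseconds produces a value ~1000x smaller than
# a server-assigned timestamp at the same wall-clock instant, so its update
# appears to silently "not work" -- it is simply older in timestamp terms.
now_millis = int(time.time() * 1000)
assert cql_timestamp_micros(now_millis) > now_millis
print(cql_timestamp_micros(1_367_000_000_000))  # 1367000000000000
```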
Re: Really odd issue (AWS related?)
Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the cassandra node runs fine. Then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up, and stays up indefinitely. Even non-cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Previous to an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity. - Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue. - Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal. One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances, we had the same setup, although it was only using 2 ephemeral drives. We chose to use LVM vs. mdadm because we were having issues getting mdadm to create the raid volume reliably on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade). For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data
mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... 
thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones in the same region have yet to show an issue. It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated. Thanks, -Mike On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote: Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Re: Really odd issue (AWS related?)
I forgot to mention: when things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted cassandra once when a node was in this situation, and it took 45 minutes to start (primarily reading SSTables). Typically, a node would start in about 5 minutes. Thanks, -Mike On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote: Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the cassandra node runs fine. Then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up, and stays up indefinitely. Even non-cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Previous to an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity. - Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue. - Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal. One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances, we had the same setup, although it was only using 2 ephemeral drives. We chose to use LVM vs. mdadm because we were having issues getting mdadm to create the raid volume reliably on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade). 
For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data
mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones, in the same region, have yet to show an issue. It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated. Thanks, -Mike On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote: Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? 
If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Re: CQL Clarification
Yes, that does help. So, in the link I provided: http://www.datastax.com/docs/1.0/references/cql/UPDATE it states: "You can specify these options: Consistency level, Time-to-live (TTL), Timestamp for the written columns," where timestamp is a link to "Working with dates and times" and mentions the 64-bit millisecond value. Is that incorrect? -Mike On Apr 28, 2013, at 11:42 AM, Michael Theroux wrote: Hello, Just wondering if I can get a quick clarification on some simple CQL. We utilize Thrift CQL queries to access our cassandra setup. As clarified in a previous question I had, when using CQL and Thrift, timestamps on the cassandra column data are assigned by the server, not the client, unless AND TIMESTAMP is utilized in the query, for example: http://www.datastax.com/docs/1.0/references/cql/UPDATE According to the Datastax documentation, this timestamp should be: "Values serialized with the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT." However, my testing showed that updates didn't work when I used a timestamp of this format. Looking at the Cassandra code, it appears that cassandra will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not specified, which would be the number of microseconds since the standard base time. In my test environment, setting the timestamp to the current time in milliseconds * 1000 seems to work. It seems that if you have an older installation without TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp should be multiplied by 1000. Just making sure I'm reading everything properly... improperly setting the timestamp could cause us some serious damage. Thanks, -Mike
Re: Really odd issue (AWS related?)
Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Re: Really odd issue (AWS related?)
Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Really odd issue (AWS related?)
Hello, Since Sunday, we've been experiencing a really odd issue in our Cassandra cluster. We recently started receiving errors that messages are being dropped. But here is the odd part... When looking in the AWS console, instead of seeing statistics being elevated during this time, we actually see all statistics suddenly drop right before these messages appear. CPU, I/O, and network go way down. In fact, in one case, they went to 0 for about 5 minutes to the point that other cassandra nodes saw this specific node in question as being down. The messages appear right after the node wakes up. We've had this happen on 3 different nodes on three different days since Sunday. Other facts: - We recently upgraded from m1.large to m1.xlarge instances about two weeks ago. - We are running Cassandra 1.1.9 - We've been doing some memory tuning, although I have seen this happen on untuned nodes. Has anyone seen anything like this before? Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? Thanks for your help, -Mike
Re: Advice on memory warning
[ScheduledTasks:1] 2013-04-23 16:40:30,845 StatusLogger.java (line 112) system.batchlog 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,845 StatusLogger.java (line 112) system.NodeIdInfo 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.LocationInfo 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.Schema 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.Migrations 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.schema_keyspaces 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.schema_columns 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,846 StatusLogger.java (line 112) system.schema_columnfamilies 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) system.IndexInfo 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) system.range_xfers 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) system.peer_events 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) system.hints 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,847 StatusLogger.java (line 112) system.HintsColumnFamily 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) x.foo 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) x.foo2 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) x.foo3 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) x.foo4 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,848 StatusLogger.java (line 112) x.foo5 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) x.foo6 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) x.foo7 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) 
system_auth.users 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) system_traces.sessions 0,0
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,849 StatusLogger.java (line 112) system_traces.events 0,0
WARN [ScheduledTasks:1] 2013-04-23 16:40:30,850 GCInspector.java (line 142) Heap is 0.824762725573964 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
INFO [ScheduledTasks:1] 2013-04-23 16:40:30,850 StorageService.java (line 3537) Unable to reduce heap usage since there are no dirty column families
On 23 April 2013 16:52, Ralph Goers ralph.go...@dslextreme.com wrote: We are using DSE, which I believe is also 1.1.9. We have basically had a non-usable cluster for months due to this error. In our case, once it starts doing this, it starts flushing sstables to disk and eventually fills up the disk to the point where it can't compact. If we catch it soon enough and restart the node, it usually can recover. In our case, the heap size is 12 GB. As I understand it, Cassandra will give 1/3 of that to sstables. I then noticed that we have one column family that is using nearly 4GB in bloom filters on each node. Since the nodes will start doing this when the heap reaches 9GB, we essentially have only 1GB of free memory, so when compactions, cleanups, etc. take place, this situation starts happening. We are working to change our data model to try to resolve this. Ralph On Apr 19, 2013, at 8:00 AM, Michael Theroux wrote: Hello, We've recently upgraded from m1.large to m1.xlarge instances on AWS to handle additional load, but to also relieve memory pressure. 
It appears to have accomplished both, however, we are still getting a warning, 0-3 times a day, on our database nodes: WARN [ScheduledTasks:1] 2013-04-19 14:17:46,532 GCInspector.java (line 145) Heap is 0.7529240824406468 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically This is happening much less frequently than before the upgrade, but after essentially
Re: Moving cluster
I believe the two solutions being referred to are the lift-and-shift approach vs. upgrading by replacing a node and letting it restore from the cluster. I don't think there are any more risks per se with upgrading by replacing, as long as you can make sure your new node is configured properly. One might choose lift-and-shift in order to have a node down for less time (depending on your individual situation), or to have less of an impact on the cluster, as replacing a node results in other nodes streaming their data to the newly replaced node. Depending on your dataset, this could take quite some time. All this also assumes, of course, that you are replicating your data such that the new node can retrieve the information it is responsible for from the other nodes. Thanks, -Mike On Apr 21, 2013, at 4:18 PM, aaron morton wrote: Sorry, I do not understand your question. What are the two solutions ? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 20/04/2013, at 3:43 AM, Kais Ahmed k...@neteck-fr.com wrote: Hello and thank you for your answers. The first solution is much easier for me because I use vnodes. What is the risk of the first solution? Thank you, 2013/4/18 aaron morton aa...@thelastpickle.com This is roughly the lift and shift process I use. Note that disabling thrift and gossip does not stop an existing repair session. So I often drain and then shutdown, and copy the live data dir rather than a snapshot dir. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 19/04/2013, at 4:10 AM, Michael Theroux mthero...@yahoo.com wrote: This should work. Another option is to follow a process similar to what we recently did. We recently and successfully upgraded 12 instances from large to xlarge instances in AWS. 
I chose not to replace nodes, as restoring data from the ring would have taken significant time and put the cluster under additional load. I also wanted to eliminate the possibility that any issues on the new nodes could be blamed on new configuration/operating system differences. Instead we followed the procedure below (removing some details that would likely be unique to our infrastructure). For a node being upgraded:

1) nodetool disablethrift
2) nodetool disablegossip
3) Snapshot the data (nodetool snapshot ...)
4) Back up the snapshot data to EBS (assuming you are on ephemeral)
5) Stop cassandra
6) Move the cassandra.yaml configuration file to cassandra.yaml.bak (to prevent any future restart from bringing cassandra back up)
7) Shut down the instance
8) Take an AMI of the instance
9) Start a new instance from the AMI with the desired hardware
10) If you assign the new instance a new IP address, make sure any entries in /etc/hosts, or the broadcast_address in cassandra.yaml, are updated
11) Attach the volume holding your snapshot backup to the new instance and mount it
12) Restore the snapshot data
13) Restore the cassandra.yaml file
14) Restart cassandra

- I recommend practicing this on a test cluster first. - As you replace nodes with new IP addresses, eventually all your seeds will need to be updated. This is not a big deal until all your seed nodes have been replaced. - Don't forget about NTP! Make sure it is running on all your new nodes. Myself, to be extra careful, I actually deleted the NTP drift file and let NTP recalculate it, because it's a new instance, and it took over an hour to restore our snapshot data... but that may have been overkill. 
- If you have the opportunity, depending on your situation, increase max_hint_window_in_ms
- Your details may vary

Thanks, -Mike On Apr 18, 2013, at 11:07 AM, Alain RODRIGUEZ wrote: I would say add your 3 servers at the 3 tokens where you want them, let's say: { 0: { 0: 0, 1: 56713727820156410577229101238628035242, 2: 113427455640312821154458202477256070485 } }, or these tokens -1 or +1 if those tokens are already in use. And then just decommission the xLarge nodes. You should be good to go. 2013/4/18 Kais Ahmed k...@neteck-fr.com Hi, What is the best practice to move from a cluster of 7 nodes (m1.xlarge) to 3 nodes (hi1.4xlarge)? Thanks,
Advice on memory warning
Hello, We've recently upgraded from m1.large to m1.xlarge instances on AWS to handle additional load, but also to relieve memory pressure. It appears to have accomplished both; however, we are still getting a warning, 0-3 times a day, on our database nodes: WARN [ScheduledTasks:1] 2013-04-19 14:17:46,532 GCInspector.java (line 145) Heap is 0.7529240824406468 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically. This is happening much less frequently than before the upgrade, but after essentially doubling the amount of available memory, I'm curious what I can do to determine what is happening during this time. I am collecting all the JMX statistics. Memtable space is elevated but not extraordinarily high. No GC messages are being output to the log. These warnings do seem to occur during compactions of column families using LCS with wide rows, but I'm not sure there is a direct correlation. We are running Cassandra 1.1.9, with a maximum heap of 8G. Any advice? Thanks, -Mike
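For context, the warning above comes from a simple heap-usage heuristic. The sketch below is a simplified model of that check (not Cassandra's actual GCInspector code; the function name and parameters are illustrative), using the default flush_largest_memtables_at of 0.75 and the fraction reported in the log:

```python
# Simplified sketch of the heuristic behind the warning: when heap usage
# crosses flush_largest_memtables_at (default 0.75), Cassandra flushes its
# largest memtables to free heap. Illustrative only, not the real code.

def should_emergency_flush(heap_used_gb, heap_max_gb, flush_largest_memtables_at=0.75):
    """Return the heap fraction and whether an emergency flush would trigger."""
    fraction = heap_used_gb / heap_max_gb
    return fraction, fraction > flush_largest_memtables_at

# The log line reported a fraction of ~0.753 on this 8 GB heap:
fraction, flush = should_emergency_flush(0.7529240824406468 * 8, 8)
```

The warning fires as soon as the fraction crosses the threshold, so a usage of 0.753 triggers it even though the heap is nowhere near exhausted; raising the threshold in cassandra.yaml only changes when the flush kicks in, not the underlying memory pressure.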
Re: CQL
A lot more detail on your use case and requirements would help. You need to make specific considerations in Cassandra when you have requirements around ordering. Ordering can be achieved across columns. Ordering across rows is a bit more tricky and may require the use of specific partitioners... I did a real quick Google search on Cassandra ordering and found some good links (http://ayogo.com/blog/sorting-in-cassandra/). To do queries with ordering across columns, using the FIRST keyword will return results based on the comparator you defined in your schema: http://cassandra.apache.org/doc/cql/CQL.html#SELECT Hope this helps, -Mike On Apr 19, 2013, at 7:32 AM, Sri Ramya wrote: hi, I am working with CQL. I want to perform a query based on timestamp. Can anyone help me out with how to get dates greater than or less than a given timestamp in Cassandra?
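The comparator point above matters for timestamp range queries: with a text-style comparator, column names sort lexically, not in time order. A minimal illustration (plain Python sorting standing in for the column comparator; not Cassandra code):

```python
# Why the column comparator matters for "greater/less than a timestamp" slices:
# a UTF8/ASCII-style comparator sorts column names as strings, while a
# LongType-style comparator sorts them numerically (true time order).

timestamps = [9, 10, 100, 2]

lexical = sorted(str(t) for t in timestamps)  # string comparator: "10" < "9"
numeric = sorted(timestamps)                  # LongType-style comparator
```

With the lexical ordering, a slice "from 9 onward" would wrongly skip 10 and 100, which is why timestamp columns should use a numeric (or TimeUUID) comparator before relying on range slices.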
Re: Moving cluster
This should work. Another option is to follow a process similar to what we recently did. We recently and successfully upgraded 12 instances from large to xlarge instances in AWS. I chose not to replace nodes, as restoring data from the ring would have taken significant time and put the cluster under additional load. I also wanted to eliminate the possibility that any issues on the new nodes could be blamed on configuration/operating-system differences. Instead we followed the procedure below (removing some details that would likely be unique to our infrastructure). For a node being upgraded:

1) nodetool disablethrift
2) nodetool disablegossip
3) Snapshot the data (nodetool snapshot ...)
4) Back up the snapshot data to EBS (assuming you are on ephemeral storage)
5) Stop cassandra
6) Move the cassandra.yaml configuration file to cassandra.yaml.bak (to prevent any future instance restart from starting cassandra)
7) Shut down the instance
8) Take an AMI of the instance
9) Start a new instance from the AMI with the desired hardware
10) If you assign the new instance a new IP address, make sure any entries in /etc/hosts, or the broadcast_address in cassandra.yaml, are updated
11) Attach the volume holding your snapshot backup to the new instance and mount it
12) Restore the snapshot data
13) Restore the cassandra.yaml file
14) Restart cassandra

- I recommend practicing this on a test cluster first
- As you replace nodes with new IP addresses, eventually all your seeds will need to be updated. This is not a big deal until all your seed nodes have been replaced.
- Don't forget about NTP! Make sure it is running on all your new nodes. Myself, to be extra careful, I actually deleted the NTP drift file and let NTP recalculate it, because it's a new instance and it took over an hour to restore our snapshot data... but that may have been overkill.
- If you have the opportunity, depending on your situation, increase max_hint_window_in_ms
- Your details may vary

Thanks, -Mike On Apr 18, 2013, at 11:07 AM, Alain RODRIGUEZ wrote: I would say add your 3 servers at the 3 tokens where you want them, let's say: { 0: { 0: 0, 1: 56713727820156410577229101238628035242, 2: 113427455640312821154458202477256070485 } }, or these tokens -1 or +1 if those tokens are already in use. And then just decommission the xLarge nodes. You should be good to go. 2013/4/18 Kais Ahmed k...@neteck-fr.com Hi, What is the best practice to move from a cluster of 7 nodes (m1.xlarge) to 3 nodes (hi1.4xlarge)? Thanks,
Timestamps and CQL
Hello, We are having an odd, sporadic issue that I believe may be due to time synchronization. Without going into details on the issue right now, a quick question: from the documentation I see numerous references that Cassandra utilizes timestamps generated by the clients to determine write serialization. However, going through the code, it appears that for CQL over thrift, it will only use client-generated timestamps if USING TIMESTAMP is utilized in the CQL statement; otherwise it will use a server-generated timestamp. I was looking through the 1.1.2 code. Am I reading this correctly? -Mike
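Why this matters: Cassandra resolves conflicting writes to the same cell by timestamp, so whoever supplies the timestamp (client via USING TIMESTAMP, or server otherwise) decides the ordering. A minimal last-write-wins sketch (illustrative helper names, not Cassandra code; CQL timestamps are conventionally microseconds since the epoch):

```python
import time

def client_timestamp():
    """Generate a client-side write timestamp in microseconds since the epoch,
    suitable for a CQL 'USING TIMESTAMP <ts>' clause."""
    return int(time.time() * 1_000_000)

def winner(write_a, write_b):
    """Last-write-wins resolution on (timestamp, value) pairs:
    the write with the higher timestamp survives."""
    return max(write_a, write_b)

ts = client_timestamp()
```

If the server generates timestamps and server clocks are skewed, two writes issued in one order by the application can be serialized in the opposite order, which is exactly the kind of sporadic issue described above.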
Re: 13k pending compaction tasks but ZERO running?
Hi Dean, I saw the same behavior when we switched from STCS to LCS on a couple of our tables. Not sure why it doesn't proceed immediately (I pinged the list, but didn't get any feedback). However, running nodetool compact keyspace table got things moving for me. -Mike On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote: How do I get my node to run through the 13k pending compaction tasks? I had to use iptables to take the node out of the ring for now, and it is my only node still on STCS. In cassandra-cli it shows LCS, but on disk I see a 36-gig file (i.e. must be STCS still). How can I get the 13k pending tasks to start running? Nodetool compactionstats: pending tasks: 13793 Active compaction remaining time: n/a Thanks, Dean
Re: 13k pending compaction tasks but ZERO running?
One more warning (which I'm sure you know, but in case others see this): nodetool compact does a major compaction for STCS, and is, in general, not recommended for STCS. I only ran it on the tables we've converted to LCS. -Mike On Mar 14, 2013, at 11:26 AM, Michael Theroux wrote: Hi Dean, I saw the same behavior when we switched from STCS to LCS on a couple of our tables. Not sure why it doesn't proceed immediately (I pinged the list, but didn't get any feedback). However, running nodetool compact keyspace table got things moving for me. -Mike On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote: How do I get my node to run through the 13k pending compaction tasks? I had to use iptables to take the node out of the ring for now, and it is my only node still on STCS. In cassandra-cli it shows LCS, but on disk I see a 36-gig file (i.e. must be STCS still). How can I get the 13k pending tasks to start running? Nodetool compactionstats: pending tasks: 13793 Active compaction remaining time: n/a Thanks, Dean
Re: About the heap
Hi Aaron, If you have the chance, could you expand on why m1.xlarge is the much better choice? We will soon need to choose between expanding from a 12-node to a 24-node cluster using m1.large instances, vs. upgrading all instances to m1.xlarge, and the justifications would be helpful (although "Aaron says so" does help ;) ). One obvious reason is that administering a 24-node cluster adds person-time overhead. Another is less impact from maintenance activities such as repair, as these activities have significant CPU overhead. Doubling the cluster size would, in theory, halve the time for this overhead, but would still impact performance during that time. Going to xlarge would lessen the impact of these activities on operations. Anything else? Thanks, -Mike On Mar 14, 2013, at 9:27 AM, aaron morton wrote: Because of this I have an unstable cluster and have no other choice than to use Amazon EC2 xLarge instances when we would rather use twice as many EC2 Large nodes. m1.xlarge is a MUCH better choice than m1.large. You get more RAM, better IO, and less steal. Using half as many m1.xlarge is the way to go. My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the max 8 GB (crashing the node). How is it crashing? Are you getting too much GC or running OOM? Are you using the default GC configuration? Is Cassandra logging a lot of GC warnings? If you are running OOM then something has to change. Maybe bloom filters, maybe caches. Enable GC logging in cassandra-env.sh to check how low a CMS compaction gets the heap, or use some other tool. That will give an idea of how much memory you are using. Here is some background on what is kept on heap in pre-1.2: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 13/03/2013, at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote: Here is the JIRA I submitted regarding the ancestors.
https://issues.apache.org/jira/browse/CASSANDRA-5342 -Wei - Original Message - From: Wei Zhu wz1...@yahoo.com To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:35:29 AM Subject: Re: About the heap Hi Dean, The index_interval controls the sampling of the SSTable index used to speed up the lookup of keys in the SSTable. Here is the code: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478 Increasing the interval means taking fewer samples: less memory, but slower key lookups on read. I did do a heap dump on my production system, which caused about a 10-second pause of the node. I found something interesting: for LCS, a compaction can involve thousands of SSTables, and the ancestors are recorded in case something goes wrong during the compaction. But those are never removed after the compaction is done. In our case, it takes about 1G of heap memory to store that. I am going to submit a JIRA for that. Here is the culprit: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58 Enjoy looking at Cassandra code :) -Wei - Original Message - From: Dean Hiller dean.hil...@nrel.gov To: user@cassandra.apache.org Sent: Wednesday, March 13, 2013 11:11:14 AM Subject: Re: About the heap Going to 1.2.2 helped us quite a bit, as did turning on LCS from STCS, which gave us smaller bloom filters. As far as key cache: there is an entry in cassandra.yaml called index_interval, set to 128. I am not sure if that is related to key_cache; I think it is. By turning that to 512 or maybe even 1024, you will consume less RAM there as well, though I ran this test in QA and my key cache size stayed the same, so I am really not sure (I am actually checking out Cassandra code now to dig a little deeper into this property).
Dean From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, March 13, 2013 10:11 AM To: user@cassandra.apache.org Subject: About the heap Hi, I would like to know everything that is in the heap. We are here speaking of C* 1.1.6. Theory: - Memtable (1024 MB) - Key Cache (100 MB) - Row Cache (disabled, and serialized with JNA activated anyway, so should be off-heap) - Bloom Filters (about 1.03 GB - from cfstats, adding up all the Bloom Filter Space Used and considering they are shown in bytes - 1103765112) - Anything else? So my heap should be fluctuating between 1.15 GB and 2.15 GB and growing slowly (from the new BF of my new data). My heap is actually changing from 3-4
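Alain's heap arithmetic can be checked directly. The figures below are copied from his message; this is a rough back-of-the-envelope check, not a complete model of what lives on a pre-1.2 heap (GC overhead, compaction buffers, and index samples are not counted):

```python
# Summing the on-heap components Alain lists: key cache + bloom filters form
# the steady floor; a full memtable space adds up to another 1 GB on top.
MB, GB = 1024 ** 2, 1024 ** 3

key_cache = 100 * MB
bloom_filters = 1_103_765_112   # bytes, summed from cfstats
memtable_space = 1024 * MB

low = key_cache + bloom_filters     # heap floor, ~1.13 GB
high = low + memtable_space         # with memtables full, ~2.13 GB
```

The sum lands close to the 1.15-2.15 GB band Alain expected, which is why an observed 3-4 GB (let alone 8 GB) heap points at something unaccounted for, such as the LCS ancestor retention Wei tracked down in CASSANDRA-5342.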
Re: Bloom filters and LCS
I think my impression that Bloom filters were off in 1.1.9 was a misinterpretation of this thread: http://www.mail-archive.com/user@cassandra.apache.org/msg27787.html and this bug: https://issues.apache.org/jira/browse/CASSANDRA-5029 I read it as Bloom filters being re-enabled for LCS in 1.2.2, after apparently being shut off in an earlier version of 1.2? -Mike On Mar 7, 2013, at 4:48 PM, Edward Capriolo wrote: I read that the change was made because Cassandra does not work well when they are off. This makes sense because Cassandra uses bloom filters to decide if a row can be deleted without major compaction. However, since LCS does not major compact, without bloom filters you can end up in cases where rows never get deleted. Edward On Thu, Mar 7, 2013 at 4:30 PM, Wei Zhu wz1...@yahoo.com wrote: Where did you read that bloom filters are off for LCS on 1.1.9? These are the two issues I can find regarding this matter: https://issues.apache.org/jira/browse/CASSANDRA-4876 https://issues.apache.org/jira/browse/CASSANDRA-5029 Looks like in 1.2 it defaults to 0.1; not sure about 1.1.X. -Wei - Original Message - From: Michael Theroux mthero...@yahoo.com To: user@cassandra.apache.org Sent: Thursday, March 7, 2013 1:18:38 PM Subject: Bloom filters and LCS Hello, (Hopefully) Quick question. We are running Cassandra 1.1.9. I recently converted some tables from size-tiered to leveled compaction. The amount of space for Bloom filters on these tables went down tremendously (which is expected; LCS in 1.1.9 does not use bloom filters). However, although it's far less, it's still using a number of megabytes. Why is it not zero?

Column Family:
SSTable count: 526
Space used (live): 7251063348
Space used (total): 7251063348
Number of Keys (estimate): 23895552
Memtable Columns Count: 45719
Memtable Data Size: 21207173
Memtable Switch Count: 579
Read Count: 21773431
Read Latency: 4.155 ms.
Write Count: 16183367
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Positives: 2442
Bloom Filter False Ratio: 0.00245
Bloom Filter Space Used: 44674656
Compacted row minimum size: 73
Compacted row maximum size: 105778
Compacted row mean size: 1104

Thanks, -Mike
Re: Size Tiered - Leveled Compaction
I've asked this myself in the past... we fairly arbitrarily chose 10MB based on Wei's experience. -Mike On Mar 8, 2013, at 1:50 PM, Hiller, Dean wrote: +1 (I would love to know this info). Dean From: Wei Zhu wz1...@yahoo.com Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com Date: Friday, March 8, 2013 11:11 AM To: user@cassandra.apache.org Subject: Re: Size Tiered - Leveled Compaction I have the same wonder. We started with the default 5M, and the compaction after repair took too long on a 200G node, so we increased the size to 10M, sort of arbitrarily, since there is not much documentation around it. Our tech-op team still thinks there are too many files in one directory. To fulfill their guidelines (I don't remember the exact number, but something in the range of 50K files), we would need to increase the size to around 50M. I think the latency of opening one file is not impacted much by the number of files in one directory on a modern file system, but ls and other operations suffer. Anyway, I asked about the side effects of a bigger SSTable in IRC; someone mentioned that during a read, C* reads the whole SSTable from disk in order to access the row, which causes more disk IO compared with a smaller SSTable. I don't know enough about the internals of Cassandra to be sure whether that's the case or not. If that is the case (question mark), is the SSTable or the row kept in memory? Hope someone can confirm the theory here, or I will have to dig into the source code to find out. Another concern is during repair: does it stream the whole SSTable or only part of it when a mismatch is detected? I have seen claims of both; can someone please confirm? The last thing is the effectiveness of the parallel LCS in 1.2.
It takes quite some time for compaction to finish after repair with LCS on 1.1.X. Both CPU and disk utilization are low during the compaction, which means LCS doesn't fully utilize resources. It will make life easier if the issue is addressed in 1.2. Bottom line: there is not much documentation, guidance, or success-story material around LCS, although it sounds beautiful on paper. Thanks. -Wei From: Alain RODRIGUEZ arodr...@gmail.com To: user@cassandra.apache.org Cc: Wei Zhu wz1...@yahoo.com Sent: Friday, March 8, 2013 1:25 AM Subject: Re: Size Tiered - Leveled Compaction I'm still wondering about how to choose the size of the SSTable under LCS. Default is 5MB, people tend to configure it to 10MB, and now you configure it at 128MB. What are the benefits or inconveniences of a very small size (let's say 5 MB) vs. a big size (like 128MB)? Alain 2013/3/8 Al Tobey a...@ooyala.com We saw exactly the same thing as Wei Zhu: 100k tables in a directory causing all kinds of issues. We're running 128MiB SSTables with LCS and have disabled compaction throttling. 128MiB was chosen to get file counts under control and reduce the number of files C* has to manage and search. I just looked, and a ~250GiB node is using about 10,000 files, which is quite manageable. This configuration is running smoothly in production under mixed read/write load. We're on RAID0 across 6 15k drives per machine. When we migrated data to this cluster we were pushing well over 26k/s+ inserts with CL_QUORUM. With compaction throttling enabled at any rate, it just couldn't keep up. With throttling off, it runs smoothly and does not appear to have an impact on our applications, so we always leave it off, even in EC2. An 8GiB heap is too small for this config on 1.1. YMMV. -Al Tobey On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu wz1...@yahoo.com wrote: I haven't tried to switch compaction strategy.
We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads can go to more SSTables; we saw 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single-threaded, and it's not recommended to turn on multithreaded compaction. We even tried that; it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :)
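The file-count side of the SSTable-size decision discussed in this thread is simple arithmetic. A quick sketch (the 4-files-per-SSTable figure comes from Wei's message and reflects the Data/Index/Filter/Statistics components on 1.1; treat it as an approximation):

```python
# Rough file-count arithmetic for choosing sstable_size_in_mb under LCS:
# number of SSTables = data size / SSTable size, with ~4 on-disk files each.

def files_on_disk(data_gb, sstable_mb, components=4):
    """Estimate (sstable_count, file_count) for a node of data_gb gigabytes."""
    sstables = (data_gb * 1024) // sstable_mb
    return sstables, sstables * components

tables_5mb, files_5mb = files_on_disk(200, 5)       # Wei's 200G node, 5MB default
tables_128mb, files_128mb = files_on_disk(250, 128) # Al's ~250GiB node, 128MiB
```

This reproduces both data points in the thread: roughly 40K SSTables and 160K files at the 5MB default on a 200G node, versus on the order of 10K files at 128MiB on a 250GiB node.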
Bloom filters and LCS
Hello, (Hopefully) Quick question. We are running Cassandra 1.1.9. I recently converted some tables from size-tiered to leveled compaction. The amount of space for Bloom filters on these tables went down tremendously (which is expected; LCS in 1.1.9 does not use bloom filters). However, although it's far less, it's still using a number of megabytes. Why is it not zero?

Column Family:
SSTable count: 526
Space used (live): 7251063348
Space used (total): 7251063348
Number of Keys (estimate): 23895552
Memtable Columns Count: 45719
Memtable Data Size: 21207173
Memtable Switch Count: 579
Read Count: 21773431
Read Latency: 4.155 ms.
Write Count: 16183367
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Positives: 2442
Bloom Filter False Ratio: 0.00245
Bloom Filter Space Used: 44674656
Compacted row minimum size: 73
Compacted row maximum size: 105778
Compacted row mean size: 1104

Thanks, -Mike
Re: -pr vs. no -pr
The way I've always thought about it is that -pr will make sure the information that a specific node originates is consistent with its replicas. So, we know that a node is responsible for a specific token range, and the next nodes in the ring will hold its replicas. The -pr option will make sure that a specific node's information is consistent with its replicas, but will not make sure the node has all the replicated information it should hold from the nodes before it in the ring. Without the -pr option, not only will the current node make sure its information and its replicas' information is consistent, but it will also make sure that all the information it is a replica for is consistent. If you run regular repairs on all the nodes in your cluster, then -pr is sufficient: every node will run repair and make sure its information is consistent with its replicas, eventually creating a fully consistent cluster. This is a quicker process per node, and will have less impact on your operations by essentially spreading out the pain. For instance, we run a 12-node cluster. We run nodetool repair -pr on nodes that are opposite each other, 4 nodes a day (2 nodes in the morning, 2 nodes in the evening). With a grace period of 10 days, this allows us to run repairs twice a week on a specific node, and to occasionally skip a repair on a specific node once a week. In this case, without -pr, a lot of extra work would be done; in fact, with an RF of 3 (in our case), the time per repair would increase many-fold. Another way to think about it, although likely not 100% technically correct: a repair -pr will cause a push of a node's information to its replicas. Without -pr, it will cause that push, and it will also cause the nodes it is a replica for to push their information as well. -Mike On Feb 28, 2013, at 9:39 PM, Hiller, Dean wrote: Isn't there more to it than that?
You really have nodes responsible for token ranges like so (using describe ring). What we see is this from our describe ring... (1 to 6 are token ranges while A to F are servers):

A - 1, 2, 3
B - 2, 3, 4
C - 3, 4, 5
D - 4, 5, 6
E - 5, 6, 1
F - 6, 1, 2

With -pr, only token range 1 is repaired, I think, right? 2 and 3 are only repaired without the -pr option? This means if I have a node that just joined the cluster, I should not be using -pr, as 2 and 3 on node A will not be up to date. Using -pr is nice if I am going to repair every single node, and is nice for the cron job that has to happen before gc_grace_seconds. Am I wrong here? I.e., -pr is really only good for use in the cron job, as it would miss 2 and 3 above. I could run the cron on just two servers, but then my nodes are configured differently, which can be a hassle. Please verify that this is what you believe happens as well? Thanks, Dean On 2/28/13 5:58 PM, Takenori Sato (Cloudian) ts...@cloudian.com wrote: Hi, Please note that I confirmed this on v1.0.7. I mean a repair involves all three nodes and pushes and pulls data, right? Yes, but that's how -pr works. A repair without -pr does more. For example, suppose you have a ring with RF=3 like this: A - B - C - D - E - F. Then, a repair on A without -pr covers 3 ranges as follows: [A, B, C] [E, F, A] [F, A, B]. Among them, the first one, [A, B, C], is the primary range of A. So, with -pr, a repair runs only for: [A, B, C]. I could run nodetool repair on just 2 nodes (RF=3) instead of using nodetool repair -pr??? Yes. You need to run two repairs, on A and D. What is the advantage of -pr then? Whenever you want to minimize repair impact. For example, suppose you have one node down for a while, and bring it back into the cluster. You need to run repair without affecting the entire cluster. Then -pr is the option. Thanks, Takenori (2013/03/01 7:39), Hiller, Dean wrote: Isn't it true that if I have 6 nodes, I could run nodetool repair on just 2 nodes (RF=3) instead of using nodetool repair -pr???
What is the advantage of pr then? I mean a repair involves all three nodes and pushes and pulls data, right? Thanks, Dean
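Dean's 6-node table can be turned into a tiny model that reproduces both claims in this thread (a sketch under the thread's assumptions: 6 nodes, RF=3, node at position i owns primary range i+1 and replicates the two ranges before it):

```python
# Toy model of the -pr discussion: each node holds its primary range plus the
# next rf-1 ranges (matching Dean's "A - 1, 2, 3" describe-ring output).
nodes = ["A", "B", "C", "D", "E", "F"]

def repaired_ranges(node, pr, rf=3):
    """Token ranges (1-based) touched by 'nodetool repair [-pr]' on a node."""
    i = nodes.index(node)
    if pr:
        return [i + 1]                                   # primary range only
    return sorted(((i + k) % len(nodes)) + 1 for k in range(rf))
```

This shows Takenori's point: full repairs on just A and D cover all six ranges, while -pr on A touches only range 1, so -pr needs to run on every node to cover the ring.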
Re: Size Tiered - Leveled Compaction
BTW, when I say major compaction, I mean running the nodetool compact command (which does a major compaction for size-tiered compaction). I didn't see the distribution of SSTables I expected until I ran that command, in the steps I described below. -Mike On Feb 14, 2013, at 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads can go to more SSTables; we saw 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single-threaded, and it's not recommended to turn on multithreaded compaction. We even tried that; it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write-intensive, only 100 w/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That's more than 150K files in one directory. (200G/5M = 40K SSTables, and each SSTable creates 4 files on disk.) You might want to watch that and decide on the SSTable size. By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory; it tells you the SSTable distribution among the different levels.
-Wei From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from size-tiered to leveled compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully alter the column family, and I immediately notice a large number (1000+) of pending compaction tasks become available, but no compactions get executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from leveled compaction (lots and lots of 10MB files). A couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for leveled compaction as they are for size-tiered? All the documentation I found concerning transitioning from size-tiered to leveled compaction discusses the ALTER TABLE CQL command, but I haven't found much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
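Wei's tip about $CFName.json can be checked with a few lines of Python. The manifest shape below is a hypothetical minimal example for illustration (inspect a real manifest on your node before relying on its exact structure):

```python
import json

# Sketch: count SSTables per LCS level from a leveled-manifest-style JSON.
# The "generations"/"members" shape here is an assumed minimal example.
manifest_text = (
    '{"generations": ['
    '{"generation": 0, "members": [12, 13]}, '
    '{"generation": 1, "members": [1, 2, 3, 4]}]}'
)
manifest = json.loads(manifest_text)
per_level = {g["generation"]: len(g["members"]) for g in manifest["generations"]}
```

Watching per_level after the ALTER TABLE (and again after compactions run) is a quick way to confirm whether data has actually been redistributed into levels, which speaks directly to Mike's question 1.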
Read operations resulting in a write?
Hello, We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code. We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as Bob. During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load. The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3-10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationships between Bob and the other tables are what we expect. I brought up a test node to experiment, and see a situation where, when a select statement is executed, a write will occur. In my test, I perform the following (switching between nodetool and cqlsh):

update bob set 'about'='coworker' where key='hex key';
nodetool flush
update bob set 'about'='coworker' where key='hex key';
nodetool flush
update bob set 'about'='coworker' where key='hex key';
nodetool flush
update bob set 'about'='coworker' where key='hex key';
nodetool flush
update bob set 'about'='coworker' where key='hex key';
nodetool flush

Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause writes to show up in the write count of the column family:

select about,changed,data from bob where key='hex key';

This situation will continue until a minor compaction is completed. I went into the code and added some traces to CollationController.java:

private ColumnFamily collectTimeOrderedData()
{
    logger.debug("collectTimeOrderedData");
    ...
    ... snip ...
    // --- HERE
    logger.debug("tables iterated: " + sstablesIterated + " Min compact: " + cfs.getMinimumCompactionThreshold());
    // hoist up the requested data into a more recent sstable
    if (sstablesIterated > cfs.getMinimumCompactionThreshold()
        && !cfs.isCompactionDisabled()
        && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
    {
        RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
        try
        {
            // --- HERE
            logger.debug("Apply hoisted up row mutation");
            // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
            Table.open(rm.getTable()).apply(rm, false, false);
        }
        catch (IOException e)
        {
            // log and allow the result to be returned
            logger.error("Error re-writing read results", e);
        }
    }
    ... snip ...

Performing the steps above, I see the following traces (in the test environment I decreased the minimum compaction threshold to make this easier to reproduce). After I do a couple of update/flush cycles, I see this in the log:

DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy

Then, until compaction occurs, I see (when performing a select):

DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
--- DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
DEBUG [ReadStage:61] 2012-12-14 22:55:17,192
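The condition Mike traced boils down to a simple predicate. The sketch below models it in Python (a simplified illustration of the logic in that Java snippet, not Cassandra code; parameter names are descriptive stand-ins):

```python
# Simplified model of the "defragmentation" path in
# CollationController.collectTimeOrderedData(): if a name-based read had to
# touch more SSTables than min_compaction_threshold, and the CF uses
# size-tiered compaction, the merged row is written back to a fresh memtable,
# so the SELECT registers as a write on the column family.

def read_triggers_hoist(sstables_iterated, min_compaction_threshold,
                        compaction_disabled=False, strategy="SizeTiered"):
    return (sstables_iterated > min_compaction_threshold
            and not compaction_disabled
            and strategy == "SizeTiered")
```

This matches the trace above: with the threshold lowered to 2 and 4 SSTables iterated, every select re-writes the row until a minor compaction collapses the fragments and the read touches few enough SSTables again.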
Cassandra compression not working?
Hello, We are running into an unusual situation that I'm wondering if anyone has any insight on. We've been running a Cassandra cluster for some time, with compression enabled on one column family in which text documents are stored, utilizing the SnappyCompressor and a 64k chunk length. It was recently discovered that Cassandra was reporting a compression ratio of 0. I took a snapshot of the data and started a cassandra node in isolation to investigate. Running nodetool scrub or nodetool upgradesstables had little impact on the amount of data that was being stored. I then disabled compression and ran nodetool upgradesstables on the column family. Again, no impact on the data size stored. I then re-enabled compression and ran nodetool upgradesstables on the column family. This resulted in a 60% reduction in the data size stored, and Cassandra reporting a compression ratio of about .38. Any idea what is going on here? Obviously I can go through this process in production to enable compression, however, any idea what is currently happening and why new data does not appear to be compressed? Any insights are appreciated, Thanks, -Mike
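As an aside on what a chunked compression ratio like the .38 above actually measures: it is total compressed bytes divided by total raw bytes, computed chunk by chunk. The sketch below illustrates only the ratio arithmetic; it is not Cassandra code, and since Snappy is not in the JDK, java.util.zip.Deflater stands in for the SnappyCompressor purely for illustration.

```java
import java.util.zip.Deflater;

// Illustrative sketch of a chunked compression ratio (compressed/raw), the
// figure Cassandra reports per column family. Cassandra 1.1 uses Snappy on
// 64 KB chunks; Deflater is used here only because it ships with the JDK.
public class ChunkRatio {
    static final int CHUNK = 64 * 1024;

    static double ratio(byte[] raw) {
        long compressed = 0;
        for (int off = 0; off < raw.length; off += CHUNK) {
            int len = Math.min(CHUNK, raw.length - off);
            Deflater d = new Deflater();
            d.setInput(raw, off, len);
            d.finish();
            byte[] out = new byte[len + 64]; // room for incompressible chunks
            while (!d.finished())
                compressed += d.deflate(out);
            d.end();
        }
        return (double) compressed / raw.length;
    }

    public static void main(String[] args) {
        // Repetitive document text compresses well: ratio far below 1.0
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) sb.append("some document text ");
        System.out.println(ratio(sb.toString().getBytes()));
    }
}
```

A ratio near 1.0 (or a reported 0, meaning "no compressed data") on text documents is the red flag described above; ~0.38 is in the expected range.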
Re: Cassandra Messages Dropped
There were no errors in the log (other than the messages dropped exception pasted below), and the node does recover. We have only a small number of secondary indexes (3 in the whole system). However, I went through the cassandra code, and I believe I've worked through this problem. Just to finish out this thread, I realized that when you see:

INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) FlushWriter 1 5 0

it is an issue. Cassandra will at various times enqueue many memtables for flushing. By default, the queue size for this is 4. If more than 5 memtables get queued for flushing (4, plus 1 for the one currently being flushed), a lock will be acquired and held across all tables until all memtables that need to be flushed are enqueued. If it takes more than rpc_timeout_in_ms to flush enough information to allow all the pending memtables to be enqueued, messages will be dropped. To put it in other words, Cassandra will lock down all tables until all pending flush requests fit in the pending queue. If your queue size is 4, and 8 tables need to be flushed, Cassandra will lock down all tables until a minimum of 3 memtables are flushed. With this in mind, I went through the cassandra log and found this was indeed the case, looking at log entries similar to these:

INFO [OptionalTasks:1] 2012-09-16 05:54:29,750 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 29553 ops)
...
INFO [FlushWriter:29] 2012-09-16 05:54:29,768 Memtable.java (line 266) Writing Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 29553 ops)
...
INFO [FlushWriter:29] 2012-09-16 05:54:30,254 Memtable.java (line 307) Completed flushing /data/cassandra/data/open/people/open-p-hd-441-Data.db

I was able to figure out what the rpc_timeout_in_ms needed to be to temporarily prevent the problem. We had plenty of write I/O available. We also had free memory.
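The queue arithmetic above (queue of 4 plus the one in-flight flush; 8 pending memtables leave 3 that must finish before writes unblock) can be sketched as follows. This is a hypothetical helper for illustration, not actual Cassandra code:

```java
// Hypothetical helper illustrating the flush-queue arithmetic described above,
// not Cassandra code. Available slots = memtable_flush_queue_size + 1 (the
// flush currently being written). Pending memtables beyond that must complete
// before the switch lock is released and writes resume.
public class FlushQueueMath {
    static int flushesBeforeUnblocked(int pendingMemtables, int flushQueueSize) {
        int slots = flushQueueSize + 1; // queue capacity + the in-flight flush
        return Math.max(0, pendingMemtables - slots);
    }

    public static void main(String[] args) {
        // 8 tables need flushing with a queue size of 4: 3 must finish first
        System.out.println(flushesBeforeUnblocked(8, 4)); // prints 3
    }
}
```

This also shows why raising memtable_flush_queue_size to 8 (as described below) helps: 8 pending memtables then fit in the 9 available slots and nothing blocks.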
I increased the memtable_flush_writers to 2 and memtable_flush_queue_size to 8. We haven't had any timeouts for a number of days now. Thanks for your help, -Mike

On Sep 18, 2012, at 5:14 AM, aaron morton wrote:

Any errors in the log ? The node recovers ? Do you use secondary indexes ? If so check comments for memtable_flush_queue_size in the yaml. if this value is too low writes may back up. But I would not expect it to cause dropped messages.

nodetool info also shows we have over a gig of available memory on the JVM heap of each node.

Not all memory is created equal :) ParNew is kicking in to GC the Eden space in the New Heap. It may just be that the node is getting hammered by something and IO is getting overwhelmed. If you can put the logs up someone might take a look. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 18/09/2012, at 3:46 PM, Michael Theroux mthero...@yahoo.com wrote:

Thanks for the response. We are on version 1.1.2. We don't see the MutationStage back up. The dump from the messages dropped error doesn't show a backup, but also watching nodetool tpstats doesn't show any backup there. nodetool info also shows we have over a gig of available memory on the JVM heap of each node. The earliest GCInspector traces I see before one of the more recent incidents in which messages were dropped are:

INFO [ScheduledTasks:1] 2012-09-18 02:25:53,928 GCInspector.java (line 122) GC for ParNew: 396 ms for 1 collections, 2064505088 used; max is 4253024256
INFO [ScheduledTasks:1] 2012-09-18 02:25:55,929 GCInspector.java (line 122) GC for ParNew: 485 ms for 1 collections, 1961875064 used; max is 4253024256
INFO [ScheduledTasks:1] 2012-09-18 02:25:57,930 GCInspector.java (line 122) GC for ParNew: 265 ms for 1 collections, 1968074096 used; max is 4253024256

But this was 45 minutes before messages were dropped.
It's appreciated, -Mike On Sep 17, 2012, at 11:27 PM, aaron morton wrote: INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) MemtablePostFlusher 1 5 0 INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) FlushWriter 1 5 0 Looks suspiciously like http://mail-archives.apache.org/mod_mbox/cassandra-user/201209.mbox/%3c9fb0e801-b1ed-41c4-9939-bafbddf15...@thelastpickle.com%3E What version are you on ? Are there any ERROR log messages before this ? Are you seeing MutationStage back up ? Are you see log messages from GCInspector ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 18/09/2012, at 2:16 AM, Michael Theroux mthero...@yahoo.com wrote: Hello, While under load, we have occasionally been seeing messages dropped errors
Secondary index loss on node restart
Hello, We have been noticing an issue where, about 50% of the time in which a node fails or is restarted, secondary indexes appear to be partially lost or corrupted. A drop and re-add of the index appears to correct the issue. There are no errors in the cassandra logs that I see. Part of the index seems to be simply missing. Sometimes this corruption/loss doesn't happen immediately, but some time after the node is restarted. In addition, the index never appears to have an issue when the node comes down; it is only after the node comes back up and recovers that we experience an issue. We developed some code that goes through all the rows in the table, by key, in which the index is present. It then attempts to look up the information via the secondary index, in an attempt to detect when the issue occurs. Another odd observation is that the number of members present in the index when we have the issue varies up and down (the index and the tables don't change that often). We are running a 6 node Cassandra cluster with a replication factor of 3; the consistency level for all queries is LOCAL_QUORUM. We are running Cassandra 1.1.2. Anyone have any insights? -Mike
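For readers wanting to build a similar checker, here is a minimal in-memory sketch of the verification described above. The real code walked the actual column family through a client library; here a Map stands in for the table and another Map for the secondary index, and all names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory sketch of the table-vs-index consistency check described above.
// For each row (keyed by primary key), verify the row can also be reached
// through the secondary index on its value; collect keys the index has lost.
public class IndexChecker {
    static Set<String> missingFromIndex(Map<String, String> tableByKey,
                                        Map<String, Set<String>> indexByValue) {
        Set<String> missing = new TreeSet<>();
        for (Map.Entry<String, String> row : tableByKey.entrySet()) {
            Set<String> keys = indexByValue.get(row.getValue());
            if (keys == null || !keys.contains(row.getKey()))
                missing.add(row.getKey());
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("k1", "blue");
        table.put("k2", "red");
        Map<String, Set<String>> index = new HashMap<>();
        index.put("blue", new HashSet<>(Arrays.asList("k1")));
        // the "red" -> k2 index entry was lost after the restart
        System.out.println(missingFromIndex(table, index)); // prints [k2]
    }
}
```

Run periodically, a non-empty result flags the partial index loss described above without waiting for application-level symptoms.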
Re: Cassandra Messages Dropped
Love the Mars lander analogies :) On Sep 23, 2012, at 5:39 PM, aaron morton wrote: To put in other words, Cassandra will lock down all tables until all pending flush requests fit in the pending queue. This was the first issue I looked at in my Cassandra SF talk http://www.datastax.com/events/cassandrasummit2012/presentations I've seen it occur more often with lots-o-secondary indexes. We had plenty of write I/O available. We also had free memory. I increased the memtable_flush_writers to 2 and memtable_flush_queue_size to 8. We haven't had any timeouts for a number of days now. Cool. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/09/2012, at 6:09 AM, Michael Theroux mthero...@yahoo.com wrote: There were no errors in the log (other than the messages dropped exception pasted below), and the node does recover. We have only a small number of secondary indexes (3 in the whole system). However, I went through the cassandra code, and I believe I've worked through this problem. Just to finish out this thread, I realized that when you see: INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) FlushWriter 1 5 0 It is an issue. Cassandra will at various times enqueue many memtables for flushing. By default, the queue size for this is 4. If more than 5 memtables get queued for flushing (4 + 1 for the one currently being flushed), a lock will be acquired and held across all tables until all memtables that need to be flushed are enqueued. If it takes more than rpc_timeout_time_in_ms time to flush enough information to allow all the pending memtables to be enqueued, a messages dropped will occur. To put in other words, Cassandra will lock down all tables until all pending flush requests fit in the pending queue. If your queue size is 4, and 8 tables need to be flushed, Cassandra will lock down all tables until a minimum of 3 memtables are flushed. 
With this in mind, I went through the cassandra log and found this was indeed the case looking at log entries similar to these: INFO [OptionalTasks:1] 2012-09-16 05:54:29,750 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 29553 ops) ... INFO [FlushWriter:29] 2012-09-16 05:54:29,768 Memtable.java (line 266) Writing Memtable-p@1525015234(18686281/341486464 serialized/live bytes, 29553 ops) ... INFO [FlushWriter:29] 2012-09-16 05:54:30,254 Memtable.java (line 307) Completed flushing /data/cassandra/data/open/people/open-p-hd-441-Data.db I was able to figure out what the rpc_timeout_in_ms needed to be to temporarily prevent the problem. We had plenty of write I/O available. We also had free memory. I increased the memtable_flush_writers to 2 and memtable_flush_queue_size to 8. We haven't had any timeouts for a number of days now. Thanks for your help, -Mike On Sep 18, 2012, at 5:14 AM, aaron morton wrote: Any errors in the log ? The node recovers ? Do you use secondary indexes ? If so check comments for memtable_flush_queue_size in the yaml. if this value is too low writes may back up. But I would not expect it to cause dropped messages. nodetool info also shows we have over a gig of available memory on the JVM heap of each node. Not all memory is created equal :) ParNew is kicking in to GC the Eden space in the New Heap. It may just be that the node is getting hammered by something and IO is getting overwhelmed. If you can put the logs up someone might take a look. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 18/09/2012, at 3:46 PM, Michael Theroux mthero...@yahoo.com wrote: Thanks for the response. We are on version 1.1.2. We don't see the MutationStage back up. The dump from the messages dropped error doesn't show a backup, but also watching nodetool tpstats doesn't show any backup there. 
nodetool info also shows we have over a gig of available memory on the JVM heap of each node. The earliest GCInspector traces I see before one of the more recent incidents in which messages were dropped are: INFO [ScheduledTasks:1] 2012-09-18 02:25:53,928 GCInspector.java (line 122) GC for ParNew: 396 ms for 1 collections, 2064505088 used; max is 4253024256 NFO [ScheduledTasks:1] 2012-09-18 02:25:55,929 GCInspector.java (line 122) GC for ParNew: 485 ms for 1 collections, 1961875064 used; max is 4253024256 INFO [ScheduledTasks:1] 2012-09-18 02:25:57,930 GCInspector.java (line 122) GC for ParNew: 265 ms for 1 collections, 1968074096 used; max is 4253024256 But this was 45 minutes before messages were dropped. It's appreciated, -Mike On Sep 17, 2012, at 11:27 PM, aaron morton wrote: INFO [ScheduledTasks:1
Cassandra Messages Dropped
Hello, While under load, we have occasionally been seeing "messages dropped" errors in our cassandra log. Doing some research, I understand this is part of Cassandra's design to shed load, and we should look at the tpstats-like output to determine what should be done to resolve the situation. Typically, you will see lots of messages blocked or pending, and that might be an indicator that a specific piece of hardware needs to be improved/tuned/upgraded. However, looking at the output we are getting, I'm finding it difficult to see what needs to be tuned, as it looks to me like cassandra is handling the load within the mutation stage:

INFO [ScheduledTasks:1] 2012-09-17 06:28:03,266 MessagingService.java (line 658) 119 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,645 StatusLogger.java (line 57) Pool Name Active Pending Blocked
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,836 StatusLogger.java (line 72) ReadStage 3 3 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,837 StatusLogger.java (line 72) MutationStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,838 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,838 StatusLogger.java (line 72) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,839 StatusLogger.java (line 72) MemtablePostFlusher 1 5 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) FlushWriter 1 5 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,840 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,841 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,841 StatusLogger.java (line 72) AntiEntropySessions 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,851 StatusLogger.java (line 72) HintedHandoff 0 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,851 StatusLogger.java (line 77) CompactionManager 0 0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,852 StatusLogger.java (line 89) MessagingService n/a 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,852 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 100) KeyCache 2184533 2184533 all
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 106) RowCache 0 0 all org.apache.cassandra.cache.SerializingCacheProvider
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 113) ColumnFamily Memtable ops,data
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,853 StatusLogger.java (line 116) system.NodeIdInfo 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) system.IndexInfo 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) system.LocationInfo 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,854 StatusLogger.java (line 116) system.Versions 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) system.schema_keyspaces 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) system.Migrations 0,0
INFO [ScheduledTasks:1] 2012-09-17 06:28:03,855 StatusLogger.java (line 116) system.schema_columnfamilies 0,0
Cassandra, AWS and EBS Optimized Instances/Provisioned IOPs
Hello, A number of weeks ago, Amazon announced the availability of EBS Optimized instances and Provisioned IOPS for Amazon EC2. Historically, I've read that EBS is not recommended for Cassandra due to the network contention that can quickly result (http://www.datastax.com/docs/1.0/cluster_architecture/cluster_planning). Costs put aside, and assuming everything promoted by Amazon is accurate, with the existence of provisioned IOPS, is EBS now a better option than before? Taking the points against EBS mentioned in the link above:

"EBS volumes contend directly for network throughput with standard packets. This means that EBS throughput is likely to fail if you saturate a network link." According to Amazon, Provisioned IOPS is guaranteed to be within 10% of the provisioned performance 99.9% of the time. This would mean that throughput should no longer fail.

"EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the system to backload reads and writes until the entire cluster becomes unresponsive." Same point as above.

"Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all of the data it is responsible for managing." I believe this may still be true, although I'm not entirely sure why this is more true for EBS volumes vs. ephemeral.

Any real world experience out there with these new EBS options? -Mike
Re: nodetool repair
So, if I have a 6 node cluster in the token ring, A-B-C-D-E-F, replication factor 3, and I run repair (without -pr) on A, is the flow of information: A synchronizes information it is responsible for with B and C (because B and C are replicas of A). A, as a replica of E and F, synchronizes E and F's information to itself. If I re-ran repair on E, it would actually be redoing the work of synchronizing with A, but would still be doing the worthwhile work of synchronizing with F. Running repair with -pr will prevent this duplicate work, if you run it on each node. If I ran repair on A with -pr, A synchronizes its primary range with B and C, but does not perform synchronization work with E and F. Does this sound correct? -Mike

On Jul 15, 2012, at 9:47 AM, Edward Capriolo wrote:

Great job sleuthing. Originally repair did not have a -pr. When you run the standard repair the node compares its data with its neighbours and vice versa. They also send each other updates. Since you are supposed to repair every node within gc_grace, submitting a full repair to each node would create duplicated work, since a repair on node A has an effect on node B and node C. If you want to understand this some more you should run compactionstats and netstats across your cluster while a repair is going on; then you can see what effect the commands have on other nodes. I will try to write up some documentation on it as well because -pr is a nice feature. Many may not even be expressly aware of it.

On Sat, Jul 14, 2012 at 2:00 PM, Michael Theroux mthero...@yahoo.com wrote:

Hello, I'm looking at nodetool repair with the -pr, vs. non -pr option. Looking around, I'm seeing a lot of conflicting information out there. Almost universally, the recommendation is to run nodetool repair with -pr for any day-to-day maintenance. This is my understanding of how it works. I appreciate any corrections to my misinformation. nodetool repair -pr - This performs a repair on the primary range of the node.
The primary range is essentially the part of the ring that the node is responsible for. When this command is run, synchronization of replicas will occur for the rows that this node is responsible for. If replicas are missing from that node's neighbors for those rows, they will be replicated. nodetool repair - This is where I see a lot of conflicting information. I see a lot of answers in which there is a suggestion that this command will perform a repair across the entire cluster. However, I don't believe this is true from my observations (and some of the items I read seem to agree with this). Instead, this command performs synchronization of your primary range, but also for other ranges that this node may be responsible for in a replica capacity. The way I'm thinking about it is that the -pr option causes repairs to push information from its primary range to replicas. Without -pr, nodetool repair does a push, and a pull from its neighbors for ranges that this node may be a replica for. This makes sense to me, as people recommend running nodetool repair after a node has been down. This is to allow the downed node to get any missed information that should have been replicated to it while it was down. I'm sure there are lots of flaws in the above understanding as I'm cobbling it together. I appreciate the feedback, -Mike
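The A-F example in this thread can be sketched mechanically. Assuming SimpleStrategy-style placement (each primary range is also replicated to the next RF-1 nodes clockwise; this is an illustration, not Cassandra code), you can compute both who replicates A's range and which ranges A itself holds, which is exactly the set a full (non -pr) repair on A synchronizes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of SimpleStrategy-style placement for the 6-node ring A..F with RF=3:
// a node's primary range is also replicated to the next RF-1 nodes clockwise.
// From this you can read off which ranges a full (non -pr) repair on A must
// synchronize: A's own range plus the ranges A replicates for E and F.
public class RepairRanges {
    // the RF nodes holding replicas of the range owned by ring[primaryIdx]
    static List<String> replicasOf(List<String> ring, int primaryIdx, int rf) {
        List<String> r = new ArrayList<>();
        for (int i = 0; i < rf; i++)
            r.add(ring.get((primaryIdx + i) % ring.size()));
        return r;
    }

    // ranges (named by their primary owner) that 'node' holds a replica of
    static List<String> rangesHeldBy(List<String> ring, String node, int rf) {
        List<String> held = new ArrayList<>();
        for (int i = 0; i < ring.size(); i++)
            if (replicasOf(ring, i, rf).contains(node))
                held.add(ring.get(i));
        return held;
    }

    public static void main(String[] args) {
        List<String> ring = Arrays.asList("A", "B", "C", "D", "E", "F");
        System.out.println(replicasOf(ring, 0, 3));     // prints [A, B, C]
        System.out.println(rangesHeldBy(ring, "A", 3)); // prints [A, E, F]
    }
}
```

The second result is the crux of the thread: without -pr, repair on A touches the ranges of A, E and F, so repairing E afterwards repeats the A-range work; with -pr, each node repairs only the first result's range.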
Re: Increased replication factor not evident in CLI
Just to completely eliminate the possibility of the same bug, if you look here: http://www.mail-archive.com/dev@cassandra.apache.org/msg04992.html If you create a test keyspace, and look at the timestamp in the schema_keyspaces column family in comparison to your existing keyspace, is that timestamp greater? Thanks, -Mike On Jul 12, 2012, at 8:56 PM, Michael Theroux wrote: Sounds a lot like a bug that I hit that was filed and fixed recently: https://issues.apache.org/jira/browse/CASSANDRA-4432 -Mike On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote: Possibly the bug with nanotime causing cassandra to think the change happened in the past. Talked about onlist in past few days. On Thursday, July 12, 2012, aaron morton aa...@thelastpickle.com wrote: Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ? Do show schema and show keyspace say the same thing ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 13/07/2012, at 7:39 AM, Dustin Wenz wrote: We recently increased the replication factor of a keyspace in our cassandra 1.1.1 cluster from 2 to 4. This was done by setting the replication factor to 4 in cassandra-cli, and then running a repair on each node. Everything seems to have worked; the commands completed successfully and disk usage increased significantly. However, if I perform a describe on the keyspace, it still shows replication_factor:2. So, it appears that the replication factor might be 4, but it reports as 2. I'm not entirely sure how to confirm one or the other. Since then, I've stopped and restarted the cluster, and even ran an upgradesstables on each node. The replication factor still doesn't report as I would expect. Am I missing something here? - .Dustin
Re: Increased replication factor not evident in CLI
Sounds a lot like a bug that I hit that was filed and fixed recently: https://issues.apache.org/jira/browse/CASSANDRA-4432 -Mike On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote: Possibly the bug with nanotime causing cassandra to think the change happened in the past. Talked about onlist in past few days. On Thursday, July 12, 2012, aaron morton aa...@thelastpickle.com wrote: Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ? Do show schema and show keyspace say the same thing ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 13/07/2012, at 7:39 AM, Dustin Wenz wrote: We recently increased the replication factor of a keyspace in our cassandra 1.1.1 cluster from 2 to 4. This was done by setting the replication factor to 4 in cassandra-cli, and then running a repair on each node. Everything seems to have worked; the commands completed successfully and disk usage increased significantly. However, if I perform a describe on the keyspace, it still shows replication_factor:2. So, it appears that the replication factor might be 4, but it reports as 2. I'm not entirely sure how to confirm one or the other. Since then, I've stopped and restarted the cluster, and even ran an upgradesstables on each node. The replication factor still doesn't report as I would expect. Am I missing something here? - .Dustin
Re: Serious issue updating Cassandra version and topology
Hello Aaron, Thank you for responding. Since the time of my original email, we noticed that in the process of performing this upgrade, data was lost. We have restored from backup and are now trying this again with two changes: 1) We will be using 1.1.2 throughout the cluster 2) We have switched back to Tiered compaction. In the process I've hit another very interesting issue that I will write a separate email about. However, to answer your questions: this happened on the 1.1.2 node, and it happened again after we ran the scrub. The data has been around for a while; we upgraded from 1.0.7 to 1.1.2. Unfortunately, I can't check the sstables as we've restarted the migration from the beginning. If it happens again, I'll respond with more information. Thanks again, -Mike

On Jul 10, 2012, at 5:05 AM, aaron morton wrote:

To be clear, this happened on a 1.1.2 node and it happened again *after* you had run a scrub ? Has this cluster been around for a while or was the data created with 1.1 ? Can you confirm that all sstables were re-written for the CF? Check the timestamp on the files. Also all files should have the same version, the -h?- part of the name. Can you repair the other CF's ? If this cannot be repaired by scrub or upgradesstables you may need to cut the row out of the sstables, using sstable2json and json2sstable. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 8/07/2012, at 4:05 PM, Michael Theroux wrote:

Hello, We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. Once our replication factor was upped to 3, we ran nodetool repair, and immediately hit an issue on the first node we ran repair on: INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] new session: will sync xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, /10.29.187.61 on range (Token(bytes[d558]),Token(bytes[])] for x.[a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s]
INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for a (to [/10.29.187.61, xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from /10.29.187.61
INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for a from xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] requesting merkle trees for b (to [/10.29.187.61, xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Endpoints /10.29.187.61 and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are consistent for a
INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] a is fully synced (18 remaining column family to sync for this session)
INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1--4f728ab9d6ff] Received merkle tree for b from /10.29.187.61
ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
java.lang.AssertionError: row DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]), efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received out of order wrt DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]), f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
    at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
    at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
    at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
    at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

It looks from the log above like the sync of the a column family was successful. However, the b column family resulted in this error. In addition, the repair hung after this error. We ran nodetool scrub on all nodes and invalidated the key and row caches and tried again (with RF=2), and it didn't help alleviate the problem. Some other important pieces of information: We use ByteOrderedPartitioner (we MD5 hash
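For context on the AssertionError above: with ByteOrderedPartitioner, tokens sort by unsigned lexicographic byte order, and the validator requires rows to arrive in that order. The key starting efd5... sorts before the one starting f33a..., so receiving it afterwards is "out of order". A small sketch (not Cassandra code) comparing the two keys from the log:

```java
// Sketch illustrating ByteOrderedPartitioner ordering: tokens compare by
// unsigned lexicographic byte order, which is why the validator rejects a row
// whose key sorts before the previously received one. Not Cassandra code.
public class ByteOrder {
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        return out;
    }

    public static void main(String[] args) {
        byte[] received = fromHex("efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47");
        byte[] previous = fromHex("f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb");
        // received sorts before previous, i.e. it arrived "out of order wrt" it
        System.out.println(compareUnsigned(received, previous) < 0); // prints true
    }
}
```

So the assertion indicates the sstable being validated holds rows violating the partitioner's expected order, which is why scrub/upgradesstables (which rewrite sstables in order) are the usual remedies.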
Re: Expanding Cassandra on EC2 with consistency
Yes, I saw LOCAL_QUORUM. The definition I saw was: "Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes, within the local datacenter (requires NetworkTopologyStrategy)". This will allow a quorum within a datacenter. However, I think this means that if availability zones are racks, the quorum would still be across availability zones. -Mike

On Jul 3, 2012, at 9:22 AM, Robin Verlangen wrote:

Hi Mike, I'm not sure about all your questions, however you should take a look at LOCAL_QUORUM for your question about consistency level reads/writes.

2012/7/3 Michael Theroux mthero...@yahoo.com

Hello, We are currently running a web application utilizing Cassandra on EC2. Given the recent outages experienced with Amazon, we want to consider expanding Cassandra across availability zones sooner rather than later. We are trying to determine the optimal way to deploy Cassandra in this deployment. We are researching the NetworkTopologyStrategy and the Ec2Snitch. We are also interested in providing a high level of read or write consistency. My understanding is that the Ec2Snitch recognizes availability zones as racks, and regions as data-centers. This seems to be a common configuration. However, if we were to utilize queries with a READ or WRITE consistency of QUORUM, would there be a high possibility that the communication necessary to establish a quorum would cross availability zones? My understanding is that the NetworkTopologyStrategy attempts to prefer that replicas be stored on other racks within the datacenter, which would equate to other availability zones in EC2. This implies to me that in order to have the quorum of nodes necessary to achieve consistency, Cassandra will communicate with nodes across availability zones. First, is my understanding correct? Second, given the high latency that can sometimes exist between availability zones, is this a problem, and should we instead treat availability zones as data centers?
Ideally, we would be able to setup a situation where we could store replicas across availability zones in case of failure, but establish a high level of read or write consistency within a single availability zone. I appreciate your responses, Thanks, -Mike -- With kind regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
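The "ReplicationFactor / 2 + 1" formula quoted in this thread, and the observation driving Mike's concern, can be made concrete with a small sketch (an illustration, not Cassandra code): with RF=3 and one replica placed per availability zone (as rack-aware placement tends to do), a quorum of 2 must always touch a second zone.

```java
// The "ReplicationFactor / 2 + 1" quorum formula quoted above, plus the point
// raised in this thread: with one replica per availability zone, a quorum
// larger than 1 necessarily crosses zones. Illustrative sketch only.
public class QuorumMath {
    static int quorum(int rf) {
        return rf / 2 + 1; // integer division
    }

    // if replicas are spread one per zone, a quorum touches this many zones
    static int zonesTouched(int rf, int zones) {
        int replicasPerZone = (int) Math.ceil((double) rf / zones);
        return (int) Math.ceil((double) quorum(rf) / replicasPerZone);
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));          // prints 2
        System.out.println(zonesTouched(3, 3)); // prints 2: quorum crosses AZs
    }
}
```

This is exactly why LOCAL_QUORUM keeps the quorum inside a datacenter but not inside a rack: with zones mapped to racks, any multi-node quorum still spans zones.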