Re: How to model data to achieve specific data locality
Some of the sequences grow so fast that sub-partitioning is inevitable. I may need to try different bucket sizes to get the optimal throughput. Thank you all for the advice.

On Mon, Dec 8, 2014 at 9:55 AM, Eric Stevens migh...@gmail.com wrote:

The upper bound for the data size of a single column is 2 GB, and the upper bound for the number of columns in a row (partition) is 2 billion. So if you wanted to create the largest possible row, you probably couldn't afford enough disks to hold it. http://wiki.apache.org/cassandra/CassandraLimitations

Practically speaking, you start running into trouble *way* before you reach those thresholds, though. Large columns and large numbers of columns create GC pressure in your cluster, and since all data for a given row reside on the same primary and replicas, this tends to lead to hot spotting. Repair happens for entire rows, so large rows increase the cost of repairs, including GC pressure during the repair. And rows of this size are often arrived at by appending to the same row repeatedly, which causes the data for that row to be scattered across a large number of SSTables, hurting read performance. Also, depending on your interface, you'll find you start hitting limits that you have to increase, each with its own implications (e.g., maximum Thrift message sizes and so forth).

The right maximum practical size for a row definitely depends on your read and write patterns, as well as your hardware and network. More memory, SSDs, larger SSTables, and faster networks will all raise the ceiling at which large rows start to become painful.

@Kai, if you're familiar with the Thrift paradigm, the partition key equates to a Thrift row key, and the clustering key equates to the first part of a composite column name. CQL PRIMARY KEY ((a,b), c, d) equates to Thrift where the row key is ['a:b'] and all columns begin with ['c:d:'].
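Eric's Thrift-to-CQL mapping can be sketched concretely. This is purely illustrative: the `:` separator is a stand-in for the real composite-type byte encoding, and the function name is made up for the example.

```python
# Illustrative sketch of how CQL PRIMARY KEY ((a, b), c, d) lays out
# under the old Thrift storage view. The ":" separator here is a
# simplification of the actual composite-type byte encoding.
def thrift_view(a, b, c, d, cell_name):
    row_key = f"{a}:{b}"             # composite partition (row) key
    column = f"{c}:{d}:{cell_name}"  # clustering values prefix every cell
    return row_key, column

# all cells of one CQL row share the row key and the "c:d:" prefix
print(thrift_view("a", "b", "c", "d", "value"))
```

Every non-key CQL column of the same logical row becomes another Thrift cell under the same row key, sharing that clustering-value prefix.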
Recommended reading: http://www.datastax.com/dev/blog/thrift-to-cql3

Whatever your partition key, if you need to sub-partition to maintain reasonable row sizes, then the only way to preserve data locality for related records is probably to switch to the byte-ordered partitioner and compute a blob or long column as part of your partition key that is meant to map the PK to the same token. Just be aware that the byte-ordered partitioner comes with a number of caveats, and you'll become responsible for maintaining good data load distribution in your cluster. But the benefits of being able to tune locality may be worth it.

On Sun, Dec 7, 2014 at 3:12:11 PM, Jonathan Haddad j...@jonhaddad.com wrote:

I think he mentioned 100 MB as the max size - planning for 1 MB might make your data model difficult to work with.

On Sun, Dec 7, 2014 at 12:07:47 PM, Kai Wang dep...@gmail.com wrote:

Thanks for the help. I wasn't clear on how clustering columns work. Coming from Thrift experience, it took me a while to understand how a clustering column impacts partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucket assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size. On another thread Eric mentions the optimal partition size should be at 100 KB ~ 1 MB. I will use that as the starting point to design my bucket strategy.

On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com wrote:

It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term "sequence" is being overloaded in some subtly misleading way here. Besides, we've already answered the headline question - data locality is achieved by having a common partition key.
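The bucketing ("sub-partitioning") scheme being discussed can be sketched as follows. The bucket size and names are assumptions for illustration, not something decided in the thread; the point is that the partition key becomes (seq_id, bucket), so a sequence is split into bounded partitions while rows within a bucket stay together.

```python
# Hypothetical bucketing: cap partition size by folding a bucket
# number into the partition key. Rows 0..999 of a sequence land in
# bucket 0, rows 1000..1999 in bucket 1, and so on. Re-bucketing
# means choosing a smaller bucket_size and rewriting the data.
BUCKET_SIZE = 1000  # tune toward the 100 KB - 1 MB partition guidance

def partition_key(seq_id, seq_no, bucket_size=BUCKET_SIZE):
    return (seq_id, seq_no // bucket_size)

assert partition_key("s1", 0) == ("s1", 0)
assert partition_key("s1", 1500) == ("s1", 1)
```

With the default Murmur3 partitioner, different buckets of the same seq_id will land on arbitrary nodes; keeping them on the same node is exactly the byte-ordered-partitioner trick described above, with its caveats.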
So, we need some clarity as to what question we are really focusing on. And, of course, we should be asking the "Cassandra Data Modeling 101" question of what you want your queries to look like - how exactly do you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.

My immediate question to get things back on track: when you say "The typical read is to load a subset of sequences with the same seq_id", what type of "subset" are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no "subset" concept, nor a "load subset" command, so what are we really talking about? Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.

-- Jack Krupansky

*From:* Eric Stevens migh...@gmail.com *Sent:* Sunday, December 7, 2014 10:12 AM *To:* user@cassandra.apache.org *Subject:* Re: How to model data to achieve specific data locality

Also new seq_types can be added and old seq_types can be
Re: Cassandra Files Taking up Much More Space than CF
Try `nodetool clearsnapshot`, which will delete any snapshots you have. I have never taken a snapshot with nodetool, yet I found several snapshots on my disk recently (which can take a lot of space). So perhaps they are automatically generated by some operation? No idea. Regardless, nuking those freed up a ton of space for me. - Ian

On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder n...@whistle.com wrote:

Hi All, I am new to Cassandra so I apologise in advance if I have missed anything obvious, but this one currently has me stumped.

I am currently running a 6-node Cassandra 2.1.1 cluster on EC2 using c3.2xlarge nodes, which overall is working very well for us. However, after letting it run for a while I seem to get into a situation where the amount of disk space used far exceeds the total amount of data on each node, and I haven't been able to get the size to go back down except by stopping and restarting the node.

For example, almost all of my data is in one table. On one of my nodes right now the total space used (as reported by nodetool cfstats) is 57.2 GB and there are no snapshots. However, when I look at the size of the data files (using du) the data file for that table is 107 GB. Because the c3.2xlarge only has 160 GB of SSD, you can see why this quickly becomes a problem.

Running nodetool compact didn't reduce the size, and neither does running nodetool repair -pr on the node. I also tried nodetool flush and nodetool cleanup (even though I have not added or removed any nodes recently), but it didn't change anything either. In order to keep my cluster up I then stopped and started that node, and the size of the data file dropped to 54 GB while the total column family size (as reported by nodetool) stayed about the same.

Any suggestions as to what I could be doing wrong? Thanks, Nate
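One way to chase a du-vs-cfstats mismatch like this is to total the SSTable component files yourself and compare the result against cfstats "Space used (live)". The directory layout below is an assumption (a table directory containing *.db component files), and the helper name is made up for the sketch:

```python
import glob
import os

def table_disk_usage(table_dir):
    """Sum the on-disk size of SSTable component files (*.db) in one
    table directory -- roughly what `du` reports there. A large gap
    versus `nodetool cfstats` "Space used (live)" suggests obsolete
    SSTables that haven't been deleted yet; snapshots and backups
    live in subdirectories and are not counted here."""
    return sum(os.path.getsize(p)
               for p in glob.glob(os.path.join(table_dir, "*.db")))
```

If the files on disk add up to far more than cfstats reports, the extra space is in files Cassandra no longer considers live, which matches the size dropping after a restart.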
Re: Cassandra Files Taking up Much More Space than CF
Hi Ian, Thanks for the suggestion but I had actually already done that prior to the scenario I described (to get myself some free space), and when I ran nodetool cfstats it listed 0 snapshots as expected, so unfortunately I don't think that is where my space went.

One additional piece of information I forgot to point out is that when I ran nodetool status on the node it included all 6 nodes. I have also heard it mentioned that I may want to have a prime number of nodes, which may help protect against split-brain. Is this true? If so, does it still apply when I am using vnodes?

Thanks again, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com
Re: Cassandra Files Taking up Much More Space than CF
You don't need a prime number of nodes in your ring, but it's not a bad idea for it to be a multiple of your RF when your cluster is small.
Re: Cassandra Files Taking up Much More Space than CF
Thanks Jonathan. So there is nothing too idiotic about my current set-up of 6 boxes, each with 256 vnodes, and a RF of 2?

I appreciate the help, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com
Re: Cassandra Files Taking up Much More Space than CF
Well, I personally don't like RF=2. It means if you're using CL=QUORUM and a node goes down, you're going to have a bad time (downtime). If you're using CL=ONE then you'd be OK. However, I am not wild about losing a node and having only 1 copy of my data available in prod.
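Jonathan's point about RF=2 and QUORUM falls straight out of the quorum arithmetic; a quick sketch:

```python
def quorum(rf):
    # QUORUM requires a strict majority of the replicas
    return rf // 2 + 1

def tolerates_down(rf):
    # replicas that can be lost while a QUORUM request still succeeds
    return rf - quorum(rf)

# RF=2: quorum is 2 of 2, so one node down means downtime at QUORUM.
assert quorum(2) == 2 and tolerates_down(2) == 0
# RF=3: quorum is 2 of 3, so one node can be down.
assert quorum(3) == 2 and tolerates_down(3) == 1
```

This is why RF=3 is the common minimum for clusters that want both QUORUM consistency and single-node fault tolerance.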
Re: Cassandra Files Taking up Much More Space than CF
Thanks for the advice. Totally makes sense. Once I figure out how to make my data stop taking up more than 2x the space it needs, I'll definitely make the change :) Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com
Re: Cassandra Files Taking up Much More Space than CF
Hi Nate, Are you using incremental backups? Extract from the documentation ( http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_backup_incremental_t.html ):

"When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism.

As with snapshots, Cassandra does not automatically clear incremental backup files. *DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.*"

These backups are stored in directories named backups at the same level as the snapshots' directories. Reynald
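If incremental backups were the culprit, the clearing process the documentation recommends could be sketched like this. The directory layout is an assumption (data_root/keyspace/table/backups/), and note that removing a hard link only frees space once no live SSTable still shares the same inode:

```python
import glob
import os

def clear_backups(data_root):
    """Remove incremental-backup hard links. Layout assumed:
    <data_root>/<keyspace>/<table>/backups/*. Each unlink frees
    space only when the original SSTable has already been
    compacted away, since hard links share the underlying inode."""
    removed = 0
    for path in glob.glob(os.path.join(data_root, "*", "*", "backups", "*")):
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed
```

In practice you would run something like this (or a cron job) right after taking a fresh snapshot, per the DataStax recommendation quoted above.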
Observations/concerns with repair and hinted handoff
I have spent a lot of time working with single-node, RF=1 clusters in my development. Before I deploy a cluster to our live environment, I have spent some time learning how to work with a multi-node cluster with RF=3. There were some surprises. I'm wondering if people here can enlighten me. I don't exactly have that warm, fuzzy feeling.

I created a three-node cluster with RF=3. I then wrote to the cluster pretty heavily to cause some dropped mutation messages. The dropped messages didn't trickle in, but came in a burst. I suspect full GC is the culprit, but I don't really know. Anyway, I ended up with 17197 dropped mutation messages on node 1, 6422 on node 2, and none on node 3.

In order to learn about repair, I waited for compaction to finish doing its thing, recorded the size and estimated number of keys for each table, started up repair (nodetool repair keyspace) on all three nodes, and waited for it to complete before doing anything else (even reads). When repair and compaction were done, I checked the size and estimated number of keys for each table. All tables on all nodes grew in size and estimated number of keys. The estimated number of keys for each node grew by 65k, 272k and 247k (0.2%, 0.7% and 0.6%) for nodes 1, 2 and 3 respectively. I expected some growth, but that's significantly more new keys than I had dropped mutation messages. I also expected the most new data on node 1, and none on node 3, which didn't come close to what actually happened. Perhaps a mutation message contains more than one record? Perhaps the dropped mutation message counter is incremented on the coordinator, not the node that was overloaded?

I repeated repair, and the second time around the tables remained unchanged, as expected. I would hope that repair wouldn't do anything to the tables if they were in sync. Just to be clear, I'm not overly concerned about the unexpected increase in number of keys. I'm pretty sure that repair did the needful thing and did bring the nodes in sync. The unexpected results more likely indicate that I'm ignorant, and it really bothers me when I don't understand something. If you have any insights, I'd appreciate them.

One of the dismaying things about repair was that the first time around it took about 4 hours, with a completely idle cluster (except for repairs, of course), and only 6 GB of data on each node. I can bootstrap a node with 6 GB of data in a couple of minutes. That makes repair something like 50 to 100 times more expensive than bootstrapping. I know I should run repair on one node at a time, but even if you divide by three, that's still a horrifically long time for such a small amount of data. The second time around, repair only took 30 minutes. That's much better, but the best case is still about 10x longer than bootstrapping. Should repair really be taking this long? When I have 300 GB of data, is a best-case repair going to take 25 hours, and a repair with a modest amount of work more than 100 hours? My records are quite small. Those 6 GB contain almost 40 million partitions.

Following my repair experiment, I added a fourth node, and then tried killing a node and importing a bunch of data while the node was down. As far as repair is concerned, this seems to work fine (although again, glacially). However, I noticed that hinted handoff doesn't seem to be working. I added several million records (with consistency=ONE), and nothing appeared in system.hints (du -hs showed a few dozen KB), nor did I get any pending Hinted Handoff tasks in the Thread Pool Stats. When I started up the down node (less than 3 hours later), the missed data didn't appear to get sent to it. The tables did not grow, compaction events didn't schedule, and there wasn't any appreciable CPU utilization by the cluster. With millions of records that were missed while it was down, I should have noticed something if it actually was replaying the hints. Is there some magic setting to turn on hinted handoffs? Were there too many hints and so it just deleted them? My assumption is that if hinted handoff is working, then my need for repair should be much less, which given my experience so far would be a really good thing.

Given the horrifically long time it takes to repair a node, and hinted handoff apparently not working, if a node goes down, is it better to bootstrap a new one than to repair the node that went down? I would expect that even if I chose to bootstrap a new node, it would need to be repaired anyway, since it would probably miss writes while bootstrapping.

Thanks in advance, Robert
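One plausible reason repair adds keys on every node, and far more of them than the dropped-mutation counters suggest, is that replicas compare Merkle trees over token ranges and stream a whole range whenever its hash differs. The sketch below is greatly simplified (a flat set of hashed ranges rather than a real tree, and invented names), but it shows the over-streaming effect:

```python
import hashlib

def bucket_of(key, num_ranges):
    # deterministic bucket assignment (a stand-in for token ranges)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_ranges

def range_hash(rows, r, num_ranges):
    # hash of every row falling in range r (a stand-in for a Merkle leaf)
    h = hashlib.sha256()
    for key in sorted(rows):
        if bucket_of(key, num_ranges) == r:
            h.update(f"{key}={rows[key]}".encode())
    return h.hexdigest()

def ranges_to_stream(mine, theirs, num_ranges=8):
    """Ranges whose hashes differ get exchanged wholesale: one
    differing row drags every other row in its range along, in both
    directions -- which is one way repair can grow tables even on
    replicas that never dropped a mutation."""
    return [r for r in range(num_ranges)
            if range_hash(mine, r, num_ranges) != range_hash(theirs, r, num_ranges)]
```

A second run finds all range hashes equal and streams nothing, matching the observation that the repeated repair left the tables unchanged and finished much faster.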
Re: Cassandra Files Taking up Much More Space than CF
Hi Reynald, Good idea but I have incremental backups turned off and other than *.db files nothing else appears to be in the data directory for that table. Is there any other output that would be helpful in helping you all help me? Thanks, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 9:27 AM, Reynald Bourtembourg reynald.bourtembo...@esrf.fr wrote: Hi Nate, Are you using incremental backups? Extract from the documentation ( http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_backup_incremental_t.html ): *When incremental backups are enabled (disabled by default), Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots. Also, incremental backups combine with snapshots to provide a dependable, up-to-date backup mechanism.* *As with snapshots, Cassandra does not automatically clear incremental backup files. DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created.* These backups are stored in directories named backups at the same level as the snapshots' directories. Reynald On 09/12/2014 18:13, Nate Yoder wrote: Thanks for the advice. Totally makes sense. Once I figure out how to make my data stop taking up more than 2x more space without being useful I'll definitely make the change :) Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 9:02 AM, Jonathan Haddad j...@jonhaddad.com wrote: Well, I personally don't like RF=2. It means if you're using CL=QUORUM and a node goes down, you're going to have a bad time. (downtime) If you're using CL=ONE then you'd be ok. However, I am not wild about losing a node and having only 1 copy of my data available in prod. 
On Tue Dec 09 2014 at 8:40:37 AM Nate Yoder n...@whistle.com wrote: Thanks Jonathan. So there is nothing too idiotic about my current set-up with 6 boxes with 256 vnodes each and a RF of 2? I appreciate the help, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 8:31 AM, Jonathan Haddad j...@jonhaddad.com wrote: You don't need a prime number of nodes in your ring, but it's not a bad idea for it to be a multiple of your RF when your cluster is small. On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder n...@whistle.com wrote: Hi Ian, Thanks for the suggestion but I had actually already done that prior to the scenario I described (to get myself some free space), and when I ran nodetool cfstats it listed 0 snapshots as expected, so unfortunately I don't think that is where my space went. One additional piece of information I forgot to point out is that when I ran nodetool status on the node it included all 6 nodes. I have also heard it mentioned that I may want to have a prime number of nodes, which may help protect against split-brain. Is this true? If so, does it still apply when I am using vnodes? Thanks again, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose ianr...@fullstory.com wrote: Try `nodetool clearsnapshot`, which will delete any snapshots you have. I have never taken a snapshot with nodetool, yet I found several snapshots on my disk recently (which can take a lot of space). So perhaps they are automatically generated by some operation? No idea. Regardless, nuking those freed up a ton of space for me. - Ian On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder n...@whistle.com wrote: Hi All, I am new to Cassandra so I apologise in advance if I have missed anything obvious, but this one currently has me stumped.
I am currently running a 6 node Cassandra 2.1.1 cluster on EC2 using C3.2XLarge nodes, which overall is working very well for us. However, after letting it run for a while I seem to get into a situation where the amount of disk space used far exceeds the total amount of data on each node, and I haven't been able to get the size to go back down except by stopping and restarting the node. For example, almost all of my data is in one table. On one of my nodes right now the total space used (as reported by nodetool cfstats) is 57.2 GB and there are no snapshots. However, when I look at the size of the data files (using du) the data file for that table is 107 GB. Because the C3.2XLarge only has 160 GB of SSD, you can see why this quickly becomes a problem. Running nodetool compact didn't reduce the size and neither does running nodetool repair -pr on the node. I also tried nodetool flush and nodetool cleanup (even though I have not added or removed any
Re: Cassandra Files Taking up Much More Space than CF
Hi All, Thanks for the help but after yet another day of investigation I think I might be running into this https://issues.apache.org/jira/browse/CASSANDRA-8061 issue where tmplink files aren't removed until Cassandra is restarted. Thanks again for all the suggestions! Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 10:18 AM, Nate Yoder n...@whistle.com wrote: [... earlier messages in this thread quoted verbatim, trimmed ...]
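If CASSANDRA-8061 is indeed the culprit, one way to confirm where the "missing" space is going before restarting a node is to walk the table's data directory and split the on-disk bytes into live SSTable components versus tmp/tmplink leftovers. The sketch below is a minimal illustration, assuming the `tmplink`/`-tmp-` file-name markers used by Cassandra 2.1; verify the patterns against your own data files before trusting the numbers.

```python
import os

def sstable_disk_breakdown(table_dir):
    """Walk a Cassandra table data directory and split on-disk bytes into
    live SSTable components vs tmp/tmplink leftovers. The file-name
    patterns below are an assumption (Cassandra 2.1-era naming);
    check them against your own data directory."""
    live_bytes, tmp_bytes, tmp_files = 0, 0, []
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if "tmplink" in name or "-tmp-" in name:
                tmp_bytes += size
                tmp_files.append(path)
            else:
                live_bytes += size
    return live_bytes, tmp_bytes, tmp_files

# Example with a hypothetical data path; compare live_bytes against the
# "Space used" figure that nodetool cfstats reports for the same table:
# live, tmp, leftovers = sstable_disk_breakdown(
#     "/var/lib/cassandra/data/my_keyspace/my_table")
```

A large `tmp_bytes` figure relative to `live_bytes` would be consistent with the tmplink leak described in the JIRA ticket.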
Best practice for emulating a Cassandra timeout during unit tests?
Hi all, I'd like to write some tests for my code that uses the Cassandra Java driver to see how it behaves if there is a read timeout while accessing Cassandra. Is there a best-practice for getting this done? I was thinking about adjusting the settings in the cluster builder to adjust the timeout settings to be something impossibly low (like 1ms), but I'd rather do something to my test Cassandra instance (using the EmbeddedCassandraService) to temporarily slow it down. Any suggestions? Best regards, Clint
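One driver-agnostic way to provoke a client-side read timeout, rather than slowing down the embedded Cassandra instance itself, is to point the client at a local "blackhole" socket that accepts connections but never responds. The sketch below is a minimal illustration of that idea in Python using only the standard library; the same trick works with any client configured with a short read timeout (e.g. the Java driver's socket options), though the helper names here are my own invention.

```python
import socket
import threading

def start_blackhole_server(port=0):
    """Start a TCP server that accepts connections but never sends a byte,
    so any client read against it will eventually time out.
    Returns (bound_port, stop_fn). Hypothetical helper for test suites."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", port))
    srv.listen(5)
    stopped = threading.Event()
    conns = []

    def accept_loop():
        while not stopped.is_set():
            try:
                conn, _addr = srv.accept()
                conns.append(conn)  # hold the connection open, send nothing
            except OSError:
                break  # listening socket closed by stop()

    threading.Thread(target=accept_loop, daemon=True).start()

    def stop():
        stopped.set()
        srv.close()
        for c in conns:
            c.close()

    return srv.getsockname()[1], stop
```

In a test you would point the client's contact point at 127.0.0.1 on the returned port, set an aggressively short read timeout, and assert that the timeout exception surfaces the way your code expects.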
Re: Cassandra Files Taking up Much More Space than CF
Thanks Rob. Definitely good advice that I wish I had come across a couple of months ago... That said, it still definitely points me in the right direction as to what to do now. -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 12:21 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Dec 8, 2014 at 5:12 PM, Nate Yoder n...@whistle.com wrote: I am currently running a 6 node Cassandra 2.1.1 cluster on EC2 using C3.2XLarge nodes which overall is working very well for us. However, after letting it run for a while I seem to get into a situation where the amount of disk space used far exceeds the total amount of data on each node and I haven't been able to get the size to go back down except by stopping and restarting the node. [... link to rather serious bug in 2.1.1 version in JIRA ...] https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ =Rob
upgrade cassandra from 2.0.6 to 2.1.2
I looked at some upgrade documentation and am a little puzzled. According to https://github.com/apache/cassandra/blob/cassandra-2.1/NEWS.txt, “Rolling upgrades from anything pre-2.0.7 is not supported”. Does that mean we should upgrade to 2.0.7 or later first? Can we do a rolling upgrade to 2.0.7? Do we need to run upgradesstables after that? NEWS.txt seems to say nothing specific about upgrading between 2.0.6 and 2.0.7. Any advice will be kindly appreciated.
Cassandra Maintenance Best practices
Hi, We have a two-node cluster configuration in production with RF=2, which means the data is written to both nodes. It has been running for about a month now and has a good amount of data. Questions: 1. What are the best practices for maintenance? 2. Is OpsCenter required to be installed, or can I manage with the nodetool utility? 3. Is it necessary to run repair weekly? thanks regards Neha
[Cassandra][SStableLoader Out of Heap Memory]
Hi, Everyone: I'm importing a CSV file into Cassandra using sstableloader, following the example here: https://github.com/yukim/cassandra-bulkload-example/ When I try to run sstableloader, it fails with an OutOfMemoryError. I also changed the sstableloader.sh script (which runs java -cp ... BulkLoader) to give it more memory using the -Xms and -Xmx args, but I still keep hitting the same issue. Any hints/directions would be really helpful.

Stack trace:

/usr/bin/sstableloader -v -d internal-ip /tmp/nitin_test/nitin_test_load/
Established connection to initial hosts
Opening sstables and calculating sections to stream
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:144)
    at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:120)
    at org.apache.cassandra.io.sstable.SSTableReader.buildSummary(SSTableReader.java:457)
    at org.apache.cassandra.io.sstable.SSTableReader.openForBatch(SSTableReader.java:170)
    at org.apache.cassandra.io.sstable.SSTableLoader$1.accept(SSTableLoader.java:112)
    at java.io.File.list(File.java:1155)
    at org.apache.cassandra.io.sstable.SSTableLoader.openSSTables(SSTableLoader.java:73)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:155)
    at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:66)

Best Regards!
Chao Yan
--
My twitter: Andy Yan @yanchao727 https://twitter.com/yanchao727
My Weibo: http://weibo.com/herewearenow
--
Re: upgrade cassandra from 2.0.6 to 2.1.2
Yes. It is, in general, a best practice to upgrade to the latest bug fix release before doing an upgrade to the next point release. On Tue Dec 09 2014 at 6:58:24 PM wyang wy...@v5.cn wrote: [... original question quoted verbatim, trimmed ...]
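The rule from the thread can be written down explicitly: if the current version is pre-2.0.7 and the target is on the 2.1 line, hop to the latest 2.0.x bug-fix release first, then continue to the target. A minimal sketch of that decision, where `LATEST_20X` is a placeholder I chose rather than a value from the thread:

```python
def upgrade_path(current, target):
    """Plan a Cassandra upgrade per the rule quoted in NEWS.txt:
    rolling upgrades to 2.1 from anything pre-2.0.7 are not supported,
    so hop to the latest 2.0.x bug-fix release first.
    Versions are (major, minor, patch) tuples, which compare
    lexicographically in Python."""
    LATEST_20X = (2, 0, 11)  # assumption: newest 2.0.x at the time of writing
    steps = []
    if current < (2, 0, 7) and target >= (2, 1, 0):
        steps.append(LATEST_20X)
    steps.append(target)
    return steps
```

For the original poster's case, `upgrade_path((2, 0, 6), (2, 1, 2))` yields a two-hop plan through the 2.0.x line, while a node already on 2.0.7+ can go straight to 2.1.2.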
Re: Cassandra Maintenance Best practices
I did a presentation on diagnosing performance problems in production at the US and Euro summits, in which I covered quite a few tools and preventative measures you should know about when running a production cluster. You may find it useful: http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/ On OpsCenter - I recommend it. It gives you a nice dashboard. I don't think it's completely comprehensive (but no tool really is), but it gets you 90% of the way there. It's a good idea to run repairs, especially if you're doing deletes or querying at CL=ONE. I assume you're not using QUORUM, because with RF=2 that's the same as CL=ALL. I recommend at least RF=3, because if you lose 1 server you're on the edge of data loss. On Tue Dec 09 2014 at 7:19:32 PM Neha Trivedi nehajtriv...@gmail.com wrote: [... original question quoted verbatim, trimmed ...]
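The "QUORUM with RF=2 is the same as ALL" point follows from simple arithmetic: a quorum is floor(RF/2) + 1 replicas. A one-line sketch of that calculation:

```python
def quorum(rf):
    """Replicas required for a QUORUM read/write: floor(RF/2) + 1."""
    return rf // 2 + 1

# With RF=2, quorum(2) == 2 == RF, i.e. QUORUM behaves like CL=ALL:
# a single down replica fails QUORUM operations.
# With RF=3, quorum(3) == 2, so one node can be down and QUORUM still works.
```

This is why the advice in the thread is RF=3 at minimum for clusters that want both availability and strong consistency via QUORUM.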