High latency on 5 node Cassandra Cluster
Hello. We had some major latency problems yesterday with our 5-node Cassandra cluster. I wanted to get some feedback on where we could start looking to figure out what caused the issue. If there is more info I should provide, please let me know. Here are the basics of the cluster:

Clients: Hector and Cassie
Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
Replication factor: 5
Quorum reads and writes enabled
Read repair set to true
Cassandra version: 1.0.12

We started experiencing catastrophic latency from our app servers. We believed at the time this was due to compactions running and the clients not re-routing appropriately, so we disabled Thrift on a single node that had high load. This did not resolve the issue. After that, we stopped gossip on the same node; again, this did not resolve anything. We then took down gossip on another node (leaving 3/5 up) and that fixed the latency from the application side.

For a period of ~4 hours, every time we tried to bring up a fourth node, the app would see the latency again. We then rotated which three nodes were up, to make sure it was not a networking event related to a single region/provider, and we kept seeing the same problem: 3 nodes showed no latency problem, 4 or 5 nodes did. After the ~4 hours, we brought the cluster up to 5 nodes and everything was fine.

We currently have some ideas on what caused this behavior, but has anyone else seen this type of problem, where a full cluster causes problems but removing nodes fixes it? Any input on what to look for in our logs to understand the issue?

Thanks
Arup
memtable mem usage off by 10?
Hi,

I'm seeing some strange behavior of the memtables, both in 1.2.13 and 2.0.7: basically it looks like they're using 10x less memory than they should based on the documentation and options. 10 GB heap for both clusters.

1.2.x should use 1/3 of the heap for memtables, but it uses a max of ~300 MB before flushing.
2.0.7: same, but 1/4 of the heap and ~250 MB.

In the 2.0.7 cluster I set memtable_total_space_in_mb to 4096, which then allowed Cassandra to use up to ~400 MB for memtables. I'm now running with 20480 for memtable_total_space_in_mb and Cassandra is using ~2 GB for memtables. So, off by 10 somewhere? Has anyone else seen this? I can't find a JIRA issue for any bug connected to this.

Java 1.7.0_55, JNA 4.1.0 (for the 2.0 cluster)

BR
Johan
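The mismatch described above can be sanity-checked with quick arithmetic. This is only a sketch using the figures from the message (10 GB heap, documented fractions of 1/3 and 1/4, observed flush sizes of ~300 MB and ~250 MB), not a diagnosis:

```python
# Rough arithmetic behind the "off by 10" observation. All figures are
# taken from the message above; nothing here is measured independently.

HEAP_MB = 10 * 1024  # 10 GB heap on both clusters

expected_12 = HEAP_MB / 3  # 1.2.x docs: memtables may use 1/3 of heap
expected_20 = HEAP_MB / 4  # 2.0.x docs: memtables may use 1/4 of heap

observed_12 = 300  # MB reported before flush on 1.2.13
observed_20 = 250  # MB reported before flush on 2.0.7

print(expected_12 / observed_12)  # ~11x gap
print(expected_20 / observed_20)  # ~10x gap
```

Both versions come out roughly a factor of ten below the documented limit, which is the puzzle the rest of the thread addresses.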
Re: memtable mem usage off by 10?
If you are storing small values in your columns, the object overhead is very substantial. So what is 400 MB on disk may well be 4 GB in memtables; if you are measuring the memtable size by the resulting sstable size, you are not getting an accurate picture.

This overhead has been reduced by about 90% in the upcoming 2.1 release, through tickets CASSANDRA-6271 (https://issues.apache.org/jira/browse/CASSANDRA-6271), CASSANDRA-6689 (https://issues.apache.org/jira/browse/CASSANDRA-6689) and CASSANDRA-6694 (https://issues.apache.org/jira/browse/CASSANDRA-6694).
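The point above can be sketched as arithmetic. The ~10x factor below is an assumption implied by this thread for this particular workload; real per-cell overhead varies with schema and cell size:

```python
# Sketch: with small cells, JVM object overhead dominates, so the
# user-data size reported for a memtable understates its heap footprint.
# OVERHEAD_FACTOR is an assumption taken from this thread, not a constant.

OVERHEAD_FACTOR = 10

def estimated_heap_mb(user_data_mb, overhead=OVERHEAD_FACTOR):
    # Approximate heap footprint of a memtable holding user_data_mb of cell data.
    return user_data_mb * overhead

print(estimated_heap_mb(400))  # 4000 -> ~4 GB of heap for 400 MB of user data
```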
RE: memtable mem usage off by 10?
I'm not measuring memtable size by looking at the sstables on disk, no. I'm looking at the JMX data, so I would believe (or hope) that I'm getting relevant data.

If I have a heap of 10 GB and set the memtable usage to 20 GB, I would expect to hit other problems, but I'm not seeing memory usage over 10 GB for the heap, and the machine (which has ~30 GB of memory) is showing ~10 GB free, with ~12 GB used by Cassandra and the rest in caches. We're reading 8k rows/s and writing 2k rows/s on a 3-node cluster, so it's not idling.

BR
Johan
Re: memtable mem usage off by 10?
These measurements tell you the amount of user data stored in the memtables, not the amount of heap used to store it, so the same applies.
Re: High latency on 5 node Cassandra Cluster
I would first check to see if there was a time synchronization issue among the nodes that triggered and/or perpetuated the event.

ml
RE: memtable mem usage off by 10?
Aha, ok. Thanks.

Trying to understand what my cluster is doing: cassandra.db.memtable_data_size only gets me the actual data, not the memtable heap memory usage. Is there a way to check heap memory usage?

I would expect to hit the flush_largest_memtables_at value, and that would be what causes the memtable flush to sstable then? By default 0.75? Then I would expect the maximum memory used to be ~3x what I was seeing before I set memtable_total_space_in_mb (1/4 of the heap by default, max 3/4 before a flush), instead of close to 10x (250 MB vs 2 GB). This is of course assuming that the overhead scales linearly with the amount of data in my table; we're using one table with three cells in this case. If it hardly increases at all, then I'll give up, I guess :) At least until 2.1.0 comes out and I can compare.

BR
Johan
Re: memtable mem usage off by 10?
Unfortunately it looks like the heap utilisation of memtables was not exposed in earlier versions, because they only maintained an estimate. The overhead scales linearly with the amount of data in your memtables (assuming the size of each cell is approximately constant).

flush_largest_memtables_at is a setting independent of memtable_total_space_in_mb, and generally has little effect. Ordinarily sstable flushes are triggered by hitting the memtable_total_space_in_mb limit. I'm afraid I don't follow where your 3x comes from?
RE: memtable mem usage off by 10?
Ok, so the overhead is a constant modifier, right. The 3x I arrived at with the following assumptions:

Heap is 10 GB.
Default memory for memtable usage is 1/4 of the heap in C* 2.0, so max memory used for memtables is 2.5 GB (10/4).
flush_largest_memtables_at is 0.75, so the largest memtables are flushed when memtables use 7.5 GB (3/4 of the heap, 3x the default).

With an overhead of 10x, it makes sense that my memtable is flushed when the JMX data says it is at ~250 MB, i.e. 2.5 GB, i.e. 1/4 of the heap. After I've set memtable_total_space_in_mb to a value larger than 7.5 GB, usage should still not go over 7.5 GB on account of flush_largest_memtables_at (3/4 of the heap), so I would expect to see memtables flushed to disk when they're reportedly at around 750 MB.

But with memtable_total_space_in_mb set to 20480, memtables are flushed at a reported value of ~2 GB. With a constant overhead, this would mean they used 20 GB, which is 2x the size of the heap, instead of 3/4 of the heap as it should be if flush_largest_memtables_at were being respected. This shouldn't be possible.
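The reasoning above can be laid out as arithmetic. Note the caveats: the 0.75 flush ceiling assumed here turns out not to exist in C* 2.0 (see the follow-up in this thread), and the constant ~10x overhead is itself an assumption, so this spells out the argument being examined rather than established facts:

```python
# Johan's reasoning, spelled out. All figures come from his message;
# the flush_largest_memtables_at ceiling assumed below does not actually
# exist in C* 2.0 (per the follow-up), so this is the argument, not fact.

heap_mb = 10 * 1024
overhead = 10                         # assumed constant overhead factor
default_limit_mb = heap_mb / 4        # 2560 MB: C* 2.0 default (1/4 of heap)
assumed_ceiling_mb = heap_mb * 0.75   # 7680 MB: the assumed 0.75 ceiling

# With ~10x overhead, hitting the 2.5 GB default limit corresponds to a
# reported (JMX) size of ~256 MB, matching the observed ~250 MB flushes:
print(default_limit_mb / overhead)    # 256.0

# But raising memtable_total_space_in_mb to 20480 and seeing flushes at a
# reported ~2 GB implies ~20 GB of actual heap use -- twice the heap:
implied_heap_mb = 2 * 1024 * overhead
print(implied_heap_mb > heap_mb)      # True: the contradiction
```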
Re: memtable mem usage off by 10?
I'm confused: there is no flush_largest_memtables_at property in C* 2.0?
RE: memtable mem usage off by 10?
Oh, well ok, that explains why I'm not seeing a flush at 750 MB. Sorry, I was going by the documentation; it claims that the property is around in 2.0.

If we skip that, part of my reply still makes sense: having memtable_total_space_in_mb set to 20480, memtables are flushed at a reported value of ~2 GB. With a constant overhead of ~10x, as suggested, this would mean that they used 20 GB, which is 2x the size of the heap. That shouldn't work. According to the OS, Cassandra doesn't use more than ~11-12 GB.
Re: Multi-DC Environment Question
Hello Matt,

nodetool status:

Datacenter: MAN
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.103  89.34 KB   99.2%             b7f8bc93-bf39-475c-a251-8fbe2c7f7239  -9211685935328163899  RAC1
UN  10.2.1.102  86.32 KB   0.7%              1f8937e1-9ecb-4e59-896e-6d6ac42dc16d  -3511707179720619260  RAC1

Datacenter: DER
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.101  75.43 KB   0.2%              e71c7ee7-d852-4819-81c0-e993ca87dd5c  -1277931707251349874  RAC1
UN  10.2.1.100  104.53 KB  99.8%             7333b664-ce2d-40cf-986f-d4b4d4023726  -9204412570946850701  RAC1

I do not know why the cluster is not balanced at the moment, but it holds almost no data. I will populate it soon and see how that goes. The output of 'nodetool ring' just lists all the tokens assigned to each individual node, and as you can imagine it would be pointless to paste it here. I just did 'nodetool ring | awk ... | uniq | wc -l' and it works out to be 1024 as expected (4 nodes x 256 tokens each). Still have not got the answers to the other questions though...

Thanks,
Vasilis

On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com wrote:
Thanks Vasileios. I think I need to make a call as to whether to switch to vnodes or stick with tokens for my multi-DC cluster. Would you be able to show a nodetool ring/status from your cluster to see what the token assignment looks like?
Thanks
Matt

On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:
I should have said that earlier really... I am using 1.2.16 and vnodes are enabled.
Thanks,
Vasilis

--
Kind Regards,
Vasileios Vlachos
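The token-count sanity check in the message above amounts to the following (assuming num_tokens is at its 256 default on all four nodes, as the message implies):

```python
# Sanity check from the message: with vnodes, each node owns num_tokens
# tokens, so `nodetool ring` should list nodes * num_tokens unique tokens.
nodes = 4
num_tokens = 256  # per-node vnode count; assumed from the message

print(nodes * num_tokens)  # 1024, matching the `... | uniq | wc -l` result
```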
Re: memtable mem usage off by 10?
Yeah, it is in the doc: http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html And I don’t find a Jira issue mentioning it being removed, so... what’s the full story there?! -- Jack Krupansky

From: Idrén, Johan
Sent: Wednesday, June 4, 2014 8:26 AM
To: user@cassandra.apache.org
Subject: RE: memtable mem usage off by 10?

Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0. If we skip that, part of my reply still makes sense: Having memtable_total_size_in_mb set to 20480, memtables are flushed at a reported value of ~2GB. With a constant overhead of ~10x, as suggested, this would mean that it used 20GB, which is 2x the size of the heap. That shouldn't work. According to the OS, cassandra doesn't use more than ~11-12GB.

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 2:07 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

I'm confused: there is no flush_largest_memtables_at property in C* 2.0?

On 4 June 2014 12:55, Idrén, Johan johan.id...@dice.se wrote:

Ok, so the overhead is a constant modifier, right. The 3x I arrived at with the following assumptions:
heap is 10GB
default memory for memtable usage is 1/4 of heap in C* 2.0
max memory used for memtables is 2,5GB (10/4)
flush_largest_memtables_at is 0.75
flush largest memtables when memtables use 7,5GB (3/4 of heap, 3x of the default)

With an overhead of 10x, it makes sense that my memtable is flushed when the JMX data says it is at ~250MB, i.e. 2,5GB, i.e. 1/4 of the heap. After I've set memtable_total_size_in_mb to a value larger than 7,5GB, it should still not go over 7,5GB on account of flush_largest_memtables_at, 3/4 of the heap. So I would expect to see memtables flushed to disk when they're reportedly at around 750MB. Having memtable_total_size_in_mb set to 20480, memtables are flushed at a reported value of ~2GB. With a constant overhead, this would mean that it used 20GB, which is 2x the size of the heap, instead of 3/4 of the heap as it should be if flush_largest_memtables_at was being respected. This shouldn't be possible.

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 1:19 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

Unfortunately it looks like the heap utilisation of memtables was not exposed in earlier versions, because they only maintained an estimate. The overhead scales linearly with the amount of data in your memtables (assuming the size of each cell is approx. constant). flush_largest_memtables_at is an independent setting to memtable_total_space_in_mb, and generally has little effect. Ordinarily sstable flushes are triggered by hitting the memtable_total_space_in_mb limit. I'm afraid I don't follow where your 3x comes from?

On 4 June 2014 12:04, Idrén, Johan johan.id...@dice.se wrote:

Aha, ok. Thanks. Trying to understand what my cluster is doing: cassandra.db.memtable_data_size only gets me the actual data but not the memtable heap memory usage. Is there a way to check for heap memory usage? I would expect to hit the flush_largest_memtables_at value, and this would be what causes the memtable flush to sstable then? By default 0.75? Then I would expect the amount of memory used to be a maximum of ~3x what I was seeing when I hadn't set memtable_total_space_in_mb (1/4 by default, max 3/4 before a flush), instead of close to 10x (250MB vs 2GB). This is of course assuming that the overhead scales linearly with the amount of data in my table; we're using one table with three cells in this case. If it hardly increases at all, then I'll give up I guess :) At least until 2.1.0 comes out and I can compare.

BR Johan

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 12:33 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

These measurements tell you the amount of user data stored in the memtables, not the amount of heap used to store it, so the same applies.

On 4 June 2014 11:04, Idrén, Johan johan.id...@dice.se wrote: I'm not measuring memtable size by looking at the sstables on disk, no. I'm looking through the JMX data. So I would believe (or hope) that I'm getting relevant data. If I have a heap of 10GB and set the memtable usage to 20GB, I would expect to hit other problems, but I'm not seeing
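Johan's numbers can be sanity-checked with a quick back-of-the-envelope script. This is a sketch only: the figures come from the messages in this thread, and the 10x factor is the liveRatio estimate that, as noted later in the thread, Cassandra falls back to when it cannot measure live memory.

```python
# Back-of-the-envelope check of the flush points reported in this thread.
# All figures are taken from the emails; nothing here calls Cassandra.

HEAP_MB = 10 * 1024        # 10GB heap
DEFAULT_FRACTION = 1 / 4   # C* 2.0 default: 1/4 of heap for memtables
LIVE_RATIO = 10            # fallback liveRatio estimate when JAMM is absent

memtable_space_mb = HEAP_MB * DEFAULT_FRACTION            # heap budget: 2560 MB
# JMX reports user data size, i.e. heap usage divided by the liveRatio:
reported_flush_point_mb = memtable_space_mb / LIVE_RATIO  # ~256 MB

print(round(memtable_space_mb), round(reported_flush_point_mb))
```

The same arithmetic with memtable_total_space_in_mb raised to 20480 gives 20480/10 = 2048 MB, matching the ~2GB reported flush point Johan observes.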
Re: memtable mem usage off by 10?
Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0.

But something else is wrong, as Cassandra will crash if you supply an invalid property, implying it's not sourcing the config file you're using. I'm afraid I don't have the context for why it was removed, but it happened as part of the 2.0 release.

On 4 June 2014 13:59, Jack Krupansky j...@basetechnology.com wrote: Yeah, it is in the doc: http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html And I don’t find a Jira issue mentioning it being removed, so... what’s the full story there?! -- Jack Krupansky
Re: memtable mem usage off by 10?
I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.

From: Benedict Elliott Smith belliottsm...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Wednesday 4 June 2014 16:36
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0. But something else is wrong, as Cassandra will crash if you supply an invalid property, implying it's not sourcing the config file you're using. I'm afraid I don't have the context for why it was removed, but it happened as part of the 2.0 release.
Re: memtable mem usage off by 10?
And sorry that the doc confused you as well! -- Jack Krupansky

From: Idrén, Johan
Sent: Wednesday, June 4, 2014 10:51 AM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.
Re: memtable mem usage off by 10?
In that case I would assume the problem is that for some reason JAMM is failing to load, and so the liveRatio it would ordinarily calculate is defaulting to 10 - are you using the bundled cassandra launch scripts?

On 4 June 2014 15:51, Idrén, Johan johan.id...@dice.se wrote: I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.
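Since the question hinges on whether the launch scripts enable the JAMM java agent, a quick way to check is to look for the agent flag in the environment script. A minimal sketch, assuming a cassandra-env.sh-style file; the helper name and the sample line are illustrative, not taken from this thread:

```python
# Sketch: scan cassandra-env.sh text for the JAMM -javaagent flag.
# If the agent is missing, liveRatio falls back to its default of 10.
import re

def has_jamm_agent(env_text: str) -> bool:
    """Return True if any non-comment line enables the jamm java agent."""
    for line in env_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out lines don't count
        if re.search(r"-javaagent:\S*jamm\S*\.jar", stripped):
            return True
    return False

# Hypothetical line in the style of the bundled scripts:
sample = 'JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.2.5.jar"\n'
print(has_jamm_agent(sample))
```

If this returns False for the script actually used to start the node, the 10x discrepancy would be consistent with Benedict's explanation above.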
Re: migration to a new model
OK Marcelo, I'll work on it today. -ml

On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Michael, For sure I would be interested in this program! I am new both to Python and to CQL. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work fine. I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor Just in case it's useful for anything. However, I saw CQL has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful. My two servers today are at OVH (ovh.com); we have servers at AWS but in several cases we prefer other hosts. Both servers have SSD and 64 GB RAM, and I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed if this is going to help. Regards Marcelo.

2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

Hi Marcelo, I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this? With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable size rows averaging 5kb. This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver. ml

On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Jens, Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but no success.
I only have two nodes and less than 200GB in this cluster, so any simple way to extract the data quickly would be good enough for me. Best regards, Marcelo.

2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

Hi Marcelo, Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql Cheers, Jens

On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi, I have some CQL column families in a 2 node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition key. Instead of:

CREATE TABLE IF NOT EXISTS entity_lookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY ((name, value), entity_id)
) WITH caching=all;

I used:

CREATE TABLE IF NOT EXISTS entitylookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY (name, value, entity_id)
) WITH caching=all;

Now I need to migrate the data from the second CF to the first one. I am using DataStax Community Edition. What would be the best way to convert data from one CF to the other? Best regards, Marcelo.
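For what it's worth, the copy itself reduces to re-inserting every row under the corrected schema. A minimal sketch using the table definitions from the thread; `row_to_insert` is a hypothetical helper, and a real migration would page through SELECTs with the DataStax Python driver rather than build literal statements:

```python
# Sketch: format a row fetched from the old table (entitylookup) as a CQL
# INSERT against the corrected table (entity_lookup). In a real migration
# the rows would come from a paged SELECT via the Python driver.

def row_to_insert(name: str, value: str, entity_id: str) -> str:
    """Build one CQL INSERT for the corrected entity_lookup table."""
    esc = lambda s: s.replace("'", "''")  # escape single quotes for CQL
    return (
        "INSERT INTO entity_lookup (name, value, entity_id) "
        f"VALUES ('{esc(name)}', '{esc(value)}', {entity_id});"
    )

stmt = row_to_insert("email", "a@b.c", "550e8400-e29b-41d4-a716-446655440000")
print(stmt)
```

In practice a prepared INSERT plus the driver's concurrent-execution helpers would be faster and safer than string-built statements, which is essentially what Michael's program does.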
Customized Compaction Strategy: Dev Questions
Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process. Since we're using Cassandra as an auditing system, this is particularly appealing to us because it means we can maximize the amount of auditing data we can keep while still allowing Cassandra to clear old data automatically.

My idea is this: perform compaction based on the range of dates available in the sstable (or just metadata about when it was created). For example, a major compaction could create a combined sstable per day--so that, say, 60 days of data after a major compaction would contain 60 sstables.

My question then is, will this be possible by simply implementing a separate AbstractCompactionStrategy? Does this sound feasible at all? Based on the implementation of the Size and Leveled strategies, it looks like I would have the ability to control what and how things get compacted, but I wanted to verify before putting time into it. Thank you so much for your time! Andrew
Re: Customized Compaction Strategy: Dev Questions
You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ?

On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote: Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process.
Cassandra 2.0 unbalanced ring with vnodes after adding new node
Hello to everyone! Please, can someone explain where we made a mistake?

We have a cluster with 4 nodes which uses vnodes (256 per node, default settings); the snitch is the default on every node: SimpleSnitch. These four nodes were there from the beginning of the cluster. In this cluster we have a keyspace with these options:

Keyspace: K:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
  Options: [replication_factor:3]

All was normal, and nodetool status K showed that each node owns 75% of the whole key range. All 4 nodes are located in the same datacenter and have the same first two bytes in their IP addresses (the others differ). Then we bought a new server in a different datacenter and added it to the cluster with the same settings as the previous four nodes (differing only in listen_address), assuming that the effective ownership of each node for this keyspace would be 300/5=60% or thereabouts. But 3-5 minutes after start, nodetool status K shows this:

nodetool status K;
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load     Tokens  Owns (effective)  Host ID                               Rack
UN  N1       6,06 GB  256     50.0%             62f295b3-0da6-4854-a53a-f03d6b424b03  rack1
UN  N2       5,89 GB  256     50.0%             af4e4a23-2610-44dd-9061-09c7a6512a54  rack1
UN  N3       6,02 GB  256     50.0%             0f0e4e78-6fb2-479f-ad76-477006f76795  rack1
UN  N4       5,8 GB   256     50.0%             670344c0-9856-48cf-9ec9-1a98f9a89460  rack1
UN  N5       7,51 GB  256     100.0%            82473d14-9e36-4ae7-86d2-a3e526efb53f  rack1

N5 is the newly added node. nodetool repair -pr on N5 doesn't change anything. nodetool describering K shows that the new node N5 participates in EACH range. This is not what we want at all. It looks like cassandra adds the new node to each range because it is located in a different datacenter, but all settings and output say exactly the opposite.
Also an interesting point: while in all config files the snitch is defined as SimpleSnitch, the output of the command nodetool describecluster is:

Cluster Information:
  Name: Some Cluster Name
  Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
  Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  Schema versions:
    26b8fa37-e666-31ed-aa3b-85be75f2aa1a: [N1, N2, N3, N4, N5]

We use Cassandra 2.0.6. Questions we have at this moment:
1. How to rebalance the ring so all nodes will own 60% of the range?
1a. Is removing the node from the cluster and adding it again a solution?
2. Where did we possibly make a mistake when adding the new node?
3. If we add a new 6th node to the ring, will it take 50% from N5 or some portion from each node?

Thanks in advance! -- С уважением, Владимир Рудев (With regards, Vladimir Rudev) vladimir.ru...@gmail.com
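The expectation in the question follows from simple arithmetic: with SimpleStrategy, each key is replicated to RF nodes, so on a balanced ring each of N nodes effectively owns RF/N of the data. A quick sketch of that calculation:

```python
# Expected per-node effective ownership on a balanced vnode ring with
# SimpleStrategy: each key lives on RF nodes, so ownership sums to RF*100%.

def effective_ownership_pct(replication_factor: int, nodes: int) -> float:
    """Expected per-node effective ownership, in percent."""
    return 100.0 * replication_factor / nodes

print(effective_ownership_pct(3, 4))  # the original 4-node cluster: 75% each
print(effective_ownership_pct(3, 5))  # expected after adding N5: 60% each
```

The observed 50%/100% split instead matches a ring where N5 holds a replica of everything, which is why the describering output above is the key symptom.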
Re: Customized Compaction Strategy: Dev Questions
Not quite; if I'm at say 90% disk usage, I'd like to drop the oldest sstable rather than simply run out of space. The problem with using TTLs is that I have to try and guess how much data is being put in--since this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing, etc. I'd like to maximize the disk space--not optimize the cleanup process. Andrew

On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry rbradbe...@gmail.com wrote: You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ?
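The drop-oldest policy described here can be sketched independently of Cassandra internals (hypothetical data layout; a real compaction strategy would operate on live sstable metadata, not tuples): given sstable sizes and creation times, delete the oldest until total usage falls under a target.

```python
# Sketch of FIFO cleanup: choose the oldest sstables to drop until total
# size fits under a disk budget. The (name, size, created) tuples are
# hypothetical stand-ins for real sstable metadata.

def sstables_to_drop(sstables, budget_bytes):
    """Return names of the oldest sstables to delete to get under budget.

    sstables: list of (name, size_bytes, created_ts) tuples.
    """
    total = sum(size for _, size, _ in sstables)
    drop = []
    for name, size, _ in sorted(sstables, key=lambda t: t[2]):  # oldest first
        if total <= budget_bytes:
            break
        drop.append(name)
        total -= size
    return drop

tables = [("a-1", 400, 100), ("a-2", 300, 200), ("a-3", 300, 300)]
print(sstables_to_drop(tables, 700))  # total 1000 > 700, so drop the oldest
```

The selection logic is the easy part; the open question in the thread is whether a custom AbstractCompactionStrategy is the right hook for actually deleting the chosen sstables.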
Re: Customized Compaction Strategy: Dev Questions
hmm, I see. So something similar to Capped Collections in MongoDB.
Re: Customized Compaction Strategy: Dev Questions
Thanks, Russell--yes, a similar concept, just applied to sstables. I'm assuming this would require changes to both major compactions, and probably GC (to remove the old tables), but since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible with the current toolset before I actually dived in and started tinkering. Andrew
Re: Customized Compaction Strategy: Dev Questions
I’m not sure what you want to do is feasible. At a high level I can see you running into issues with RF etc. The SSTables node to node are not identical, so if you drop a full SSTable on one node there is no corresponding SSTable on the adjacent nodes to drop. You would need to choose data to compact out, and ensure it is removed on all replicas as well. But if your problem is that you’re low on disk space then you probably won’t be able to write out a new SSTable with the older information compacted out. Also, there is more to an SSTable than just data; the SSTable could have tombstones and other relics that haven’t been cleaned up from nodes coming or going.
Linux containers, docker, SSD, and RAID.
Hey guys. Question about using containers with Cassandra. I think we will eventually deploy on containers… lxc with docker probably. Our first config will have one cassandra daemon per box. Of course there are issues here. A larger per-VM heap means more GC time and potential stop-the-world and latency issues. And we also have to run SSD on RAID which is no fun. So I think what we're planning on doing is running with 32-64GB boxes, with 8-16GB of memory per container. If we have 4x SSDs on a box, then we can have each container have its own SSD, its own memory, etc. One issue is data placement. Obviously we don't want to put all the data on the same box… so I was thinking of telling Cassandra that all the lxc containers on a given host are in the same rack. Right now there are data centers and racks, which you have to focus on in terms of replica placement. But now there's one additional level… host. So I was thinking we could just have rack IDs be rack.host… or rack_host. This way cassandra knows not to place a replica on the same host but just in a different container. Thoughts? -- Founder/CEO Spinn3r.com Location: San Francisco, CA Skype: burtonator blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
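The rack_host idea can be sanity-checked with a toy placement loop (illustrative only; this is not Cassandra's NetworkTopologyStrategy code, and every name here is made up): if each container's rack ID encodes the physical host, any rack-aware placement that refuses to repeat a rack also refuses to put two replicas in containers on the same host.

```python
# Toy rack-aware placement to illustrate the "<rack>_<host>" scheme.
# Not Cassandra code; purely a sketch of the constraint it would give.
def place_replicas(nodes, rf):
    """Pick up to rf replicas, never reusing a rack ID.
    nodes: ring-ordered list of (container, rack_id) tuples."""
    chosen, used_racks = [], set()
    for container, rack in nodes:
        if rack not in used_racks:
            chosen.append(container)
            used_racks.add(rack)
        if len(chosen) == rf:
            break
    return chosen

nodes = [
    ("c1", "rack1_hostA"), ("c2", "rack1_hostA"),  # same physical host
    ("c3", "rack1_hostB"), ("c4", "rack2_hostC"),
]
print(place_replicas(nodes, 3))  # ['c1', 'c3', 'c4'] - c2 skipped
```

Because c1 and c2 share the rack ID rack1_hostA, the loop skips c2, which is exactly the "no two replicas on one host" behavior the rack_host naming is meant to buy.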
Re: Too Many Open Files (sockets) - VNodes - Map/Reduce Job
(this is probably a better question for the user list - cc/reply-to set) Allow more files to be open :) http://www.datastax.com/documentation/cassandra/1.2/cassandra/install/installRecommendSettings.html -- Kind regards, Michael On 06/04/2014 12:15 PM, Florian Dambrine wrote: Hi everybody, We are running ElasticMapReduce jobs from Amazon on a 25-node Cassandra cluster (with VNodes). Since we increased the size of the cluster, we have been hitting a too-many-open-files (due to sockets) exception when creating the splits. Does anyone have an idea? Thanks, Here is the stacktrace: 14/06/04 03:23:24 INFO mapred.JobClient: Default number of map tasks: null 14/06/04 03:23:24 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 80 14/06/04 03:23:24 INFO mapred.JobClient: Default number of reduce tasks: 26 14/06/04 03:23:25 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache 14/06/04 03:23:25 INFO mapred.JobClient: Setting group to hadoop 14/06/04 03:23:41 ERROR transport.TSocket: Could not configure socket.
java.net.SocketException: Too many open files
	at java.net.Socket.createImpl(Socket.java:447)
	at java.net.Socket.getImpl(Socket.java:510)
	at java.net.Socket.setSoLinger(Socket.java:984)
	at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:118)
	at org.apache.thrift.transport.TSocket.init(TSocket.java:109)
	at org.apache.thrift.transport.TSocket.init(TSocket.java:94)
	at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:39)
	at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:558)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSubSplits(AbstractColumnFamilyInputFormat.java:286)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.access$200(AbstractColumnFamilyInputFormat.java:61)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:236)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:221)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
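A rough back-of-envelope shows why the split phase can blow through a default file-descriptor limit (both figures below are assumptions, not from the post: the default num_tokens of 256 per node, and roughly one Thrift connection per token range while getSubSplits runs):

```python
# Back-of-envelope: sockets opened while computing splits on a
# vnode cluster. num_tokens=256 is the Cassandra default, assumed here.
nodes, vnodes_per_node = 25, 256
default_ulimit = 1024  # typical default "nofile" soft limit

sockets_needed = nodes * vnodes_per_node
print(sockets_needed)                    # 6400
print(sockets_needed > default_ulimit)   # True - limit exceeded
```

Which is why the usual fix is exactly what the DataStax recommended-settings page linked above suggests: raise the nofile limit for the user running the job well past the default.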
Re: High latency on 5 node Cassandra Cluster
That is a pretty old version of Cassandra at this point. If you are using counters anywhere, you are probably seeing https://issues.apache.org/jira/browse/CASSANDRA-4578 which only shows up after you hit some arbitrary traffic threshold. If you don't want to upgrade (you really should), there was an update for the above in the 1.0 branch which was never released: https://github.com/apache/cassandra/blob/cassandra-1.0/CHANGES.txt#L2 -- Nate McCall Austin, TX @zznate Co-Founder & Sr. Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: Customized Compaction Strategy: Dev Questions
Let's say I run a major compaction every day, so that the oldest sstable contains only the data for January 1st. Assuming all the nodes are in sync and have had at least one repair run before the table is dropped (so that all information for that time period is the same), wouldn't it be safe to assume that the same data would be dropped on all nodes? There might be a period when the compaction is running where different nodes might have an inconsistent view of just that day's data (in that some would have it and others would not), but the cluster would still function and become eventually consistent, correct? Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed with it? I wouldn't be concerned with individual rows and columns, and this is a write-only table, more or less--the only deletes that occur in the current system are to delete the old data.
Re: Customized Compaction Strategy: Dev Questions
Maybe I’m misunderstanding something, but what makes you think that running a major compaction every day will cause the data from January 1st to exist in only one SSTable and not have data from other days in the SSTable as well? Are you talking about making a new compaction strategy that creates SSTables by day?
Re: Customized Compaction Strategy: Dev Questions
Sorry, yes, that is what I was looking to do--i.e., create a TopologicalCompactionStrategy or similar.
Re: Customized Compaction Strategy: Dev Questions
I'd suggest creating 1 table per day, and dropping the tables you don't need once you're done. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
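The table-per-day approach can be sketched as follows (the audit_YYYYMMDD naming scheme is hypothetical, not from the thread): keep a rolling window of daily audit tables and DROP whole tables as they age out, which reclaims disk immediately and without tombstones.

```python
# Sketch of a rolling window of per-day tables (hypothetical naming).
import datetime

def daily_table(day):
    return "audit_%s" % day.strftime("%Y%m%d")

def tables_to_drop(existing, today, keep_days=60):
    """Return table names older than the retention window.
    Names sort lexicographically in date order, so < works."""
    cutoff = today - datetime.timedelta(days=keep_days)
    return [t for t in existing if t < daily_table(cutoff)]

today = datetime.date(2014, 6, 4)
existing = [daily_table(today - datetime.timedelta(days=d))
            for d in range(90)]
print(len(tables_to_drop(existing, today)))  # 29 tables past the window
```

The dropped names would then be fed to DROP TABLE statements by whatever scheduler runs the cleanup.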
Re: Customized Compaction Strategy: Dev Questions
That still involves quite a bit of infrastructure work--it also means that to query the data, I would have to make N queries, one per table, to query for audit information (audit information is sorted by a key identifying the item, and then the date). I don't think this would yield any benefit (to me) over simply tombstoning the values or creating a secondary index on date and simply doing a DELETE, right? Is there something internally preventing me from implementing this as a separate Strategy? On Wed, Jun 4, 2014 at 10:47 AM, Jonathan Haddad j...@jonhaddad.com wrote: I'd suggest creating 1 table per day, and dropping the tables you don't need once you're done. On Wed, Jun 4, 2014 at 10:44 AM, Redmumba redmu...@gmail.com wrote: Sorry, yes, that is what I was looking to do--i.e., create a TopologicalCompactionStrategy or similar. On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry rbradbe...@gmail.com wrote: Maybe I’m misunderstanding something, but what makes you think that running a major compaction every day will cause they data from January 1st to exist in only one SSTable and not have data from other days in the SSTable as well? Are you talking about making a new compaction strategy that creates SSTables by day? On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote: Let's say I run a major compaction every day, so that the oldest sstable contains only the data for January 1st. Assuming all the nodes are in-sync and have had at least one repair run before the table is dropped (so that all information for that time period is the same), wouldn't it be safe to assume that the same data would be dropped on all nodes? There might be a period when the compaction is running where different nodes might have an inconsistent view of just that days' data (in that some would have it and others would not), but the cluster would still function and become eventually consistent, correct? 
Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed with it? I wouldn't be concerned with individual rows and columns, and this is a write-only table, more or less--the only deletes that occur in the current system are to delete the old data. On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry rbradbe...@gmail.com wrote: I’m not sure what you want to do is feasible. At a high level I can see you running into issues with RF etc. The SSTables node to node are not identical, so if you drop a full SSTable on one node there is no corresponding SSTable on the adjacent nodes to drop. You would need to choose data to compact out, and ensure it is removed on all replicas as well. But if your problem is that you’re low on disk space then you probably won’t be able to write out a new SSTable with the older information compacted out. Also, there is more to an SSTable than just data; the SSTable could have tombstones and other relics that haven’t been cleaned up from nodes coming or going. On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote: Thanks, Russell--yes, a similar concept, just applied to sstables. I'm assuming this would require changes to both major compactions, and probably GC (to remove the old tables), but since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible with the current toolset before I actually dived in and started tinkering. Andrew On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry rbradbe...@gmail.com wrote: hmm, I see. So something similar to Capped Collections in MongoDB. On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote: Not quite; if I'm at, say, 90% disk usage, I'd like to drop the oldest sstable rather than simply run out of space. The problem with using TTLs is that I have to try and guess how much data is being put in--since this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing, etc. 
I'd like to maximize the disk space--not optimize the cleanup process. Andrew On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry rbradbe...@gmail.com wrote: You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ? On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote: Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process. Since we're using Cassandra as an auditing system, this is particularly appealing to us because it means we can maximize the amount of auditing data we can keep while still allowing Cassandra to clear old data automatically. My idea is this: perform compaction based on the range of dates available in the sstable (or just metadata about when it was created). For example, a major compaction could create a combined sstable per day--so that, say, 60 days of data after a major compaction would contain 60 sstables.
Re: Customized Compaction Strategy: Dev Questions
Well, DELETE will not free up disk space until after GC grace has passed and the next major compaction has run. So in essence, if you need to free up space right away, then creating daily/monthly tables would be one way to go. Just remember to clear your snapshots after dropping though. On June 4, 2014 at 1:54:05 PM, Redmumba (redmu...@gmail.com) wrote: That still involves quite a bit of infrastructure work--it also means that to query the data, I would have to make N queries, one per table, to query for audit information (audit information is sorted by a key identifying the item, and then the date). I don't think this would yield any benefit (to me) over simply tombstoning the values or creating a secondary index on date and simply doing a DELETE, right? Is there something internally preventing me from implementing this as a separate Strategy?
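The table-per-day approach discussed in this thread is mostly bookkeeping outside Cassandra: create date-suffixed tables as you go and drop the ones past retention. A minimal sketch of that bookkeeping in Python (the audit_YYYYMMDD naming scheme is an assumption for illustration, not an established convention):

```python
from datetime import date, timedelta

def daily_table_name(d, prefix="audit"):
    """Date-suffixed table name, e.g. audit_20140101 (naming is illustrative)."""
    return "%s_%s" % (prefix, d.strftime("%Y%m%d"))

def tables_to_drop(existing, today, retention_days, prefix="audit"):
    """Return DROP TABLE statements for tables older than the retention
    window. Dropping a whole table reclaims disk right away (once the
    snapshots are cleared), unlike DELETE, which waits for gc_grace plus
    compaction."""
    keep = {daily_table_name(today - timedelta(days=i), prefix)
            for i in range(retention_days)}
    return ["DROP TABLE %s;" % t for t in sorted(existing) if t not in keep]
```

With four daily tables and a three-day retention window, only the oldest table is scheduled for dropping.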
Re: High latency on 5 node Cassandra Cluster
On Wed, Jun 4, 2014 at 12:12 AM, Arup Chakrabarti a...@pagerduty.com wrote: Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont) Replication Factor: 5 You're operating with a single-DC replication strategy across multiple data centers? If so, I'm surprised you ever get sane latency. (Or do you mean RF 2,2,1 across three DCs?) I agree with others that the version of Cassandra you are running has known Gossip problems that can cause cluster-wide outages. As a general piece of feedback, I suggest an upgrade: first to 1.1 HEAD, then to 1.2.16. =Rob
Re: New node Unable to gossip with any seeds
This generally means that the seed node's address, as listed in the seeds entry of the second node's cassandra.yaml, doesn't exactly match the address the seed node is actually using. CASSANDRA-6523 has some links that might be helpful. On 05/26/2014 12:07 AM, Tim Dunphy wrote: Hello, I am trying to spin up a new node using cassandra 2.0.7. Both nodes are at Digital Ocean. The seed node is up and running and I can telnet to port 7000 on that host from the node I'm trying to start.

[root@cassandra02 apache-cassandra-2.0.7]# telnet 10.10.1.94 7000
Trying 10.10.1.94...
Connected to 10.10.1.94.
Escape character is '^]'.

But when I start cassandra on the new node I see the following exception:

INFO 00:01:34,744 Handshaking version with /10.10.1.94
ERROR 00:02:05,733 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
Exception encountered during startup: Unable to gossip with any seeds
ERROR 00:02:05,742 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1270)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:573)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:745)

I'm using the murmur3 partitioner on both nodes and I have the seed node's IP listed in the cassandra.yaml of the new node. I'm just wondering what the issue might be and how I can get around it. Thanks Tim
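One quick sanity check for this class of failure is to compare the address the seed node is actually bound to against the entries in the new node's seeds list: they must match verbatim (a hostname in one place will not match an IP in the other). A rough sketch of that check in Python (the yaml handling is simplified; a real cassandra.yaml nests the list under seed_provider):

```python
def seeds_match(seed_listen_address, seeds_entry):
    """Return True if the address the seed node is bound to appears
    verbatim in the new node's comma-separated seeds entry. The comparison
    is literal, so 'cassandra01.example.com' will not match '10.10.1.94'
    even if DNS resolves them to the same host."""
    configured = [s.strip() for s in seeds_entry.split(",") if s.strip()]
    return seed_listen_address in configured
```

If this returns False for your pair of nodes, fixing the seeds entry to use the exact address the seed binds to is the first thing to try.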
Re: alternative vnode upgrade strategy?
On 05/28/2014 02:18 PM, William Oberman wrote: 1.) Upgrade all N nodes to vnodes in place Start loop 2.) Boot a new node and let it bootstrap 3.) Decommission an old node End loop It's been a while since I had to think about the vnode migration, but I think this would fall prey to https://issues.apache.org/jira/browse/CASSANDRA-5525
Re: Number of rows under one partition key
On Wed, Jun 4, 2014 at 12:39 PM, Chris Burroughs chris.burrou...@gmail.com wrote: https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ Although by the simplistic version-count heuristic, the sheer quantity of releases in the 2.0.x line would now satisfy the constraint. Yes, I was specific about the 2.0 line instead of pasting that post because 2.0 has shown itself to be slightly worse than the average major release. To answer Paulo's question, it is these serious-class bugs in the 2.0 line. I have yet to hear of a point release of 2.0.x which does not contain bugs I consider prohibitive for production use, though I have high hopes for 2.0.9. =Rob
Snapshot the data with 3 node and replicationfactor=3
Is there any reason you would want to take a snapshot of a column family on each node when the cluster consists of 3 nodes with a keyspace at replication factor 3? I am thinking of taking a snapshot of the CF on only one node. For restore, I will follow the steps below:
1. drop and recreate the CF on node1
2. copy the snapshotted files to node1's data directory for the CF
3. perform nodetool refresh on node1
Any suggestions/advice? ng
Re: problem removing dead node from ring
On Tue, Jun 3, 2014 at 9:03 PM, Matthew Allen matthew.j.al...@gmail.com wrote: Thanks Robert, this makes perfect sense. Do you know if CASSANDRA-6961 will be ported to 1.2.x ? I just asked driftx, he said not gonna happen. And apologies if these appear to be dumb questions, but is a repair more suitable than a rebuild because the rebuild only contacts 1 replica (per range), which may itself contain stale data ? Exactly that. https://issues.apache.org/jira/browse/CASSANDRA-2434 Discusses related issues in quite some detail. The tl;dr is that until 2434 is resolved, streams do not necessarily come from the node departing the range, and therefore the unique replica count is decreased by changing cluster topology. =Rob
Re: Snapshot the data with 3 node and replicationfactor=3
On Wed, Jun 4, 2014 at 1:26 PM, ng pipeli...@gmail.com wrote: Is there any reason you would like to take snapshot of column family on each node when cluster consists of 3 nodes with keyspace on replication factor =3? Unless all read/write occurs with CL.ALL (which is an availability problem), there is a nonzero chance of any given write not being on any given node at any given time. =Rob
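Rob's point can be made concrete. With RF=3 and QUORUM (2-of-3) writes, a write is only guaranteed to reach 2 replicas, so a snapshot taken on any single node can be missing some acknowledged writes until repair or read repair catches it up. A toy enumeration, purely illustrative:

```python
from itertools import combinations

def fraction_missing_on_one_node(rf=3, quorum=2):
    """Enumerate every replica set a quorum write could land on, and
    compute the fraction of those outcomes in which a fixed node
    (node 0) does not hold the write."""
    outcomes = list(combinations(range(rf), quorum))
    missing = sum(1 for o in outcomes if 0 not in o)
    return missing / len(outcomes)
```

For RF=3 with quorum 2 this comes out to 1/3: a third of the possible write placements skip any given node, which is why a one-node snapshot is only an approximation of the data. Only CL.ALL (quorum equal to RF) drives this to zero.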
Re: Snapshot the data with 3 node and replicationfactor=3
I am not worried about eventually consistent data; I just want a rough copy of the data, a close approximation. ng On Wed, Jun 4, 2014 at 2:49 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jun 4, 2014 at 1:26 PM, ng pipeli...@gmail.com wrote: Is there any reason you would like to take snapshot of column family on each node when cluster consists of 3 nodes with keyspace on replication factor =3? Unless all read/write occurs with CL.ALL (which is an availability problem), there is a nonzero chance of any given write not being on any given node at any given time. =Rob
nodetool move seems slow
Hello, We have a 5-node cluster running cassandra 1.2.16, with a significant amount of data:

Address        Rack   Status  State   Load     Owns     Token
                                                        6783174585269344219
10.198.xx.xx1  rack1  Up      Normal  2.59 TB  60.00%   -9223372036854775808
10.198.xx.xx2  rack1  Up      Normal  1.49 TB  40.00%   -5534023222112865485
10.198.xx.xx3  rack1  Up      Normal  2.18 TB  53.23%   -1844674407370955162
10.198.xx.xx4  rack1  Up      Normal  2.86 TB  80.00%   5534023222112865484
10.198.xx.xx5  rack1  Up      Moving  2.32 TB  66.77%   6783174585269344219

The first three nodes (.xx1 - .xx3 above) were at the desired tokens, so I issued a move on .xx4: nodetool move 1844674407370955161 That was about 40hrs ago! When I do nodetool netstats, I do see apparent progress:

jatyler@xx4:~$ nodetool netstats
Mode: MOVING
Not sending any streams.
Streaming from: /10.198.xx.xx2
SyncCore: /var/cassandra/data/SyncCore/file-ic-31475-Data.db sections=1 progress=0/77699597 - 0%
…
SyncCore: /var/cassandra/data/SyncCore/anotherFile-ic-32252-Data.db sections=1 progress=0/1254063427 - 0%
Read Repair Statistics:
Attempted: 8047367
Mismatch (Blocking): 97327
Mismatch (Background): 74369
Pool Name   Active  Pending  Completed
Commands    n/a     0        472255111
Responses   n/a     1        749751322

I wrote 'apparent progress' because it reports “MOVING” and the Pending Commands/Responses are changing over time. However, I haven’t seen the individual .db files' progress go above 0%. Meanwhile, the system appears to have plenty of unused bandwidth, from 'iostat -x -m 1':

Device:  rrqm/s  wrqm/s  r/s      w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    56.00   1338.00  171.00  57.59  0.89   79.36     0.57      0.38   0.17   25.30
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          22.77  1.82   2.35     0.20     0.00    72.86
Device:  rrqm/s  wrqm/s  r/s      w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.00    785.00   0.00    33.80   0.00   88.17     0.27      0.35   0.18   14.10
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          20.16  2.05   2.22     0.20     0.00    75.37

Is 40 hours too long for this move? Should I be seeing individual .db files report more progress? 
Should I start over with the first box (even though its token appears correct)? Any thoughts would be greatly appreciated. Thanks! Cheers, ~Jason
Re: Consolidating records and TTL
Just use an atomic batch that holds both the insert and the deletes: http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2 On Tue, Jun 3, 2014 at 2:13 PM, Charlie Mason charlie@gmail.com wrote: Hi All. I have a system that's going to make possibly several concurrent changes to a running total. I know I could use a counter for this. However, I have extra metadata I can store with the changes which would allow me to replay them. If I use a counter and it loses some writes, I can't recover: I will only have its current total, not the extra metadata to know where to replay from. What I was planning to do was write each change of the value to a CQL table with a TimeUUID as a row-level primary key as well as a partition key. Then when I need to read the running total back, I will do a query for all the changes and add them up to get the total. As there could be tens of thousands of these, I want to have a period after which they are consolidated. Most totals won't be anywhere near that size, but a few will, and I need to be able to support them. So I was also going to have a consolidated-total table which holds the UUID of the value consolidated up to. Since I can bound the query for the recent updates by that UUID, I should be able to avoid all the tombstones. So if the read encounters any changes that can be consolidated, it inserts a new consolidated value and deletes the newly consolidated changes. What I am slightly worried about is what happens if the consolidated-value insert fails but the deletes of the change records succeed. I would be left with an inconsistent total indefinitely. I have come up with a couple of ideas: 1, I could require all nodes to acknowledge the insert before deleting the difference records. 2, Maybe I could have another period after a change is consolidated but before it is deleted? 3, Is there any way I could use a TTL to allow it to be deleted after a period of time? Chances are another read would come in and fix the value. 
Anyone got any other suggestions on how I could implement this? Thanks, Charlie M -- Tyler Hobbs DataStax http://datastax.com/
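The consolidate-then-bound-by-UUID scheme Charlie describes can be sketched in plain Python. This is a model only: TimeUUIDs are replaced with integers for clarity, and the atomic (logged) batch Tyler recommends is modeled as a single function that applies the insert and the deletes together, so a partial failure cannot lose deltas.

```python
def read_total(consolidated, changes):
    """Running total = last consolidated value plus every change recorded
    after its marker (the query is bounded by the marker, skipping
    already-deleted, tombstoned changes)."""
    marker, base = consolidated
    return base + sum(delta for seq, delta in changes if seq > marker)

def consolidate(consolidated, changes):
    """Fold pending changes into the consolidated row and delete them.
    In Cassandra the insert and the deletes would go into one logged
    batch so they either all apply or none do."""
    marker, base = consolidated
    pending = [(seq, d) for seq, d in changes if seq > marker]
    if pending:
        new_marker = max(seq for seq, _ in pending)
        consolidated = (new_marker, base + sum(d for _, d in pending))
        changes = [(seq, d) for seq, d in changes if seq <= marker]  # deleted
    return consolidated, changes
```

Reads before and after consolidation return the same total, and changes arriving after the new marker are still picked up.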
Re: nodetool move seems slow
On Wed, Jun 4, 2014 at 2:34 PM, Jason Tyler jaty...@yahoo-inc.com wrote: I wrote 'apparent progress' because it reports “MOVING” and the Pending Commands/Responses are changing over time. However, I haven’t seen the individual .db files progress go above 0%. Your move is hung. Restart the affected nodes [1] and then restart the move. =Rob [1] https://issues.apache.org/jira/browse/CASSANDRA-3486
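One way to tell a hung stream from a merely slow one (assuming the netstats output format shown in Jason's message) is to capture the per-file progress counters twice, some minutes apart, and check whether any byte count moved. A rough sketch:

```python
import re

PROGRESS = re.compile(r"progress=(\d+)/(\d+)")

def stream_progress(netstats_output):
    """Extract (bytes_done, bytes_total) for each streamed-file line."""
    return [tuple(int(x) for x in m.groups())
            for m in PROGRESS.finditer(netstats_output)]

def looks_hung(sample_a, sample_b):
    """Given two nodetool netstats samples taken some time apart: if no
    file's byte count advanced between them and streams remain
    incomplete, the move is likely hung rather than slow."""
    a, b = stream_progress(sample_a), stream_progress(sample_b)
    return a == b and any(done < total for done, total in b)
```

Forty hours of every file sitting at 0% would satisfy this check, which matches Rob's diagnosis of a hung move.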
Re: migration to a new model
BTW you might want to put a LIMIT clause on your SELECT for testing. -ml On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael michael.la...@nytimes.com wrote: Marcelo, Here is a link to the preview of the python fast copy program: https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47 It will copy a table from one cluster to another with some transformation; the source and destination can be the same cluster. It has 3 main throttles to experiment with: 1. fetch_size: size of source pages in rows 2. worker_count: number of worker subprocesses 3. concurrency: number of async callback chains per worker subprocess It is easy to overrun Cassandra and the python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10. Additionally there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. retry is useful if you are getting timeouts. For the others, YMMV. To use it you need to define the SELECT and UPDATE cql statements as well as the 'map_fields' method. The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster and the driver sets up connection pools to the nodes in the cluster. Anyway, there are a lot of processes, threads, and callbacks going at once, so it is fun to watch. On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit; each row is about 6kb. ml On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael michael.la...@nytimes.com wrote: OK Marcelo, I'll work on it today. -ml On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Michael, For sure I would be interested in this program! I am new both to python and to cql. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work well. 
I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor Just in case it's useful for anything. However, I saw CQL has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful. My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSDs and 64 GB RAM. I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed, if that would help. Regards Marcelo. 2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com: Hi Marcelo, I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this? With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable-size rows averaging 5kb. This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver. ml On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Jens, Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate the data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but no success. I only have two nodes and less than 200Gb in this cluster; any simple way to extract the data quickly would be good enough for me. Best regards, Marcelo. 
2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se: Hi Marcelo, Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql Cheers, Jens On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi, I have some cql CFs in a 2 node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition. Instead of: CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)) WITH caching=all; I used: CREATE TABLE IF NOT EXISTS entitylookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY (name, value, entity_id)) WITH caching=all; Now I need to migrate the data from the second CF to the first one. I am using Data Stax Community Edition. What would be the best way to convert data from one CF to the other? Best regards, Marcelo.
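The worker partitioning Michael describes, where subprocesses divide up the token range among themselves, can be sketched as follows. This assumes the Murmur3 partitioner's full signed 64-bit token range; the splitting arithmetic is an illustration, not the gist's actual code:

```python
MIN_TOKEN = -2**63       # Murmur3 partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3 partitioner maximum token

def split_token_range(worker_count, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Divide [lo, hi] into worker_count contiguous, non-overlapping
    sub-ranges so each worker can page through its share of the table
    independently (e.g. WHERE token(pk) >= start AND token(pk) <= end)."""
    span = hi - lo + 1
    ranges, start = [], lo
    for i in range(1, worker_count + 1):
        end = lo + span * i // worker_count - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

The sub-ranges cover the full token space exactly once, so the workers collectively copy every row without coordinating with each other.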