High latency on 5 node Cassandra Cluster
Hello. We had some major latency problems yesterday with our 5-node Cassandra cluster. I wanted to get some feedback on where we could start looking to figure out what caused the issue. If there is more info I should provide, please let me know. Here are the basics of the cluster:

Clients: Hector and Cassie
Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
Replication factor: 5
Quorum reads and writes enabled
Read repair set to true
Cassandra version: 1.0.12

We started experiencing catastrophic latency from our app servers. We believed at the time this was due to compactions running and the clients not re-routing appropriately, so we disabled Thrift on a single node that had high load. This did not resolve the issue. After that, we stopped gossip on the same node; again, this did not resolve anything. We then took down gossip on another node (leaving 3/5 up) and that fixed the latency from the application side.

For a period of ~4 hours, every time we tried to bring up a fourth node, the app would see the latency again. We then rotated which three nodes were up, to make sure it was not a networking event related to a single region/provider, and we kept seeing the same problem: 3 nodes showed no latency problem, 4 or 5 nodes did. After the ~4 hours, we brought the cluster up to 5 nodes and everything was fine.

We currently have some ideas on what caused this behavior, but has anyone else seen this type of problem, where a full cluster causes problems but removing nodes fixes it? Any input on what to look for in our logs to understand the issue?

Thanks
Arup
memtable mem usage off by 10?
Hi,

I'm seeing some strange behavior of the memtables, both in 1.2.13 and 2.0.7: basically it looks like they're using 10x less memory than they should based on the documentation and options. 10 GB heap for both clusters.

1.2.x should use 1/3 of the heap for memtables, but it uses a max of ~300 MB before flushing.
2.0.7: same, but 1/4 of the heap and ~250 MB.

In the 2.0.7 cluster I set memtable_total_space_in_mb to 4096, which then allowed Cassandra to use up to ~400 MB for memtables. I'm now running with 20480 for memtable_total_space_in_mb and Cassandra is using ~2 GB for memtables. So, off by 10 somewhere? Has anyone else seen this? I can't find a JIRA issue for any bug connected to this.

Java 1.7.0_55, JNA 4.1.0 (for the 2.0 cluster)

BR
Johan
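The mismatch described above can be sanity-checked with quick arithmetic. This is only a sketch using the figures from the message (10 GB heap, documented fractions of 1/3 and 1/4, observed flush sizes of ~300 MB and ~250 MB), not a diagnosis:

```python
# Rough arithmetic behind the "off by 10" observation. All figures are
# taken from the message above; nothing here is measured independently.

HEAP_MB = 10 * 1024  # 10 GB heap on both clusters

expected_12 = HEAP_MB / 3  # 1.2.x docs: memtables may use 1/3 of heap
expected_20 = HEAP_MB / 4  # 2.0.x docs: memtables may use 1/4 of heap

observed_12 = 300  # MB reported before flush on 1.2.13
observed_20 = 250  # MB reported before flush on 2.0.7

print(expected_12 / observed_12)  # ~11x gap
print(expected_20 / observed_20)  # ~10x gap
```

Both versions come out roughly a factor of ten below the documented limit, which is the puzzle the rest of the thread addresses.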
Re: memtable mem usage off by 10?
If you are storing small values in your columns, the object overhead is very substantial. So what is 400 MB on disk may well be 4 GB in memtables; if you are measuring the memtable size by the resulting sstable size, you are not getting an accurate picture.

This overhead has been reduced by about 90% in the upcoming 2.1 release, through tickets CASSANDRA-6271 (https://issues.apache.org/jira/browse/CASSANDRA-6271), CASSANDRA-6689 (https://issues.apache.org/jira/browse/CASSANDRA-6689) and CASSANDRA-6694 (https://issues.apache.org/jira/browse/CASSANDRA-6694).
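The point above can be sketched as arithmetic. The ~10x factor below is an assumption implied by this thread for this particular workload; real per-cell overhead varies with schema and cell size:

```python
# Sketch: with small cells, JVM object overhead dominates, so the
# user-data size reported for a memtable understates its heap footprint.
# OVERHEAD_FACTOR is an assumption taken from this thread, not a constant.

OVERHEAD_FACTOR = 10

def estimated_heap_mb(user_data_mb, overhead=OVERHEAD_FACTOR):
    # Approximate heap footprint of a memtable holding user_data_mb of cell data.
    return user_data_mb * overhead

print(estimated_heap_mb(400))  # 4000 -> ~4 GB of heap for 400 MB of user data
```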
RE: memtable mem usage off by 10?
I'm not measuring memtable size by looking at the sstables on disk, no. I'm looking at the JMX data, so I would believe (or hope) that I'm getting relevant data.

If I have a heap of 10 GB and set the memtable usage to 20 GB, I would expect to hit other problems, but I'm not seeing memory usage over 10 GB for the heap, and the machine (which has ~30 GB of memory) is showing ~10 GB free, with ~12 GB used by Cassandra and the rest in caches. We're reading 8k rows/s and writing 2k rows/s on a 3-node cluster, so it's not idling.

BR
Johan
Re: memtable mem usage off by 10?
These measurements tell you the amount of user data stored in the memtables, not the amount of heap used to store it, so the same applies.
Re: High latency on 5 node Cassandra Cluster
I would first check to see if there was a time synchronization issue among the nodes that triggered and/or perpetuated the event.

ml
RE: memtable mem usage off by 10?
Aha, ok. Thanks.

Trying to understand what my cluster is doing: cassandra.db.memtable_data_size only gets me the actual data, not the memtable heap memory usage. Is there a way to check heap memory usage?

I would expect to hit the flush_largest_memtables_at value, and that would be what causes the memtable flush to sstable then? By default 0.75? Then I would expect the maximum memory used to be ~3x what I was seeing before I set memtable_total_space_in_mb (1/4 of the heap by default, max 3/4 before a flush), instead of close to 10x (250 MB vs 2 GB). This is of course assuming that the overhead scales linearly with the amount of data in my table; we're using one table with three cells in this case. If it hardly increases at all, then I'll give up, I guess :) At least until 2.1.0 comes out and I can compare.

BR
Johan
Re: memtable mem usage off by 10?
Unfortunately it looks like the heap utilisation of memtables was not exposed in earlier versions, because they only maintained an estimate. The overhead scales linearly with the amount of data in your memtables (assuming the size of each cell is approximately constant).

flush_largest_memtables_at is a setting independent of memtable_total_space_in_mb, and generally has little effect. Ordinarily sstable flushes are triggered by hitting the memtable_total_space_in_mb limit. I'm afraid I don't follow where your 3x comes from?
RE: memtable mem usage off by 10?
Ok, so the overhead is a constant modifier, right. The 3x I arrived at with the following assumptions:

Heap is 10 GB.
Default memory for memtable usage is 1/4 of the heap in C* 2.0, so max memory used for memtables is 2.5 GB (10/4).
flush_largest_memtables_at is 0.75, so the largest memtables are flushed when memtables use 7.5 GB (3/4 of the heap, 3x the default).

With an overhead of 10x, it makes sense that my memtable is flushed when the JMX data says it is at ~250 MB, i.e. 2.5 GB, i.e. 1/4 of the heap. After I've set memtable_total_space_in_mb to a value larger than 7.5 GB, usage should still not go over 7.5 GB on account of flush_largest_memtables_at (3/4 of the heap), so I would expect to see memtables flushed to disk when they're reportedly at around 750 MB.

But with memtable_total_space_in_mb set to 20480, memtables are flushed at a reported value of ~2 GB. With a constant overhead, this would mean they used 20 GB, which is 2x the size of the heap, instead of 3/4 of the heap as it should be if flush_largest_memtables_at were being respected. This shouldn't be possible.
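The reasoning above can be laid out as arithmetic. Note the caveats: the 0.75 flush ceiling assumed here turns out not to exist in C* 2.0 (see the follow-up in this thread), and the constant ~10x overhead is itself an assumption, so this spells out the argument being examined rather than established facts:

```python
# Johan's reasoning, spelled out. All figures come from his message;
# the flush_largest_memtables_at ceiling assumed below does not actually
# exist in C* 2.0 (per the follow-up), so this is the argument, not fact.

heap_mb = 10 * 1024
overhead = 10                         # assumed constant overhead factor
default_limit_mb = heap_mb / 4        # 2560 MB: C* 2.0 default (1/4 of heap)
assumed_ceiling_mb = heap_mb * 0.75   # 7680 MB: the assumed 0.75 ceiling

# With ~10x overhead, hitting the 2.5 GB default limit corresponds to a
# reported (JMX) size of ~256 MB, matching the observed ~250 MB flushes:
print(default_limit_mb / overhead)    # 256.0

# But raising memtable_total_space_in_mb to 20480 and seeing flushes at a
# reported ~2 GB implies ~20 GB of actual heap use -- twice the heap:
implied_heap_mb = 2 * 1024 * overhead
print(implied_heap_mb > heap_mb)      # True: the contradiction
```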
Re: memtable mem usage off by 10?
I'm confused: there is no flush_largest_memtables_at property in C* 2.0?
RE: memtable mem usage off by 10?
Oh, well ok, that explains why I'm not seeing a flush at 750 MB. Sorry, I was going by the documentation; it claims that the property is around in 2.0.

If we skip that, part of my reply still makes sense: having memtable_total_space_in_mb set to 20480, memtables are flushed at a reported value of ~2 GB. With a constant overhead of ~10x, as suggested, this would mean that they used 20 GB, which is 2x the size of the heap. That shouldn't work. According to the OS, Cassandra doesn't use more than ~11-12 GB.
Re: Multi-DC Environment Question
Hello Matt,

nodetool status:

Datacenter: MAN
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.103  89.34 KB   99.2%             b7f8bc93-bf39-475c-a251-8fbe2c7f7239  -9211685935328163899  RAC1
UN  10.2.1.102  86.32 KB   0.7%              1f8937e1-9ecb-4e59-896e-6d6ac42dc16d  -3511707179720619260  RAC1

Datacenter: DER
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.101  75.43 KB   0.2%              e71c7ee7-d852-4819-81c0-e993ca87dd5c  -1277931707251349874  RAC1
UN  10.2.1.100  104.53 KB  99.8%             7333b664-ce2d-40cf-986f-d4b4d4023726  -9204412570946850701  RAC1

I do not know why the cluster is not balanced at the moment, but it holds almost no data. I will populate it soon and see how that goes. The output of 'nodetool ring' just lists all the tokens assigned to each individual node, and as you can imagine it would be pointless to paste it here. I just did 'nodetool ring | awk ... | uniq | wc -l' and it works out to be 1024 as expected (4 nodes x 256 tokens each). Still have not got the answers to the other questions though...

Thanks,
Vasilis

On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com wrote:
Thanks Vasileios. I think I need to make a call as to whether to switch to vnodes or stick with tokens for my multi-DC cluster. Would you be able to show a nodetool ring/status from your cluster to see what the token assignment looks like?
Thanks
Matt

On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:
I should have said that earlier really... I am using 1.2.16 and vnodes are enabled.
Thanks,
Vasilis

--
Kind Regards,
Vasileios Vlachos
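The token-count sanity check in the message above amounts to the following (assuming num_tokens is at its 256 default on all four nodes, as the message implies):

```python
# Sanity check from the message: with vnodes, each node owns num_tokens
# tokens, so `nodetool ring` should list nodes * num_tokens unique tokens.
nodes = 4
num_tokens = 256  # per-node vnode count; assumed from the message

print(nodes * num_tokens)  # 1024, matching the `... | uniq | wc -l` result
```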
Re: memtable mem usage off by 10?
Yeah, it is in the doc: http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html And I don’t find a Jira issue mentioning it being removed, so... what’s the full story there?! -- Jack Krupansky

From: Idrén, Johan
Sent: Wednesday, June 4, 2014 8:26 AM
To: user@cassandra.apache.org
Subject: RE: memtable mem usage off by 10?

Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0. If we skip that, part of my reply still makes sense: Having memtable_total_size_in_mb set to 20480, memtables are flushed at a reported value of ~2GB. With a constant overhead of ~10x, as suggested, this would mean that it used 20GB, which is 2x the size of the heap. That shouldn't work. According to the OS, cassandra doesn't use more than ~11-12GB.

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 2:07 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

I'm confused: there is no flush_largest_memtables_at property in C* 2.0?

On 4 June 2014 12:55, Idrén, Johan johan.id...@dice.se wrote:

Ok, so the overhead is a constant modifier, right. The 3x I arrived at with the following assumptions:
heap is 10GB
default memory for memtable usage is 1/4 of heap in C* 2.0
max memory used for memtables is 2,5GB (10/4)
flush_largest_memtables_at is 0.75
flush largest memtables when memtables use 7,5GB (3/4 of heap, 3x of the default)

With an overhead of 10x, it makes sense that my memtable is flushed when the JMX data says it is at ~250MB, i.e. 2,5GB, i.e. 1/4 of the heap. After I've set memtable_total_size_in_mb to a value larger than 7,5GB, it should still not go over 7,5GB on account of flush_largest_memtables_at, 3/4 of the heap. So I would expect to see memtables flushed to disk when they're reportedly at around 750MB. Having memtable_total_size_in_mb set to 20480, memtables are flushed at a reported value of ~2GB. With a constant overhead, this would mean that it used 20GB, which is 2x the size of the heap, instead of 3/4 of the heap as it should be if flush_largest_memtables_at was being respected. This shouldn't be possible.

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 1:19 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

Unfortunately it looks like the heap utilisation of memtables was not exposed in earlier versions, because they only maintained an estimate. The overhead scales linearly with the amount of data in your memtables (assuming the size of each cell is approx. constant). flush_largest_memtables_at is an independent setting to memtable_total_space_in_mb, and generally has little effect. Ordinarily sstable flushes are triggered by hitting the memtable_total_space_in_mb limit. I'm afraid I don't follow where your 3x comes from?

On 4 June 2014 12:04, Idrén, Johan johan.id...@dice.se wrote:

Aha, ok. Thanks. Trying to understand what my cluster is doing: cassandra.db.memtable_data_size only gets me the actual data but not the memtable heap memory usage. Is there a way to check for heap memory usage? I would expect to hit the flush_largest_memtables_at value, and this would be what causes the memtable flush to sstable then? By default 0.75? Then I would expect the amount of memory used to be a maximum of ~3x what I was seeing when I hadn't set memtable_total_space_in_mb (1/4 by default, max 3/4 before a flush), instead of close to 10x (250MB vs 2GB). This is of course assuming that the overhead scales linearly with the amount of data in my table; we're using one table with three cells in this case. If it hardly increases at all, then I'll give up I guess :) At least until 2.1.0 comes out and I can compare.

BR Johan

From: Benedict Elliott Smith belliottsm...@datastax.com
Sent: Wednesday, June 4, 2014 12:33 PM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

These measurements tell you the amount of user data stored in the memtables, not the amount of heap used to store it, so the same applies.

On 4 June 2014 11:04, Idrén, Johan johan.id...@dice.se wrote: I'm not measuring memtable size by looking at the sstables on disk, no. I'm looking through the JMX data. So I would believe (or hope) that I'm getting relevant data. If I have a heap of 10GB and set the memtable usage to 20GB, I would expect to hit other problems, but I'm not seeing
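Johan's numbers can be sanity-checked with a quick back-of-the-envelope script. This is a sketch only: the figures come from the messages in this thread, and the 10x factor is the liveRatio estimate that, as noted later in the thread, Cassandra falls back to when it cannot measure live memory.

```python
# Back-of-the-envelope check of the flush points reported in this thread.
# All figures are taken from the emails; nothing here calls Cassandra.

HEAP_MB = 10 * 1024        # 10GB heap
DEFAULT_FRACTION = 1 / 4   # C* 2.0 default: 1/4 of heap for memtables
LIVE_RATIO = 10            # fallback liveRatio estimate when JAMM is absent

memtable_space_mb = HEAP_MB * DEFAULT_FRACTION            # heap budget: 2560 MB
# JMX reports user data size, i.e. heap usage divided by the liveRatio:
reported_flush_point_mb = memtable_space_mb / LIVE_RATIO  # ~256 MB

print(round(memtable_space_mb), round(reported_flush_point_mb))
```

The same arithmetic with memtable_total_space_in_mb raised to 20480 gives 20480/10 = 2048 MB, matching the ~2GB reported flush point Johan observes.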
Re: memtable mem usage off by 10?
Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0.

But something else is wrong, as Cassandra will crash if you supply an invalid property, implying it's not sourcing the config file you're using. I'm afraid I don't have the context for why it was removed, but it happened as part of the 2.0 release.

On 4 June 2014 13:59, Jack Krupansky j...@basetechnology.com wrote: Yeah, it is in the doc: http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html And I don’t find a Jira issue mentioning it being removed, so... what’s the full story there?! -- Jack Krupansky
Re: memtable mem usage off by 10?
I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.

From: Benedict Elliott Smith belliottsm...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Wednesday 4 June 2014 16:36
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

Oh, well ok that explains why I'm not seeing a flush at 750MB. Sorry, I was going by the documentation. It claims that the property is around in 2.0. But something else is wrong, as Cassandra will crash if you supply an invalid property, implying it's not sourcing the config file you're using. I'm afraid I don't have the context for why it was removed, but it happened as part of the 2.0 release.
Re: memtable mem usage off by 10?
And sorry that the doc confused you as well! -- Jack Krupansky

From: Idrén, Johan
Sent: Wednesday, June 4, 2014 10:51 AM
To: user@cassandra.apache.org
Subject: Re: memtable mem usage off by 10?

I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.
Re: memtable mem usage off by 10?
In that case I would assume the problem is that for some reason JAMM is failing to load, and so the liveRatio it would ordinarily calculate is defaulting to 10 - are you using the bundled cassandra launch scripts?

On 4 June 2014 15:51, Idrén, Johan johan.id...@dice.se wrote: I wasn’t supplying it, I was assuming it was using the default. It does not exist in my config file. Sorry for the confusion.
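Since the question hinges on whether the launch scripts enable the JAMM java agent, a quick way to check is to look for the agent flag in the environment script. A minimal sketch, assuming a cassandra-env.sh-style file; the helper name and the sample line are illustrative, not taken from this thread:

```python
# Sketch: scan cassandra-env.sh text for the JAMM -javaagent flag.
# If the agent is missing, liveRatio falls back to its default of 10.
import re

def has_jamm_agent(env_text: str) -> bool:
    """Return True if any non-comment line enables the jamm java agent."""
    for line in env_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out lines don't count
        if re.search(r"-javaagent:\S*jamm\S*\.jar", stripped):
            return True
    return False

# Hypothetical line in the style of the bundled scripts:
sample = 'JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.2.5.jar"\n'
print(has_jamm_agent(sample))
```

If this returns False for the script actually used to start the node, the 10x discrepancy would be consistent with Benedict's explanation above.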
Re: migration to a new model
OK Marcelo, I'll work on it today. -ml

On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Michael, For sure I would be interested in this program! I am new both to Python and to CQL. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work fine. I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor Just in case it's useful for anything. However, I saw CQL has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful. My two servers today are at OVH (ovh.com); we have servers at AWS but in several cases we prefer other hosts. Both servers have SSD and 64 GB RAM, and I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed if this is going to help. Regards Marcelo.

2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com:

Hi Marcelo, I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this? With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable size rows averaging 5kb. This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver. ml

On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi Jens, Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but no success.
I only have two nodes and less than 200GB in this cluster, so any simple way to extract the data quickly would be good enough for me. Best regards, Marcelo.

2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se:

Hi Marcelo, Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql Cheers, Jens

On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote:

Hi, I have some CQL column families in a 2 node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition key. Instead of:

CREATE TABLE IF NOT EXISTS entity_lookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY ((name, value), entity_id)
) WITH caching=all;

I used:

CREATE TABLE IF NOT EXISTS entitylookup (
  name varchar,
  value varchar,
  entity_id uuid,
  PRIMARY KEY (name, value, entity_id)
) WITH caching=all;

Now I need to migrate the data from the second CF to the first one. I am using DataStax Community Edition. What would be the best way to convert data from one CF to the other? Best regards, Marcelo.
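For what it's worth, the copy itself reduces to re-inserting every row under the corrected schema. A minimal sketch using the table definitions from the thread; `row_to_insert` is a hypothetical helper, and a real migration would page through SELECTs with the DataStax Python driver rather than build literal statements:

```python
# Sketch: format a row fetched from the old table (entitylookup) as a CQL
# INSERT against the corrected table (entity_lookup). In a real migration
# the rows would come from a paged SELECT via the Python driver.

def row_to_insert(name: str, value: str, entity_id: str) -> str:
    """Build one CQL INSERT for the corrected entity_lookup table."""
    esc = lambda s: s.replace("'", "''")  # escape single quotes for CQL
    return (
        "INSERT INTO entity_lookup (name, value, entity_id) "
        f"VALUES ('{esc(name)}', '{esc(value)}', {entity_id});"
    )

stmt = row_to_insert("email", "a@b.c", "550e8400-e29b-41d4-a716-446655440000")
print(stmt)
```

In practice a prepared INSERT plus the driver's concurrent-execution helpers would be faster and safer than string-built statements, which is essentially what Michael's program does.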
Customized Compaction Strategy: Dev Questions
Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process. Since we're using Cassandra as an auditing system, this is particularly appealing to us because it means we can maximize the amount of auditing data we can keep while still allowing Cassandra to clear old data automatically.

My idea is this: perform compaction based on the range of dates available in the sstable (or just metadata about when it was created). For example, a major compaction could create a combined sstable per day--so that, say, 60 days of data after a major compaction would contain 60 sstables.

My question then is, will this be possible by simply implementing a separate AbstractCompactionStrategy? Does this sound feasible at all? Based on the implementation of the Size and Leveled strategies, it looks like I would have the ability to control what and how things get compacted, but I wanted to verify before putting time into it. Thank you so much for your time! Andrew
Re: Customized Compaction Strategy: Dev Questions
You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ?

On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote: Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process.
Cassandra 2.0 unbalanced ring with vnodes after adding new node
Hello to everyone! Please, can someone explain where we made a mistake?

We have a cluster with 4 nodes which uses vnodes (256 per node, default settings); the snitch is the default on every node: SimpleSnitch. These four nodes were there from the beginning of the cluster. In this cluster we have a keyspace with these options:

Keyspace: K:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
  Options: [replication_factor:3]

All was normal, and nodetool status K showed that each node owns 75% of the whole key range. All 4 nodes are located in the same datacenter and have the same first two bytes in their IP addresses (the others differ). Then we bought a new server in a different datacenter and added it to the cluster with the same settings as the previous four nodes (differing only in listen_address), assuming that the effective ownership of each node for this keyspace would be 300/5=60% or thereabouts. But 3-5 minutes after start, nodetool status K shows this:

nodetool status K;
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load     Tokens  Owns (effective)  Host ID                               Rack
UN  N1       6,06 GB  256     50.0%             62f295b3-0da6-4854-a53a-f03d6b424b03  rack1
UN  N2       5,89 GB  256     50.0%             af4e4a23-2610-44dd-9061-09c7a6512a54  rack1
UN  N3       6,02 GB  256     50.0%             0f0e4e78-6fb2-479f-ad76-477006f76795  rack1
UN  N4       5,8 GB   256     50.0%             670344c0-9856-48cf-9ec9-1a98f9a89460  rack1
UN  N5       7,51 GB  256     100.0%            82473d14-9e36-4ae7-86d2-a3e526efb53f  rack1

N5 is the newly added node. nodetool repair -pr on N5 doesn't change anything. nodetool describering K shows that the new node N5 participates in EACH range. This is not what we want at all. It looks like cassandra adds the new node to each range because it is located in a different datacenter, but all settings and output say exactly the opposite.
Also an interesting point: while in all config files the snitch is defined as SimpleSnitch, the output of the command nodetool describecluster is:

Cluster Information:
  Name: Some Cluster Name
  Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
  Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  Schema versions:
    26b8fa37-e666-31ed-aa3b-85be75f2aa1a: [N1, N2, N3, N4, N5]

We use Cassandra 2.0.6. Questions we have at this moment:
1. How to rebalance the ring so all nodes will own 60% of the range?
1a. Is removing the node from the cluster and adding it again a solution?
2. Where did we possibly make a mistake when adding the new node?
3. If we add a new 6th node to the ring, will it take 50% from N5 or some portion from each node?

Thanks in advance! -- С уважением, Владимир Рудев (With regards, Vladimir Rudev) vladimir.ru...@gmail.com
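The expectation in the question follows from simple arithmetic: with SimpleStrategy, each key is replicated to RF nodes, so on a balanced ring each of N nodes effectively owns RF/N of the data. A quick sketch of that calculation:

```python
# Expected per-node effective ownership on a balanced vnode ring with
# SimpleStrategy: each key lives on RF nodes, so ownership sums to RF*100%.

def effective_ownership_pct(replication_factor: int, nodes: int) -> float:
    """Expected per-node effective ownership, in percent."""
    return 100.0 * replication_factor / nodes

print(effective_ownership_pct(3, 4))  # the original 4-node cluster: 75% each
print(effective_ownership_pct(3, 5))  # expected after adding N5: 60% each
```

The observed 50%/100% split instead matches a ring where N5 holds a replica of everything, which is why the describering output above is the key symptom.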
Re: Customized Compaction Strategy: Dev Questions
Not quite; if I'm at say 90% disk usage, I'd like to drop the oldest sstable rather than simply run out of space. The problem with using TTLs is that I have to try and guess how much data is being put in--since this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing, etc. I'd like to maximize the disk space--not optimize the cleanup process. Andrew

On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry rbradbe...@gmail.com wrote: You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ?
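The drop-oldest policy described here can be sketched independently of Cassandra internals (hypothetical data layout; a real compaction strategy would operate on live sstable metadata, not tuples): given sstable sizes and creation times, delete the oldest until total usage falls under a target.

```python
# Sketch of FIFO cleanup: choose the oldest sstables to drop until total
# size fits under a disk budget. The (name, size, created) tuples are
# hypothetical stand-ins for real sstable metadata.

def sstables_to_drop(sstables, budget_bytes):
    """Return names of the oldest sstables to delete to get under budget.

    sstables: list of (name, size_bytes, created_ts) tuples.
    """
    total = sum(size for _, size, _ in sstables)
    drop = []
    for name, size, _ in sorted(sstables, key=lambda t: t[2]):  # oldest first
        if total <= budget_bytes:
            break
        drop.append(name)
        total -= size
    return drop

tables = [("a-1", 400, 100), ("a-2", 300, 200), ("a-3", 300, 300)]
print(sstables_to_drop(tables, 700))  # total 1000 > 700, so drop the oldest
```

The selection logic is the easy part; the open question in the thread is whether a custom AbstractCompactionStrategy is the right hook for actually deleting the chosen sstables.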
Re: Customized Compaction Strategy: Dev Questions
hmm, I see. So something similar to Capped Collections in MongoDB.
Re: Customized Compaction Strategy: Dev Questions
Thanks, Russell--yes, a similar concept, just applied to sstables. I'm assuming this would require changes to both major compactions, and probably GC (to remove the old tables), but since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible with the current toolset before I actually dived in and started tinkering. Andrew
Re: Customized Compaction Strategy: Dev Questions
I’m not sure what you want to do is feasible. At a high level I can see you running into issues with RF etc. The SSTables node to node are not identical, so if you drop a full SSTable on one node there is no corresponding SSTable on the adjacent nodes to drop. You would need to choose data to compact out, and ensure it is removed on all replicas as well. But if your problem is that you’re low on disk space then you probably won’t be able to write out a new SSTable with the older information compacted out. Also, there is more to an SSTable than just data; the SSTable could have tombstones and other relics that haven’t been cleaned up from nodes coming or going.
Linux containers, docker, SSD, and RAID.
Hey guys. Question about using containers with Cassandra. I think we will eventually deploy on containers… lxc with docker probably. Our first config will have one cassandra daemon per box. Of course there are issues here. A larger per-VM heap means more GC time and potential stop-the-world and latency issues. And we also have to run SSD on RAID which is no fun. So I think what we're planning on doing is running with 32-64GB boxes, with 8-16GB of memory per container. If we have 4x SSDs on a box, then we can have each container have its own SSD, its own memory, etc. One issue is data placement. Obviously we don't want to put all the data on the same box… so I was thinking of telling Cassandra that all the lxc containers on a given host are in the same rack. Right now there are data centers and racks, which you have to focus on in terms of replica placement. But now there's one additional level… host. So I was thinking we could just have rack IDs be rack.host… or rack_host. This way cassandra knows not to place a replica on the same host but just in a different container. Thoughts? -- Founder/CEO Spinn3r.com Location: San Francisco, CA Skype: burtonator blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
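The rack_host idea can be sanity-checked with a toy placement loop (illustrative only; this is not Cassandra's NetworkTopologyStrategy code, and every name here is made up): if each container's rack ID encodes the physical host, any rack-aware placement that refuses to repeat a rack also refuses to put two replicas in containers on the same host.

```python
# Toy rack-aware placement to illustrate the "<rack>_<host>" scheme.
# Not Cassandra code; purely a sketch of the constraint it would give.
def place_replicas(nodes, rf):
    """Pick up to rf replicas, never reusing a rack ID.
    nodes: ring-ordered list of (container, rack_id) tuples."""
    chosen, used_racks = [], set()
    for container, rack in nodes:
        if rack not in used_racks:
            chosen.append(container)
            used_racks.add(rack)
        if len(chosen) == rf:
            break
    return chosen

nodes = [
    ("c1", "rack1_hostA"), ("c2", "rack1_hostA"),  # same physical host
    ("c3", "rack1_hostB"), ("c4", "rack2_hostC"),
]
print(place_replicas(nodes, 3))  # ['c1', 'c3', 'c4'] - c2 skipped
```

Because c1 and c2 share the rack ID rack1_hostA, the loop skips c2, which is exactly the "no two replicas on one host" behavior the rack_host naming is meant to buy.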
Re: Too Many Open Files (sockets) - VNodes - Map/Reduce Job
(this is probably a better question for the user list - cc/reply-to set) Allow more files to be open :) http://www.datastax.com/documentation/cassandra/1.2/cassandra/install/installRecommendSettings.html -- Kind regards, Michael On 06/04/2014 12:15 PM, Florian Dambrine wrote: Hi everybody, We are running ElasticMapReduce jobs from Amazon on a 25-node Cassandra cluster (with VNodes). Since we increased the size of the cluster, we have been hitting a too-many-open-files (due to sockets) exception when creating the splits. Does anyone have an idea? Thanks, Here is the stacktrace: 14/06/04 03:23:24 INFO mapred.JobClient: Default number of map tasks: null 14/06/04 03:23:24 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 80 14/06/04 03:23:24 INFO mapred.JobClient: Default number of reduce tasks: 26 14/06/04 03:23:25 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache 14/06/04 03:23:25 INFO mapred.JobClient: Setting group to hadoop 14/06/04 03:23:41 ERROR transport.TSocket: Could not configure socket.
java.net.SocketException: Too many open files
	at java.net.Socket.createImpl(Socket.java:447)
	at java.net.Socket.getImpl(Socket.java:510)
	at java.net.Socket.setSoLinger(Socket.java:984)
	at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:118)
	at org.apache.thrift.transport.TSocket.init(TSocket.java:109)
	at org.apache.thrift.transport.TSocket.init(TSocket.java:94)
	at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:39)
	at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:558)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSubSplits(AbstractColumnFamilyInputFormat.java:286)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.access$200(AbstractColumnFamilyInputFormat.java:61)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:236)
	at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat$SplitCallable.call(AbstractColumnFamilyInputFormat.java:221)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
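A rough back-of-envelope shows why the split phase can blow through a default file-descriptor limit (both figures below are assumptions, not from the post: the default num_tokens of 256 per node, and roughly one Thrift connection per token range while getSubSplits runs):

```python
# Back-of-envelope: sockets opened while computing splits on a
# vnode cluster. num_tokens=256 is the Cassandra default, assumed here.
nodes, vnodes_per_node = 25, 256
default_ulimit = 1024  # typical default "nofile" soft limit

sockets_needed = nodes * vnodes_per_node
print(sockets_needed)                    # 6400
print(sockets_needed > default_ulimit)   # True - limit exceeded
```

Which is why the usual fix is exactly what the DataStax recommended-settings page linked above suggests: raise the nofile limit for the user running the job well past the default.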
Re: High latency on 5 node Cassandra Cluster
That is a pretty old version of Cassandra at this point. If you are using counters anywhere, you are probably seeing https://issues.apache.org/jira/browse/CASSANDRA-4578 which only shows up after you hit some arbitrary traffic threshold. If you don't want to upgrade (you really should), there was an update for the above in the 1.0 branch which was never released: https://github.com/apache/cassandra/blob/cassandra-1.0/CHANGES.txt#L2 -- Nate McCall Austin, TX @zznate Co-Founder & Sr. Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: Customized Compaction Strategy: Dev Questions
Let's say I run a major compaction every day, so that the oldest sstable contains only the data for January 1st. Assuming all the nodes are in sync and have had at least one repair run before the table is dropped (so that all information for that time period is the same), wouldn't it be safe to assume that the same data would be dropped on all nodes? There might be a period when the compaction is running where different nodes might have an inconsistent view of just that day's data (in that some would have it and others would not), but the cluster would still function and become eventually consistent, correct? Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed with it? I wouldn't be concerned with individual rows and columns, and this is a write-only table, more or less--the only deletes that occur in the current system are to delete the old data.
Re: Customized Compaction Strategy: Dev Questions
Maybe I’m misunderstanding something, but what makes you think that running a major compaction every day will cause the data from January 1st to exist in only one SSTable and not have data from other days in the SSTable as well? Are you talking about making a new compaction strategy that creates SSTables by day?
Re: Customized Compaction Strategy: Dev Questions
Sorry, yes, that is what I was looking to do--i.e., create a TopologicalCompactionStrategy or similar.
Re: Customized Compaction Strategy: Dev Questions
I'd suggest creating 1 table per day, and dropping the tables you don't need once you're done. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
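The table-per-day approach can be sketched as follows (the audit_YYYYMMDD naming scheme is hypothetical, not from the thread): keep a rolling window of daily audit tables and DROP whole tables as they age out, which reclaims disk immediately and without tombstones.

```python
# Sketch of a rolling window of per-day tables (hypothetical naming).
import datetime

def daily_table(day):
    return "audit_%s" % day.strftime("%Y%m%d")

def tables_to_drop(existing, today, keep_days=60):
    """Return table names older than the retention window.
    Names sort lexicographically in date order, so < works."""
    cutoff = today - datetime.timedelta(days=keep_days)
    return [t for t in existing if t < daily_table(cutoff)]

today = datetime.date(2014, 6, 4)
existing = [daily_table(today - datetime.timedelta(days=d))
            for d in range(90)]
print(len(tables_to_drop(existing, today)))  # 29 tables past the window
```

The dropped names would then be fed to DROP TABLE statements by whatever scheduler runs the cleanup.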
Re: Customized Compaction Strategy: Dev Questions
That still involves quite a bit of infrastructure work--it also means that to query the data, I would have to make N queries, one per table, to query for audit information (audit information is sorted by a key identifying the item, and then the date). I don't think this would yield any benefit (to me) over simply tombstoning the values or creating a secondary index on date and simply doing a DELETE, right? Is there something internally preventing me from implementing this as a separate Strategy? On Wed, Jun 4, 2014 at 10:47 AM, Jonathan Haddad j...@jonhaddad.com wrote: I'd suggest creating 1 table per day, and dropping the tables you don't need once you're done. On Wed, Jun 4, 2014 at 10:44 AM, Redmumba redmu...@gmail.com wrote: Sorry, yes, that is what I was looking to do--i.e., create a TopologicalCompactionStrategy or similar. On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry rbradbe...@gmail.com wrote: Maybe I’m misunderstanding something, but what makes you think that running a major compaction every day will cause they data from January 1st to exist in only one SSTable and not have data from other days in the SSTable as well? Are you talking about making a new compaction strategy that creates SSTables by day? On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote: Let's say I run a major compaction every day, so that the oldest sstable contains only the data for January 1st. Assuming all the nodes are in-sync and have had at least one repair run before the table is dropped (so that all information for that time period is the same), wouldn't it be safe to assume that the same data would be dropped on all nodes? There might be a period when the compaction is running where different nodes might have an inconsistent view of just that days' data (in that some would have it and others would not), but the cluster would still function and become eventually consistent, correct? 
Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed with it? I wouldn't be concerned with individual rows and columns, and this is a write-only table, more or less--the only deletes that occur in the current system are to delete the old data. On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry rbradbe...@gmail.com wrote: I’m not sure what you want to do is feasible. At a high level I can see you running into issues with RF etc. The SSTables node to node are not identical, so if you drop a full SSTable on one node there is no corresponding SSTable on the adjacent nodes to drop. You would need to choose data to compact out, and ensure it is removed on all replicas as well. But if your problem is that you’re low on disk space then you probably won’t be able to write out a new SSTable with the older information compacted out. Also, there is more to an SSTable than just data; the SSTable could have tombstones and other relics that haven’t been cleaned up from nodes coming or going. On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote: Thanks, Russell--yes, a similar concept, just applied to sstables. I'm assuming this would require changes to both major compactions, and probably GC (to remove the old tables), but since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible with the current toolset before I actually dived in and started tinkering. Andrew On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry rbradbe...@gmail.com wrote: hmm, I see. So something similar to Capped Collections in MongoDB. On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote: Not quite; if I'm at, say, 90% disk usage, I'd like to drop the oldest sstable rather than simply run out of space. The problem with using TTLs is that I have to try and guess how much data is being put in--since this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing, etc. 
I'd like to maximize the disk space--not optimize the cleanup process. Andrew On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry rbradbe...@gmail.com wrote: You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ? On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote: Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process. Since we're using Cassandra as an auditing system, this is particularly appealing to us because it means we can maximize the amount of auditing data we can keep while still allowing Cassandra to clear old data automatically. My idea is this: perform compaction based on the range of dates available in the sstable (or just metadata about when it was created). For example, a major compaction could create a combined sstable per day--so that, say, 60 days of data after a major compaction would contain 60 sstables.
Re: Customized Compaction Strategy: Dev Questions
Well, DELETE will not free up disk space until after GC grace has passed and the next major compaction has run. So in essence, if you need to free up space right away, then creating daily/monthly tables would be one way to go. Just remember to clear your snapshots after dropping though. On June 4, 2014 at 1:54:05 PM, Redmumba (redmu...@gmail.com) wrote: That still involves quite a bit of infrastructure work--it also means that to query the data, I would have to make N queries, one per table, to query for audit information (audit information is sorted by a key identifying the item, and then the date). I don't think this would yield any benefit (to me) over simply tombstoning the values or creating a secondary index on date and simply doing a DELETE, right? Is there something internally preventing me from implementing this as a separate Strategy?
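The table-per-day approach discussed in this thread is mostly bookkeeping outside Cassandra: create date-suffixed tables as you go and drop the ones past retention. A minimal sketch of that bookkeeping in Python (the audit_YYYYMMDD naming scheme is an assumption for illustration, not an established convention):

```python
from datetime import date, timedelta

def daily_table_name(d, prefix="audit"):
    """Date-suffixed table name, e.g. audit_20140101 (naming is illustrative)."""
    return "%s_%s" % (prefix, d.strftime("%Y%m%d"))

def tables_to_drop(existing, today, retention_days, prefix="audit"):
    """Return DROP TABLE statements for tables older than the retention
    window. Dropping a whole table reclaims disk right away (once the
    snapshots are cleared), unlike DELETE, which waits for gc_grace plus
    compaction."""
    keep = {daily_table_name(today - timedelta(days=i), prefix)
            for i in range(retention_days)}
    return ["DROP TABLE %s;" % t for t in sorted(existing) if t not in keep]
```

With four daily tables and a three-day retention window, only the oldest table is scheduled for dropping.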
Re: High latency on 5 node Cassandra Cluster
On Wed, Jun 4, 2014 at 12:12 AM, Arup Chakrabarti a...@pagerduty.com wrote: Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont) Replication Factor: 5 You're operating with a single-DC replication strategy across multiple data centers? If so, I'm surprised you ever get sane latency. (Or do you mean RF 2,2,1 across three DCs?) I agree with others that the version of Cassandra you are running has known Gossip problems that can cause cluster-wide outages. As a general piece of feedback, I suggest an upgrade: first to 1.1 HEAD, then to 1.2.16. =Rob
Re: New node Unable to gossip with any seeds
This generally means that the seed node's address, as listed in the seeds entry of the second node's cassandra.yaml, doesn't exactly match the address the seed node is actually using. CASSANDRA-6523 has some links that might be helpful. On 05/26/2014 12:07 AM, Tim Dunphy wrote: Hello, I am trying to spin up a new node using cassandra 2.0.7. Both nodes are at Digital Ocean. The seed node is up and running and I can telnet to port 7000 on that host from the node I'm trying to start.

[root@cassandra02 apache-cassandra-2.0.7]# telnet 10.10.1.94 7000
Trying 10.10.1.94...
Connected to 10.10.1.94.
Escape character is '^]'.

But when I start cassandra on the new node I see the following exception:

INFO 00:01:34,744 Handshaking version with /10.10.1.94
ERROR 00:02:05,733 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
java.lang.RuntimeException: Unable to gossip with any seeds
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
Exception encountered during startup: Unable to gossip with any seeds
ERROR 00:02:05,742 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
    at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1270)
    at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:573)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:745)

I'm using the murmur3 partitioner on both nodes and I have the seed node's IP listed in the cassandra.yaml of the new node. I'm just wondering what the issue might be and how I can get around it. Thanks Tim
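One quick sanity check for this class of failure is to compare the address the seed node is actually bound to against the entries in the new node's seeds list: they must match verbatim (a hostname in one place will not match an IP in the other). A rough sketch of that check in Python (the yaml handling is simplified; a real cassandra.yaml nests the list under seed_provider):

```python
def seeds_match(seed_listen_address, seeds_entry):
    """Return True if the address the seed node is bound to appears
    verbatim in the new node's comma-separated seeds entry. The comparison
    is literal, so 'cassandra01.example.com' will not match '10.10.1.94'
    even if DNS resolves them to the same host."""
    configured = [s.strip() for s in seeds_entry.split(",") if s.strip()]
    return seed_listen_address in configured
```

If this returns False for your pair of nodes, fixing the seeds entry to use the exact address the seed binds to is the first thing to try.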
Re: alternative vnode upgrade strategy?
On 05/28/2014 02:18 PM, William Oberman wrote: 1.) Upgrade all N nodes to vnodes in place Start loop 2.) Boot a new node and let it bootstrap 3.) Decommission an old node End loop It's been a while since I had to think about the vnode migration, but I think this would fall prey to https://issues.apache.org/jira/browse/CASSANDRA-5525
Re: Number of rows under one partition key
On Wed, Jun 4, 2014 at 12:39 PM, Chris Burroughs chris.burrou...@gmail.com wrote: https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ Although by the simplistic version-count heuristic, the sheer quantity of releases in the 2.0.x line would now satisfy the constraint. Yes, I was specific about the 2.0 line instead of pasting that post because 2.0 has shown itself to be slightly worse than the average major release. To answer Paulo's question, it is these serious-class bugs in the 2.0 line. I have yet to hear of a point release of 2.0.x which does not contain bugs I consider prohibitive for production use, though I have high hopes for 2.0.9. =Rob
Snapshot the data with 3 node and replicationfactor=3
Is there any reason you would want to take a snapshot of a column family on each node when the cluster consists of 3 nodes with a keyspace at replication factor 3? I am thinking of taking a snapshot of the CF on only one node. For restore, I will follow the steps below:
1. drop and recreate the CF on node1
2. copy the snapshotted files to node1's data directory for the CF
3. perform nodetool refresh on node1
Any suggestions/advice? ng
Re: problem removing dead node from ring
On Tue, Jun 3, 2014 at 9:03 PM, Matthew Allen matthew.j.al...@gmail.com wrote: Thanks Robert, this makes perfect sense. Do you know if CASSANDRA-6961 will be ported to 1.2.x ? I just asked driftx, he said not gonna happen. And apologies if these appear to be dumb questions, but is a repair more suitable than a rebuild because the rebuild only contacts 1 replica (per range), which may itself contain stale data ? Exactly that. https://issues.apache.org/jira/browse/CASSANDRA-2434 Discusses related issues in quite some detail. The tl;dr is that until 2434 is resolved, streams do not necessarily come from the node departing the range, and therefore the unique replica count is decreased by changing cluster topology. =Rob
Re: Snapshot the data with 3 node and replicationfactor=3
On Wed, Jun 4, 2014 at 1:26 PM, ng pipeli...@gmail.com wrote: Is there any reason you would like to take snapshot of column family on each node when cluster consists of 3 nodes with keyspace on replication factor =3? Unless all read/write occurs with CL.ALL (which is an availability problem), there is a nonzero chance of any given write not being on any given node at any given time. =Rob
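Rob's point can be made concrete. With RF=3 and QUORUM (2-of-3) writes, a write is only guaranteed to reach 2 replicas, so a snapshot taken on any single node can be missing some acknowledged writes until repair or read repair catches it up. A toy enumeration, purely illustrative:

```python
from itertools import combinations

def fraction_missing_on_one_node(rf=3, quorum=2):
    """Enumerate every replica set a quorum write could land on, and
    compute the fraction of those outcomes in which a fixed node
    (node 0) does not hold the write."""
    outcomes = list(combinations(range(rf), quorum))
    missing = sum(1 for o in outcomes if 0 not in o)
    return missing / len(outcomes)
```

For RF=3 with quorum 2 this comes out to 1/3: a third of the possible write placements skip any given node, which is why a one-node snapshot is only an approximation of the data. Only CL.ALL (quorum equal to RF) drives this to zero.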
Re: Snapshot the data with 3 node and replicationfactor=3
I am not worried about eventually consistent data; I just want a rough copy of the data, a close approximation. ng On Wed, Jun 4, 2014 at 2:49 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Jun 4, 2014 at 1:26 PM, ng pipeli...@gmail.com wrote: Is there any reason you would like to take snapshot of column family on each node when cluster consists of 3 nodes with keyspace on replication factor =3? Unless all read/write occurs with CL.ALL (which is an availability problem), there is a nonzero chance of any given write not being on any given node at any given time. =Rob
nodetool move seems slow
Hello, We have a 5-node cluster running cassandra 1.2.16, with a significant amount of data:

Address        Rack   Status  State   Load     Owns     Token
                                                        6783174585269344219
10.198.xx.xx1  rack1  Up      Normal  2.59 TB  60.00%   -9223372036854775808
10.198.xx.xx2  rack1  Up      Normal  1.49 TB  40.00%   -5534023222112865485
10.198.xx.xx3  rack1  Up      Normal  2.18 TB  53.23%   -1844674407370955162
10.198.xx.xx4  rack1  Up      Normal  2.86 TB  80.00%   5534023222112865484
10.198.xx.xx5  rack1  Up      Moving  2.32 TB  66.77%   6783174585269344219

The first three nodes (.xx1 - .xx3 above) were at the desired tokens, so I issued a move on .xx4: nodetool move 1844674407370955161 That was about 40hrs ago! When I do nodetool netstats, I do see apparent progress:

jatyler@xx4:~$ nodetool netstats
Mode: MOVING
Not sending any streams.
Streaming from: /10.198.xx.xx2
SyncCore: /var/cassandra/data/SyncCore/file-ic-31475-Data.db sections=1 progress=0/77699597 - 0%
…
SyncCore: /var/cassandra/data/SyncCore/anotherFile-ic-32252-Data.db sections=1 progress=0/1254063427 - 0%
Read Repair Statistics:
Attempted: 8047367
Mismatch (Blocking): 97327
Mismatch (Background): 74369
Pool Name   Active  Pending  Completed
Commands    n/a     0        472255111
Responses   n/a     1        749751322

I wrote 'apparent progress' because it reports “MOVING” and the Pending Commands/Responses are changing over time. However, I haven’t seen the individual .db files' progress go above 0%. Meanwhile, the system appears to have plenty of unused bandwidth, from 'iostat -x -m 1':

Device:  rrqm/s  wrqm/s  r/s      w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    56.00   1338.00  171.00  57.59  0.89   79.36     0.57      0.38   0.17   25.30
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          22.77  1.82   2.35     0.20     0.00    72.86
Device:  rrqm/s  wrqm/s  r/s      w/s     rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda      0.00    0.00    785.00   0.00    33.80   0.00   88.17     0.27      0.35   0.18   14.10
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          20.16  2.05   2.22     0.20     0.00    75.37

Is 40 hours too long for this move? Should I be seeing individual .db files report more progress? 
Should I start over with the first box (even though its token appears correct)? Any thoughts would be greatly appreciated. Thanks! Cheers, ~Jason
Re: Consolidating records and TTL
Just use an atomic batch that holds both the insert and the deletes: http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2 On Tue, Jun 3, 2014 at 2:13 PM, Charlie Mason charlie@gmail.com wrote: Hi All. I have a system that's going to make possibly several concurrent changes to a running total. I know I could use a counter for this. However, I have extra metadata I can store with the changes which would allow me to replay them. If I use a counter and it loses some writes, I can't recover: I will only have its current total, not the extra metadata to know where to replay from. What I was planning to do was write each change of the value to a CQL table with a TimeUUID as a row-level primary key as well as a partition key. Then when I need to read the running total back, I will do a query for all the changes and add them up to get the total. As there could be tens of thousands of these, I want to have a period after which they are consolidated. Most totals won't be anywhere near that size, but a few will, and I need to be able to support them. So I was also going to have a consolidated-total table which holds the UUID of the value consolidated up to. Since I can bound the query for the recent updates by that UUID, I should be able to avoid all the tombstones. So if the read encounters any changes that can be consolidated, it inserts a new consolidated value and deletes the newly consolidated changes. What I am slightly worried about is what happens if the consolidated-value insert fails but the deletes of the change records succeed. I would be left with an inconsistent total indefinitely. I have come up with a couple of ideas: 1, I could require all nodes to acknowledge the insert before deleting the difference records. 2, Maybe I could have another period after a change is consolidated but before it is deleted? 3, Is there any way I could use a TTL to allow it to be deleted after a period of time? Chances are another read would come in and fix the value. 
Anyone got any other suggestions on how I could implement this? Thanks, Charlie M -- Tyler Hobbs DataStax http://datastax.com/
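The consolidate-then-bound-by-UUID scheme Charlie describes can be sketched in plain Python. This is a model only: TimeUUIDs are replaced with integers for clarity, and the atomic (logged) batch Tyler recommends is modeled as a single function that applies the insert and the deletes together, so a partial failure cannot lose deltas.

```python
def read_total(consolidated, changes):
    """Running total = last consolidated value plus every change recorded
    after its marker (the query is bounded by the marker, skipping
    already-deleted, tombstoned changes)."""
    marker, base = consolidated
    return base + sum(delta for seq, delta in changes if seq > marker)

def consolidate(consolidated, changes):
    """Fold pending changes into the consolidated row and delete them.
    In Cassandra the insert and the deletes would go into one logged
    batch so they either all apply or none do."""
    marker, base = consolidated
    pending = [(seq, d) for seq, d in changes if seq > marker]
    if pending:
        new_marker = max(seq for seq, _ in pending)
        consolidated = (new_marker, base + sum(d for _, d in pending))
        changes = [(seq, d) for seq, d in changes if seq <= marker]  # deleted
    return consolidated, changes
```

Reads before and after consolidation return the same total, and changes arriving after the new marker are still picked up.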
Re: nodetool move seems slow
On Wed, Jun 4, 2014 at 2:34 PM, Jason Tyler jaty...@yahoo-inc.com wrote: I wrote 'apparent progress' because it reports “MOVING” and the Pending Commands/Responses are changing over time. However, I haven’t seen the individual .db files progress go above 0%. Your move is hung. Restart the affected nodes [1] and then restart the move. =Rob [1] https://issues.apache.org/jira/browse/CASSANDRA-3486
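One way to tell a hung stream from a merely slow one (assuming the netstats output format shown in Jason's message) is to capture the per-file progress counters twice, some minutes apart, and check whether any byte count moved. A rough sketch:

```python
import re

PROGRESS = re.compile(r"progress=(\d+)/(\d+)")

def stream_progress(netstats_output):
    """Extract (bytes_done, bytes_total) for each streamed-file line."""
    return [tuple(int(x) for x in m.groups())
            for m in PROGRESS.finditer(netstats_output)]

def looks_hung(sample_a, sample_b):
    """Given two nodetool netstats samples taken some time apart: if no
    file's byte count advanced between them and streams remain
    incomplete, the move is likely hung rather than slow."""
    a, b = stream_progress(sample_a), stream_progress(sample_b)
    return a == b and any(done < total for done, total in b)
```

Forty hours of every file sitting at 0% would satisfy this check, which matches Rob's diagnosis of a hung move.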
Re: migration to a new model
BTW you might want to put a LIMIT clause on your SELECT for testing. -ml On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael michael.la...@nytimes.com wrote: Marcelo, Here is a link to the preview of the python fast copy program: https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47 It will copy a table from one cluster to another with some transformation; the source and destination can be the same cluster. It has 3 main throttles to experiment with: 1. fetch_size: size of source pages in rows 2. worker_count: number of worker subprocesses 3. concurrency: number of async callback chains per worker subprocess It is easy to overrun Cassandra and the python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10. Additionally there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. retry is useful if you are getting timeouts. For the others, YMMV. To use it you need to define the SELECT and UPDATE cql statements as well as the 'map_fields' method. The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster and the driver sets up connection pools to the nodes in the cluster. Anyway, there are a lot of processes, threads, and callbacks going at once, so it is fun to watch. On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit; each row is about 6kb. ml On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael michael.la...@nytimes.com wrote: OK Marcelo, I'll work on it today. -ml On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Michael, For sure I would be interested in this program! I am new both to python and to cql. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work well. 
I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor Just in case it's useful for anything. However, I saw CQL has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful. My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSDs and 64 GB RAM. I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed, if that would help. Regards Marcelo. 2014-06-03 11:40 GMT-03:00 Laing, Michael michael.la...@nytimes.com: Hi Marcelo, I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this? With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable-size rows averaging 5kb. This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk re the python driver. ml On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Jens, Thanks for trying to help. Indeed, I know I can't do it using just CQL. But what would you use to migrate the data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but no success. I only have two nodes and less than 200Gb in this cluster; any simple way to extract the data quickly would be good enough for me. Best regards, Marcelo. 
2014-06-02 15:08 GMT-03:00 Jens Rantil jens.ran...@tink.se: Hi Marcelo, Looks like you can't do this without migrating your data manually: https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql Cheers, Jens On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi, I have some cql CFs in a 2 node Cassandra 2.0.8 cluster. I realized I created my column family with the wrong partition. Instead of: CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)) WITH caching=all; I used: CREATE TABLE IF NOT EXISTS entitylookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY (name, value, entity_id)) WITH caching=all; Now I need to migrate the data from the second CF to the first one. I am using Data Stax Community Edition. What would be the best way to convert data from one CF to the other? Best regards, Marcelo.
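The worker partitioning Michael describes, where subprocesses divide up the token range among themselves, can be sketched as follows. This assumes the Murmur3 partitioner's full signed 64-bit token range; the splitting arithmetic is an illustration, not the gist's actual code:

```python
MIN_TOKEN = -2**63       # Murmur3 partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3 partitioner maximum token

def split_token_range(worker_count, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Divide [lo, hi] into worker_count contiguous, non-overlapping
    sub-ranges so each worker can page through its share of the table
    independently (e.g. WHERE token(pk) >= start AND token(pk) <= end)."""
    span = hi - lo + 1
    ranges, start = [], lo
    for i in range(1, worker_count + 1):
        end = lo + span * i // worker_count - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

The sub-ranges cover the full token space exactly once, so the workers collectively copy every row without coordinating with each other.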