Re: Pending compactions not going down on some nodes of the cluster

2016-03-21 Thread Gianluca Borello
Thank you for your reply. To address your points:

- We are not running repairs

- Yes, we are storing timeseries-like binary blobs where data is heavily
TTLed (essentially the entire column family is incrementally refreshed with
completely new data every few days)

- I have tried increasing the number of compactors and even setting the
compaction throughput to unlimited, and nothing changes. Again, the key thing
to remember is that if I drain the node, the compactions completely stop after
a few minutes (as they would normally do on another "healthy" node); it's just
the "pending tasks" counter that stays high and messes up my metrics in
OpsCenter (see the commands after these points)

- As another data point: even if I increase the resources allocated to
compactions, I can pretty much measure that the disk I/O generated by
Cassandra is essentially the same as on the other nodes that have no pending
compactions. In other words, it really seems like the number of estimated
pending compactions is somehow bogus
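
For reference, these are roughly the commands behind the two points above:

$ nodetool setcompactionthroughput 0    # 0 = unthrottled
$ nodetool drain                        # stop accepting writes and flush memtables
$ nodetool compactionstats              # a few minutes later: no active compactions,
                                        # yet "pending tasks" still sits around ~130

and, for the disk I/O comparison (assuming similar instances, so the numbers
are directly comparable):

$ iostat -xm 10                         # per-device utilization and MB/s
$ nodetool compactionhistory            # recently completed compactions per table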

Thanks

On Mon, Mar 21, 2016 at 9:45 PM, Fabrice Facorat <fabrice.faco...@gmail.com>
wrote:

> Are you running repairs ?
>
> You may try:
> - increase concurrent_compactors to 8 (max in 2.1.x)
> - increase compaction_throughput to more than 16MB/s (48 may be a good
> start)
>
>
> What kind of data are you storing in these tables? Timeseries?
>
>
>
> 2016-03-21 23:37 GMT+01:00 Gianluca Borello <gianl...@sysdig.com>:
> > Thank you for your reply, definitely appreciate the tip on the compressed
> > size.
> >
> > I understand your point, in fact whenever we bootstrap a new node we see
> a
> > huge number of pending compactions (in the order of thousands), and they
> > usually decrease steadily until they reach 0 in just a few hours. With
> this
> > node, however, we are way beyond that point, it has been 3 days since the
> > number of pending compaction started fluctuating around ~150 without any
> > sign of going down (I can see from Opscenter it's almost a straight line
> > starting a few hours after the bootstrap). In particular, to reply to
> your
> > point:
> >
> > - The number of sstables for this CF on this node is around 250, which
> is in
> > the same range of all the other nodes in the cluster (I counted the
> number
> > on each one of them, and every node is in the 200-400 range)
> >
> > - This theory doesn't seem to explain why, when doing "nodetool drain",
> the
> > compactions completely stop after a few minutes and I get something such
> as:
> >
> > $ nodetool compactionstats
> > pending tasks: 128
> >
> > So no compactions being executed (since there is no more write activity),
> > but the pending number is still high.
> >
> > Thanks again
> >
> >
> > On Mon, Mar 21, 2016 at 3:19 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> > wrote:
> >>
> >> > We added a bunch of new nodes to a cluster (2.1.13) and everything
> went
> >> > fine, except for the number of pending compactions that is staying
> quite
> >> > high on a subset of the new nodes. Over the past 3 days, the pending
> >> > compactions have never been less than ~130 on such nodes, with peaks
> of
> >> > ~200.
> >>
> >> When you bootstrap with Vnodes, you end up with thousands (or tens of
> >> thousands) of sstables – with 256 Vnodes (default) * 20 sstables per
> node,
> >> your resulting node will have 5k sstables. It takes quite a while for
> >> compaction to chew through that. If you added a bunch of nodes in
> sequence,
> >> you’d have 5k on the first node, then potentially 10k on the next, and
> could
> >> potentially keep increasing as you start streaming from nodes that have
> way
> >> too many sstables.  This is one of the reasons that many people who
> have to
> >> grow their clusters frequently try not to use vnodes.
> >>
> >> From your other email:
> >>
> >> > Also related to this point, now I'm seeing something even more odd:
> some
> >> > compactions are way bigger than the size of the column family itself,
> such
> >> > as:
> >>
> >> The size reported by compactionstats is the uncompressed size – if
> you’re
> >> using compression, it’s perfectly reasonable for 30G of data to show up
> as
> >> 118G of data during compaction.
> >>
> >> - Jeff
> >>
> >> From: Gianluca Borello
> >> Reply-To: "user@cassandra.apache.org"
> >> Date: Monday, March 21, 2016 at 12:50 PM
> >> To: "user@cassandra.apache.org"
>

Re: Pending compactions not going down on some nodes of the cluster

2016-03-21 Thread Gianluca Borello
Thank you for your reply; I definitely appreciate the tip on the compressed
size.
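
Indeed, the numbers add up: with the ~0.15 SSTable Compression Ratio that
cfstats reports for message_data1, the ~35 GB on disk corresponds to roughly
35 / 0.1546 ≈ 226 GB of uncompressed data, so an uncompressed compaction total
of 118.73 GB is entirely plausible:

$ nodetool cfstats -H draios.message_data1 | grep 'SSTable Compression Ratio'
$ echo '35 / 0.1546' | bc -l    # ~226 (GB, rough uncompressed estimate)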

I understand your point, in fact whenever we bootstrap a new node we see a
huge number of pending compactions (in the order of thousands), and they
usually decrease steadily until they reach 0 in just a few hours. With this
node, however, we are way beyond that point: it has been 3 days since the
number of pending compactions started fluctuating around ~150 without any
sign of going down (I can see from OpsCenter that it's almost a straight line
starting a few hours after the bootstrap). In particular, to reply to your
point:

- The number of sstables for this CF on this node is around 250, which is
in the same range as all the other nodes in the cluster (I counted the
number on each one of them, and every node is in the 200-400 range; a quick
way to check is shown below)

- This theory doesn't seem to explain why, when doing "nodetool drain", the
compactions completely stop after a few minutes and I get something such as:

$ nodetool compactionstats
pending tasks: 128

So no compactions being executed (since there is no more write activity),
but the pending number is still high.
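
Regarding the sstable count mentioned above, a quick way to read it per node
is straight from cfstats, e.g.:

$ nodetool cfstats draios.message_data1 | grep -E 'SSTable count|SSTables in each level'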

Thanks again


On Mon, Mar 21, 2016 at 3:19 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
wrote:

> > We added a bunch of new nodes to a cluster (2.1.13) and everything went
> fine, except for the number of pending compactions that is staying quite
> high on a subset of the new nodes. Over the past 3 days, the pending
> compactions have never been less than ~130 on such nodes, with peaks of
> ~200.
>
> When you bootstrap with Vnodes, you end up with thousands (or tens of
> thousands) of sstables – with 256 Vnodes (default) * 20 sstables per node,
> your resulting node will have 5k sstables. It takes quite a while for
> compaction to chew through that. If you added a bunch of nodes in sequence,
> you’d have 5k on the first node, then potentially 10k on the next, and
> could potentially keep increasing as you start streaming from nodes that
> have way too many sstables.  This is one of the reasons that many people
> who have to grow their clusters frequently try not to use vnodes.
>
> From your other email:
>
> > Also related to this point, now I'm seeing something even more odd:
> some compactions are way bigger than the size of the column family itself,
> such as:
>
> The size reported by compactionstats is the uncompressed size – if you’re
> using compression, it’s perfectly reasonable for 30G of data to show up as
> 118G of data during compaction.
>
> - Jeff
>
> From: Gianluca Borello
> Reply-To: "user@cassandra.apache.org"
> Date: Monday, March 21, 2016 at 12:50 PM
> To: "user@cassandra.apache.org"
> Subject: Pending compactions not going down on some nodes of the cluster
>
> Hi,
>
> We added a bunch of new nodes to a cluster (2.1.13) and everything went
> fine, except for the number of pending compactions that is staying quite
> high on a subset of the new nodes. Over the past 3 days, the pending
> compactions have never been less than ~130 on such nodes, with peaks of
> ~200. On the other nodes, they correctly fluctuate between 0 and ~20, which
> has been our norm for a long time.
>
> We are quite paranoid about pending compactions because in the past such
> high number caused a lot of data being brought in memory during some reads
> and that triggered a chain reaction of full GCs that brought down our
> cluster, so we try to monitor them closely.
>
> Some data points that should let the situation speak for itself:
>
> - We use LCS for all our column families
>
> - The cluster is operating absolutely fine and seems healthy, and every
> node is handling pretty much the same load in terms of reads and writes.
> Also, these nodes with higher pending compactions don't seem in any way
> performing worse than the others
>
> - The pending compactions don't go down even when setting the compaction
> throughput to unlimited for a very long time
>
> - This is the typical output of compactionstats and tpstats:
>
> $ nodetool compactionstats
> pending tasks: 137
>compaction type   keyspacetable completed total
>unit   progress
> Compaction draios   message_data6061112083946939536890
>   bytes 88.06%
> Compaction draiosmessage_data1   26473390790   37243294809
>   bytes 71.08%
> Active compaction remaining time :n/a
>
> $ nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
>  All time blocked
> CounterMutationStage  0 0  0 0
> 0
> ReadStage 1 0  111766844 0
> 0
> RequestResponseStage 

Re: Pending compactions not going down on some nodes of the cluster

2016-03-21 Thread Gianluca Borello
On Mon, Mar 21, 2016 at 12:50 PM, Gianluca Borello <gianl...@sysdig.com>
wrote:

>
> - It's also interesting to notice how the compaction in the previous
> example is trying to compact ~37 GB, which is essentially the whole size of
> the column family message_data1 as reported by cfstats:
>

Also related to this point, now I'm seeing something even more odd: some
compactions are way bigger than the size of the column family itself, such
as:

$ nodetool compactionstats -H
pending tasks: 110
   compaction type   keyspace           table   completed       total   unit   progress
        Compaction     draios   message_data1    26.28 GB   118.73 GB   bytes     22.13%
Active compaction remaining time :   0h04m30s

It says the total is 118.73 GB, but that column family never got anywhere
close to that size; it has always stayed around ~30 GB:

$ du -hs /raid0/cassandra/data/draios/message_data1-*
35G
/raid0/cassandra/data/draios/message_data1-ad87f550ea2b11e5a528dde586fa678e


Re: Pending compactions not going down on some nodes of the cluster

2016-03-21 Thread Gianluca Borello
On Mon, Mar 21, 2016 at 2:15 PM, Alain RODRIGUEZ  wrote:

>
> What hardware do you use? Can you see it running at the limits (CPU /
> disks IO)? Is there any error on system logs, are disks doing fine?
>
>
Nodes are c3.2xlarge instances on AWS. The nodes are relatively idle and,
as said in the original email, the other nodes are handling the same load
and are doing just fine (no warnings or errors in the system and Cassandra
logs). I mean, even this node is doing very fine; it seems as if there is
some internal bug in the way pending tasks are counted.


> How is the concurrent_compactors setting configured? It looks like you are
> using 2, judging from the 'nodetool compactionstats' output. Could you
> increase that, or is it already the number of cores of the machine?
>
>
It's 2. I can certainly increase it but, as I showed earlier, it doesn't
make a difference: if I do a "nodetool drain" and wait a while, the
compactions completely stop (and the node goes totally idle) despite the
pending compactions remaining high, so this really doesn't feel like a
problem of Cassandra not being able to keep up with the compactions.
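
For completeness, these are the two settings being discussed (the
cassandra.yaml path here is just the usual package default); the throughput
one can also be changed on the fly with nodetool:

$ grep -E 'concurrent_compactors|compaction_throughput_mb_per_sec' /etc/cassandra/cassandra.yaml
$ nodetool setcompactionthroughput 48   # MB/s, 0 means unlimited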


Pending compactions not going down on some nodes of the cluster

2016-03-21 Thread Gianluca Borello
Hi,

We added a bunch of new nodes to a cluster (2.1.13) and everything went
fine, except for the number of pending compactions that is staying quite
high on a subset of the new nodes. Over the past 3 days, the pending
compactions have never been less than ~130 on such nodes, with peaks of
~200. On the other nodes, they correctly fluctuate between 0 and ~20, which
has been our norm for a long time.

We are quite paranoid about pending compactions because in the past such a
high number caused a lot of data to be brought into memory during some reads,
and that triggered a chain reaction of full GCs that brought down our
cluster, so we try to monitor them closely.

Some data points that should let the situation speak for itself:

- We use LCS for all our column families

- The cluster is operating absolutely fine and seems healthy, and every
node is handling pretty much the same load in terms of reads and writes.
Also, these nodes with higher pending compactions don't seem to be
performing any worse than the others

- The pending compactions don't go down even when setting the compaction
throughput to unlimited for a very long time

- This is the typical output of compactionstats and tpstats:

$ nodetool compactionstats
pending tasks: 137
   compaction type   keyspace           table     completed         total   unit   progress
        Compaction     draios    message_data     6061112083946939536890    bytes     88.06%
        Compaction     draios   message_data1   26473390790   37243294809   bytes     71.08%
Active compaction remaining time :        n/a

$ nodetool tpstats
Pool Name                    Active   Pending   Completed   Blocked   All time blocked
CounterMutationStage              0         0           0         0                  0
ReadStage                         1         0   111766844         0                  0
RequestResponseStage              0         0   244259493         0                  0
MutationStage                     0         0   163268653         0                  0
ReadRepairStage                   0         0     8933323         0                  0
GossipStage                       0         0      363003         0                  0
CacheCleanupExecutor              0         0           0         0                  0
AntiEntropyStage                  0         0           0         0                  0
MigrationStage                    0         0           2         0                  0
Sampler                           0         0           0         0                  0
ValidationExecutor                0         0           0         0                  0
CommitLogArchiver                 0         0           0         0                  0
MiscStage                         0         0           0         0                  0
MemtableFlushWriter               0         0       32644         0                  0
MemtableReclaimMemory             0         0       32644         0                  0
PendingRangeCalculator            0         0         527         0                  0
MemtablePostFlush                 0         0       36565         0                  0
CompactionExecutor                2        70      108621         0                  0
InternalResponseStage             0         0           0         0                  0
HintedHandoff                     0         0          10         0                  0
Native-Transport-Requests         6         0   188996929         0              79122

Message type   Dropped
RANGE_SLICE  0
READ_REPAIR  0
PAGED_RANGE  0
BINARY   0
READ 0
MUTATION 0
_TRACE   0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0

- If I do a nodetool drain on such nodes, and then wait for a while, the
number of pending compactions stays high even if there are no compactions
being executed anymore and the node is completely idle:

$ nodetool compactionstats
pending tasks: 128

- It's also interesting to notice how the compaction in the previous
example is trying to compact ~37 GB, which is essentially the whole size of
the column family message_data1 as reported by cfstats:

$ nodetool cfstats -H draios.message_data1
Keyspace: draios
Read Count: 208168
Read Latency: 2.4791508685292647 ms.
Write Count: 502529
Write Latency: 0.20701542000561163 ms.
Pending Flushes: 0
Table: message_data1
SSTable count: 261
SSTables in each level: [43/4, 92/10, 125/100, 0, 0, 0, 0, 0, 0]
Space used (live): 36.98 GB
Space used (total): 36.98 GB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 36.21 MB
SSTable Compression Ratio: 0.15461126176169512
Number of keys (estimate): 101025
Memtable cell count: 229344
Memtable data size: 82.4 MB
Memtable off heap memory used: 0 bytes
Memtable switch count: 83
Local read count: 208225
Local read latency: 2.479 ms
Local write count: 502581
Local write latency: 0.208 ms
Pending 

Re: Unexpected high internode network activity

2016-02-26 Thread Gianluca Borello
Thank you for your reply.

- Repairs are not running on the cluster, in fact we've been "slacking"
when it comes to repair, mainly because we never manually delete our data
as it's always TTLed and we haven't had major failures or outages that
required repairing data (I know that's not a good reason anyway)

- We are not using server-to-server encryption

- internode_compression is set to all, and the application driver is lz4

- I just did a "nodetool flush && service cassandra restart" on one node of
the affected cluster and let it run for a few minutes, and these are the
statistics (all the nodes get the same ratio of network activity on port
9042 and port 7000, so pardon my raw estimates below in assuming that the
activity of a single node can reflect the activity of the whole cluster):

9042 traffic: 400 MB (split between 200 MB reads and 200 MB writes)
7000 traffic: 5 GB (counted twice by iftop, so 2.5 GB)

$ nodetool netstats -H
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 10167
Mismatch (Blocking): 210
Mismatch (Background): 151
Pool NameActive   Pending  Completed
Commandsn/a 0 422986
Responses   n/a 0 403144

If I do the same on a test cluster (with less activity and nodes but same
RF and configuration), I get, always for a single node:

9042 traffic: 250 MB (split between 100 MB reads and 150 MB writes)
7000 traffic: 1 GB (counted twice by iftop, so 500 MB)

$ nodetool netstats -H
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 6668
Mismatch (Blocking): 159
Mismatch (Background): 43
Pool NameActive   Pending  Completed
Commandsn/a 0 125202
Responses   n/a 0 141708

So, once again, in one cluster the internode activity is ~7 times the 9042
one, whereas in the test cluster it is ~2, which is expected.
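
For reference, the per-port numbers come from iftop with a bpf filter on each
port, along these lines (interface name aside):

$ iftop -i eth0 -f "port 9042"   # client traffic
$ iftop -i eth0 -f "port 7000"   # internode traffic (each byte seen on two nodes)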

Thanks


On Fri, Feb 26, 2016 at 10:04 AM, Nate McCall 
wrote:

>
>> Unfortunately, these numbers still don't match at all.
>>
>> And yes, the cluster is in a single DC and since I am using the EC2
>> snitch, replicas are AZ aware.
>>
>>
> Are repairs running on the cluster?
>
> Other thoughts:
> - is internode_compression set to 'all' in cassandra.yaml (should be 'all'
> by default, but worth checking since you are using lz4 on the client)?
> - are you using server-to-server encryption ?
>
> You can compare the output of nodetool netstats on the test cluster with
> the AWS cluster as well to see if anything sticks out.
>
>
> --
> -
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Unexpected high internode network activity

2016-02-26 Thread Gianluca Borello
I understand your point about the billing, but billing here was merely
the triggering factor that had me start analyzing the traffic in the first
place.

At the moment, I'm not considering the numbers on my bill anymore but
simply the numbers that I am measuring with iftop on each node of the
cluster. If I measure the total traffic on port 7000, I see 35 GB in the
example above; since each byte is counted twice by iftop (because I'm
running it on every node), the cluster generated 17.5 GB of unique network
activity, and I am trying to explain that number in relation to the traffic
I'm seeing on port 9042, billing aside.

Unfortunately, these numbers still don't match at all.

And yes, the cluster is in a single DC and since I am using the EC2 snitch,
replicas are AZ aware.

Thanks

On Thursday, February 25, 2016, daemeon reiydelle <daeme...@gmail.com>
wrote:

> Hmm. From the AWS FAQ:
>
> *Q: If I have two instances in different availability zones, how will I be
> charged for regional data transfer?*
>
> Each instance is charged for its data in and data out. Therefore, if data
> is transferred between these two instances, it is charged out for the first
> instance and in for the second instance.
>
>
> I really am not seeing this factored into your numbers fully. If data
> transfer is only twice as much as expected, the above billing would seem to
> put the numbers in line. Since (I assume) you have one copy in EACH AZ (dc
> aware but really dc=az) I am not seeing the bandwidth as that much out of
> line.
>
>
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Thu, Feb 25, 2016 at 11:00 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>
>> It is indeed very intriguing and I really hope to learn more from the
>> experience of this mailing list. To address your points:
>>
>> - The theory that full data is coming from replicas during reads is not
>> enough to explain the situation. In my scenario, over a time window I had
>> 17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of
>> reads (measured on port 9042), so even if both reads and writes affected
>> all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on
>> port 7000 unaccounted
>>
>> - We are doing regular backups the standard way, using periodic snapshots
>> and synchronizing them to S3. This traffic is not part of the anomalous
>> traffic we're seeing above, since this one goes on port 80 and it's clearly
>> visible with a separate bpf filter, and its magnitude is far lower than
>> that anyway
>>
>> Thanks
>>
>> On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle <daeme...@gmail.com> wrote:
>>
>>> Intriguing. It's enough data to look like full data is coming from the
>>> replicants instead of digests when the read of the copy occurs. Are you
>>> doing backup/dr? Are directories copied regularly and over the network or ?
>>>
>>>
>>> Daemeon C.M. Reiydelle
>>> USA (+1) 415.501.0198
>>> London (+44) (0) 20 8144 9872
>>>
>>> On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>>>
>>>> Thank you for your reply.
>>>>
>>>> To answer your points:
>>>>
>>>> - I fully agree on the write volume, in fact my isolated tests confirm
>>>> your estimation
>>>>
>>>> - About the read, I agree as well, but the volume of data is still much
>>>> higher
>>>>
>>>> - I am writing to one single keyspace with RF 3, there's just one
>>>> keyspace
>>>>
>>>> - I am not using any indexes, the column families are very simple
>>>>
>>>> - I am aware of the double count, in fact, I measured the traffic on
>>>> port 9042 at the client side (so just counted once) and I divided by two
>>>> the traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All
>>>> the measurements have been done with iftop with proper bpf filters on the
>>>> port and the total traffic matches what I see in cloudwatch (divided by 
>>>> two)
>>>>
>>>> So unfortunately I still don't have any ideas about what's going on and
>>>> why I'm seeing 17 GB of internode traffic instead of ~ 5-6.

Re: Unexpected high internode network activity

2016-02-25 Thread Gianluca Borello
It is indeed very intriguing and I really hope to learn more from the
experience of this mailing list. To address your points:

- The theory that full data is coming from replicas during reads is not
enough to explain the situation. In my scenario, over a time window I had
17.5 GB of intra node activity (port 7000) for 1 GB of writes and 1.5 GB of
reads (measured on port 9042), so even if both reads and writes affected
all replicas, I would have (1 + 1.5) * 3 = 7.5 GB, still leaving 10 GB on
port 7000 unaccounted

- We are doing regular backups the standard way, using periodic snapshots
and synchronizing them to S3. This traffic is not part of the anomalous
traffic we're seeing above, since this one goes on port 80 and it's clearly
visible with a separate bpf filter, and its magnitude is far lower than
that anyway

Thanks

On Thu, Feb 25, 2016 at 9:03 PM, daemeon reiydelle <daeme...@gmail.com>
wrote:

> Intriguing. It's enough data to look like full data is coming from the
> replicants instead of digests when the read of the copy occurs. Are you
> doing backup/dr? Are directories copied regularly and over the network or ?
>
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Thu, Feb 25, 2016 at 8:12 PM, Gianluca Borello <gianl...@sysdig.com>
> wrote:
>
>> Thank you for your reply.
>>
>> To answer your points:
>>
>> - I fully agree on the write volume, in fact my isolated tests confirm
>> your estimation
>>
>> - About the read, I agree as well, but the volume of data is still much
>> higher
>>
>> - I am writing to one single keyspace with RF 3, there's just one
>> keyspace
>>
>> - I am not using any indexes, the column families are very simple
>>
>> - I am aware of the double count, in fact, I measured the traffic on port
>> 9042 at the client side (so just counted once) and I divided by two the
>> traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the
>> measurements have been done with iftop with proper bpf filters on the
>> port and the total traffic matches what I see in cloudwatch (divided by two)
>>
>> So unfortunately I still don't have any ideas about what's going on and
>> why I'm seeing 17 GB of internode traffic instead of ~ 5-6.
>>
>> On Thursday, February 25, 2016, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> If read & write at quorum then you write 3 copies of the data then
>>> return to the caller; when reading you read one copy (assume it is not on
>>> the coordinator), and 1 digest (because read at quorum is 2, not 3).
>>>
>>> When you insert, how many keyspaces get written to? (Are you using e.g.
>>> inverted indices?) That is my guess, that your db has about 1.8 bytes
>>> written for every byte inserted.
>>>
>>> Every byte you write is counted also as a read (system a sends 1gb to
>>> system b, so system b receives 1gb). You would not be charged if intra AZ,
>>> but inter AZ and inter DC will get that double count.
>>>
>>> So, my guess is reverse indexes, and you forgot to include receive and
>>> transmit.
>>>
>>>
>>> Daemeon C.M. Reiydelle
>>> USA (+1) 415.501.0198
>>> London (+44) (0) 20 8144 9872
>>>
>>> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gianl...@sysdig.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live
>>>> applications. There's a total of 21 nodes across 3 AWS availability zones,
>>>> c3.2xlarge instances.
>>>>
>>>> The configuration is pretty standard, we use the default settings that
>>>> come with the datastax AMI and the driver in our application is configured
>>>> to use lz4 compression. The keyspace where all the activity happens has RF
>>>> 3 and we read and write at quorum to get strong consistency.
>>>>
>>>> While analyzing our monthly bill, we noticed that the amount of network
>>>> traffic related to Cassandra was significantly higher than expected. After
>>>> breaking it down by port, it seems like over any given time, the internode
>>>> network activity is 6-7 times higher than the traffic on port 9042, whereas
>> we would expect something around 2-3 times, given the replication factor
>> and the consistency level of our queries.

Re: Unexpected high internode network activity

2016-02-25 Thread Gianluca Borello
Thank you for your reply.

To answer your points:

- I fully agree on the write volume, in fact my isolated tests confirm
your estimation

- About the read, I agree as well, but the volume of data is still much
higher

- I am writing to one single keyspace with RF 3, there's just one keyspace

- I am not using any indexes, the column families are very simple

- I am aware of the double count, in fact, I measured the traffic on port
9042 at the client side (so just counted once) and I divided by two the
traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the
measurements have been done with iftop with proper bpf filters on the
port and the total traffic matches what I see in cloudwatch (divided by two)

So unfortunately I still don't have any ideas about what's going on and why
I'm seeing 17 GB of internode traffic instead of ~ 5-6.

On Thursday, February 25, 2016, daemeon reiydelle <daeme...@gmail.com>
wrote:

> If read & write at quorum then you write 3 copies of the data then return
> to the caller; when reading you read one copy (assume it is not on the
> coordinator), and 1 digest (because read at quorum is 2, not 3).
>
> When you insert, how many keyspaces get written to? (Are you using e.g.
> inverted indices?) That is my guess, that your db has about 1.8 bytes
> written for every byte inserted.
>
> Every byte you write is counted also as a read (system a sends 1gb to
> system b, so system b receives 1gb). You would not be charged if intra AZ,
> but inter AZ and inter DC will get that double count.
>
> So, my guess is reverse indexes, and you forgot to include receive and
> transmit.
>
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gianl...@sysdig.com> wrote:
>
>> Hello,
>>
>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live
>> applications. There's a total of 21 nodes across 3 AWS availability zones,
>> c3.2xlarge instances.
>>
>> The configuration is pretty standard, we use the default settings that
>> come with the datastax AMI and the driver in our application is configured
>> to use lz4 compression. The keyspace where all the activity happens has RF
>> 3 and we read and write at quorum to get strong consistency.
>>
>> While analyzing our monthly bill, we noticed that the amount of network
>> traffic related to Cassandra was significantly higher than expected. After
>> breaking it down by port, it seems like over any given time, the internode
>> network activity is 6-7 times higher than the traffic on port 9042, whereas
>> we would expect something around 2-3 times, given the replication factor
>> and the consistency level of our queries.
>>
>> For example, this is the network traffic broken down by port and
>> direction over a few minutes, measured as sum of each node:
>>
>> Port 9042 from client to cluster (write queries): 1 GB
>> Port 9042 from cluster to client (read queries): 1.5 GB
>> Port 7000: 35 GB, which must be divided by two because the traffic is
>> always directed to another instance of the cluster, so that makes it 17.5
>> GB generated traffic
>>
>> The traffic on port 9042 completely matches our expectations, we do about
>> 100k write operations writing 10KB binary blobs for each query, and a bit
>> more reads on the same data.
>>
>> According to our calculations, in the worst case, when the coordinator of
>> the query is not a replica for the data, this should generate about (1 +
>> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>>
>> Also, hinted handoffs are disabled and nodes are healthy over the period
>> of observation, and I get the same numbers across pretty much every time
>> window, even including an entire 24 hours period.
>>
>> I tried to replicate this problem in a test environment so I connected a
>> client to a test cluster done in a bunch of Docker containers (same
>> parameters, essentially the only difference is the
>> GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
>> expect, the amount of traffic on port 7000 is between 2 and 3 times the
>> amount of traffic on port 9042 and the queries are pretty much the same
>> ones.
>>
>> Before doing more analysis, I was wondering if someone has an explanation
>> on this problem, since perhaps we are missing something obvious here?
>>
>> Thanks
>>
>>
>>
>


Unexpected high internode network activity

2016-02-25 Thread Gianluca Borello
Hello,

We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge
instances.

The configuration is pretty standard, we use the default settings that come
with the datastax AMI and the driver in our application is configured to
use lz4 compression. The keyspace where all the activity happens has RF 3
and we read and write at quorum to get strong consistency.

While analyzing our monthly bill, we noticed that the amount of network
traffic related to Cassandra was significantly higher than expected. After
breaking it down by port, it seems like over any given time, the internode
network activity is 6-7 times higher than the traffic on port 9042, whereas
we would expect something around 2-3 times, given the replication factor
and the consistency level of our queries.

For example, this is the network traffic broken down by port and direction
over a few minutes, measured as sum of each node:

Port 9042 from client to cluster (write queries): 1 GB
Port 9042 from cluster to client (read queries): 1.5 GB
Port 7000: 35 GB, which must be divided by two because the traffic is
always directed to another instance of the cluster, so that makes it 17.5
GB generated traffic

The traffic on port 9042 completely matches our expectations, we do about
100k write operations writing 10KB binary blobs for each query, and a bit
more reads on the same data.

According to our calculations, in the worst case, when the coordinator of
the query is not a replica for the data, this should generate about (1 +
1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.

Also, hinted handoffs are disabled and nodes are healthy over the period of
observation, and I get the same numbers across pretty much every time
window, even including an entire 24 hours period.

I tried to replicate this problem in a test environment so I connected a
client to a test cluster done in a bunch of Docker containers (same
parameters, essentially the only difference is the
GossipingPropertyFileSnitch instead of the EC2 one) and I always get what I
expect, the amount of traffic on port 7000 is between 2 and 3 times the
amount of traffic on port 9042 and the queries are pretty much the same
ones.
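
The test cluster itself is nothing special; it is roughly the equivalent of
starting a few containers from the official image and pointing them at the
same seed (snitch and the other settings aside):

$ docker run -d --name cass1 cassandra:2.1
$ docker run -d --name cass2 \
    -e CASSANDRA_SEEDS="$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' cass1)" \
    cassandra:2.1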

Before doing more analysis, I was wondering if someone has an explanation
on this problem, since perhaps we are missing something obvious here?

Thanks


Re: Performance issues with "many" CQL columns

2016-02-14 Thread Gianluca Borello
Considering the (simplified) table that I wrote before:

create table data (
id bigint,
ts bigint,
column1 blob,
column2 blob,
column3 blob,
...
column29 blob,
column30 blob,
primary key (id, ts)
);

A user request (varies every time) translates into a set of queries asking
a subset of the columns (< 10) for a specific set of sensors (< 100) for a
specific time range (< 300):

SELECT column1, column7, column20, column25 FROM data where id =
SENSOR_ID_1 and ts > x and ts < y
SELECT column1, column7, column20, column25 FROM data where id =
SENSOR_ID_2 and ts > x and ts < y
...
SELECT column1, column7, column20, column25 FROM data where id =
SENSOR_ID_N and ts > x and ts < y

To answer your question, each non-EQ predicate on timestamp returns a few
hundred rows (it's essentially a time series).

If I put the column number as clustering key with the timestamp, I'll have
to further increase the number of queries and make my code more complicated:

SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and
column_number = 1
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and
column_number = 7
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and
column_number = 20
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and
column_number = 25
...

Again, not too terrible and I'll definitely have to do something similar
because the performance penalty I'm paying now is very significant, but by
all means this seems to me a complication in the data model (and in my
application).
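
For reference, here is a sketch of the restructured table being discussed
(data_v2, col and value are made-up names). Note that the column number has to
come before the timestamp in the clustering key, otherwise the per-column
queries above (EQ on the column number plus a range on ts) would not be
allowed:

create table data_v2 (
    id bigint,
    col int,
    ts bigint,
    value blob,
    primary key (id, col, ts)
) with compact storage;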

Thanks again


On Sun, Feb 14, 2016 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> What does your query actually look like today?
>
> Is your non-EQ on timestamp selecting a single row, a few rows, or many rows
> (dozens, hundreds, thousands)?
>
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 7:40 PM, Gianluca Borello <gianl...@sysdig.com>
> wrote:
>
>> Thanks again.
>>
>> One clarification about "reading in a single SELECT": in my point 2, I
>> mentioned the need to read a variable subset of columns every time, usually
>> in the range of ~5 out of 30. I can't find a way to do that in a single
>> SELECT unless I use the IN operator (which I can't, as explained).
>>
>> Is there any other method you were thinking of, or your "reading in a
>> single SELECT" is just applicable when I need to read the whole set of
>> columns (which is never my case, unfortunately)?
>>
>> Thanks
>>
>>
>> On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <jack.krupan...@gmail.com
>> > wrote:
>>
>>> You can definitely read all of columns in a single SELECT. And the
>>> n-INSERTS can be batched and will insert fewer cells in the storage engine
>>> than the previous approach.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com>
>>> wrote:
>>>
>>>> Thank you for your reply.
>>>>
>>>> Your advice is definitely sound, although it still seems suboptimal to
>>>> me because:
>>>>
>>>> 1) It requires N INSERT queries from the application code (where N is
>>>> the number of columns)
>>>>
>>>> 2) It requires N SELECT queries from my application code (where N is
>>>> the number of columns I need to read at any given time, which is determined
>>>> at runtime). I can't even use the IN operator (e.g. WHERE column_number IN
>>>> (1, 2, 3, ...)) because I am already using a non-EQ relation on the
>>>> timestamp key and Cassandra restricts me to only one non-EQ relation.
>>>>
>>>> In summary, I can (and will) adapt my code to use a similar approach
>>>> despite everything, but the goal of my message was mainly to understand why
>>>> the jira issues I linked above are not full of dozens of "+1" comments.
>>>>
>>>> To me this really feels like a terrible performance issue that should
>>>> be fixed by default (or in the very worst case clearly documented), even
>>>> after understanding the motivation for reading all the columns in the CQL
>>>> row.
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <
>>>> jack.krupan...@gmail.com> wrote:
>>>>
>>>>> You could add the column number as an additional clustering key. And
>>>>> then you can actually use COMPACT STORAGE for even more efficient storage
>>>>> and access (assuming ther

Re: Performance issues with "many" CQL columns

2016-02-14 Thread Gianluca Borello
Thanks again.

One clarification about "reading in a single SELECT": in my point 2, I
mentioned the need to read a variable subset of columns every time, usually
in the range of ~5 out of 30. I can't find a way to do that in a single
SELECT unless I use the IN operator (which I can't, as explained).

Is there any other method you were thinking of, or your "reading in a
single SELECT" is just applicable when I need to read the whole set of
columns (which is never my case, unfortunately)?

Thanks


On Sun, Feb 14, 2016 at 4:34 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> You can definitely read all of columns in a single SELECT. And the
> n-INSERTS can be batched and will insert fewer cells in the storage engine
> than the previous approach.
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <gianl...@sysdig.com>
> wrote:
>
>> Thank you for your reply.
>>
>> Your advice is definitely sound, although it still seems suboptimal to me
>> because:
>>
>> 1) It requires N INSERT queries from the application code (where N is the
>> number of columns)
>>
>> 2) It requires N SELECT queries from my application code (where N is the
>> number of columns I need to read at any given time, which is determined at
>> runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
>> 2, 3, ...)) because I am already using a non-EQ relation on the timestamp
>> key and Cassandra restricts me to only one non-EQ relation.
>>
>> In summary, I can (and will) adapt my code to use a similar approach
>> despite everything, but the goal of my message was mainly to understand why
>> the jira issues I linked above are not full of dozens of "+1" comments.
>>
>> To me this really feels like a terrible performance issue that should be
>> fixed by default (or in the very worst case clearly documented), even after
>> understanding the motivation for reading all the columns in the CQL row.
>>
>> Thanks
>>
>> On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <jack.krupan...@gmail.com
>> > wrote:
>>
>>> You could add the column number as an additional clustering key. And
>>> then you can actually use COMPACT STORAGE for even more efficient storage
>>> and access (assuming there is only  a single non-PK data column, the blob
>>> value.) You can then access (read or write) an individual column/blob or a
>>> slice of them.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I've just painfully discovered a "little" detail in Cassandra:
>>>> Cassandra touches all columns on a CQL select (related issues
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>>>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>>>
>>>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>>>> blob of data (~10-100KB) periodically. When reading, 99% of the times I'm
>>>> interested in a subportion of that blob of data across an arbitrary period
>>>> of time. What I do is simply splitting those blobs of data in about 30
>>>> logical units and write them in a CQL table such as:
>>>>
>>>> create table data (
>>>> id bigint,
>>>> ts bigint,
>>>> column1 blob,
>>>> column2 blob,
>>>> column3 blob,
>>>> ...
>>>> column29 blob,
>>>> column30 blob,
>>>> primary key (id, ts)
>>>> );
>>>>
>>>> id is a combination of the sensor id and a time bucket, in order to not
>>>> get the row too wide. Essentially, I thought this was a very legit data
>>>> model that helps me keep my application code very simple (because I can
>>>> work on a single table, I can write a split sensor blob in a single CQL
>>>> query and I can read a subset of the columns very efficiently with one
>>>> single CQL query).
>>>>
>>>> What I didn't realize is that Cassandra seems to always process all the
>>>> columns of the CQL row, regardless of the fact that my query asks just one
>>>> column, and this has dramatic effect on the performance of my reads.
>>>>
>>>> I wrote a simple isolated test case where I test how long it takes to
>>>> read one *single* column in a CQL table composed of severa

Re: Performance issues with "many" CQL columns

2016-02-14 Thread Gianluca Borello
Thank you for your reply.

Your advice is definitely sound, although it still seems suboptimal to me
because:

1) It requires N INSERT queries from the application code (where N is the
number of columns)

2) It requires N SELECT queries from my application code (where N is the
number of columns I need to read at any given time, which is determined at
runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
2, 3, ...)) because I am already using a non-EQ relation on the timestamp
key and Cassandra restricts me to only one non-EQ relation.

In summary, I can (and will) adapt my code to use a similar approach
despite everything, but the goal of my message was mainly to understand why
the jira issues I linked above are not full of dozens of "+1" comments.

To me this really feels like a terrible performance issue that should be
fixed by default (or in the very worst case clearly documented), even after
understanding the motivation for reading all the columns in the CQL row.

Thanks

On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> You could add the column number as an additional clustering key. And then
> you can actually use COMPACT STORAGE for even more efficient storage and
> access (assuming there is only  a single non-PK data column, the blob
> value.) You can then access (read or write) an individual column/blob or a
> slice of them.
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianl...@sysdig.com>
> wrote:
>
>> Hi
>>
>> I've just painfully discovered a "little" detail in Cassandra: Cassandra
>> touches all columns on a CQL select (related issues
>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>
>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>> blob of data (~10-100KB) periodically. When reading, 99% of the times I'm
>> interested in a subportion of that blob of data across an arbitrary period
>> of time. What I do is simply splitting those blobs of data in about 30
>> logical units and write them in a CQL table such as:
>>
>> create table data (
>> id bigint,
>> ts bigint,
>> column1 blob,
>> column2 blob,
>> column3 blob,
>> ...
>> column29 blob,
>> column30 blob,
>> primary key (id, ts)
>> );
>>
>> id is a combination of the sensor id and a time bucket, in order to not
>> get the row too wide. Essentially, I thought this was a very legit data
>> model that helps me keep my application code very simple (because I can
>> work on a single table, I can write a split sensor blob in a single CQL
>> query and I can read a subset of the columns very efficiently with one
>> single CQL query).
>>
>> What I didn't realize is that Cassandra seems to always process all the
>> columns of the CQL row, regardless of the fact that my query asks just one
>> column, and this has dramatic effect on the performance of my reads.
>>
>> I wrote a simple isolated test case where I test how long it takes to
>> read one *single* column in a CQL table composed of several columns (at
>> each iteration I add and populate 10 new columns), each filled with 1MB
>> blobs:
>>
>> 10 columns: 209 ms
>> 20 columns: 339 ms
>> 30 columns: 510 ms
>> 40 columns: 670 ms
>> 50 columns: 884 ms
>> 60 columns: 1056 ms
>> 70 columns: 1527 ms
>> 80 columns: 1503 ms
>> 90 columns: 1600 ms
>> 100 columns: 1792 ms
>>
>> In other words, even if the result set returned is exactly the same
>> across all these iteration, the response time increases linearly with the
>> size of the other columns, and this is really causing a lot of problems in
>> my application.
>>
>> By reading the JIRA issues, it seems like this is considered a very minor
>> optimization not worth the effort of fixing, so I'm asking: is my use case
>> really so anomalous that the horrible performance that I'm experiencing are
>> to be considered "expected" and need to be fixed with some painful column
>> family splitting and messy application code?
>>
>> Thanks
>>
>
>


Performance issues with "many" CQL columns

2016-02-14 Thread Gianluca Borello
Hi

I've just painfully discovered a "little" detail in Cassandra: Cassandra
touches all columns on a CQL select (related issues
https://issues.apache.org/jira/browse/CASSANDRA-6586,
https://issues.apache.org/jira/browse/CASSANDRA-6588,
https://issues.apache.org/jira/browse/CASSANDRA-7085).

My data model is fairly simple: I have a bunch of "sensors" reporting a
blob of data (~10-100KB) periodically. When reading, 99% of the time I'm
interested in a subportion of that blob of data across an arbitrary period
of time. What I do is simply splitting those blobs of data in about 30
logical units and write them in a CQL table such as:

create table data (
id bigint,
ts bigint,
column1 blob,
column2 blob,
column3 blob,
...
column29 blob,
column30 blob,
primary key (id, ts)
);

id is a combination of the sensor id and a time bucket, in order to not get
the row too wide. Essentially, I thought this was a very legit data model
that helps me keep my application code very simple (because I can work on a
single table, I can write a split sensor blob in a single CQL query and I
can read a subset of the columns very efficiently with one single CQL
query).
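
In other words, a typical read is a single query in the spirit of the
following, with the column list varying per request:

select column3, column7, column25 from data where id = SENSOR_ID and ts > x
and ts < y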

What I didn't realize is that Cassandra seems to always process all the
columns of the CQL row, regardless of the fact that my query asks just one
column, and this has dramatic effect on the performance of my reads.

I wrote a simple isolated test case where I test how long it takes to read
one *single* column in a CQL table composed of several columns (at each
iteration I add and populate 10 new columns), each filled with 1MB blobs:

10 columns: 209 ms
20 columns: 339 ms
30 columns: 510 ms
40 columns: 670 ms
50 columns: 884 ms
60 columns: 1056 ms
70 columns: 1527 ms
80 columns: 1503 ms
90 columns: 1600 ms
100 columns: 1792 ms

In other words, even if the result set returned is exactly the same across
all these iterations, the response time increases linearly with the size of
the other columns, and this is really causing a lot of problems in my
application.

By reading the JIRA issues, it seems like this is considered a very minor
optimization not worth the effort of fixing, so I'm asking: is my use case
really so anomalous that the horrible performance that I'm experiencing is
to be considered "expected" and needs to be fixed with some painful column
family splitting and messy application code?

Thanks


Re: Error on nodetool cleanup

2015-02-28 Thread Gianluca Borello
Thanks a lot for pointing this out! Yes, a workaround would be very much
appreciated, or an ETA for 2.0.13, so that I could decide whether or not
to go for an officially unsupported 2.0.12 to 2.0.11 downgrade, since I
really need that cleanup.

Thanks
On Feb 27, 2015 10:53 PM, Jeff Wehrwein j...@refresh.io wrote:

 We had the exact same problem, and found this bug:
 https://issues.apache.org/jira/browse/CASSANDRA-8716.  It's fixed in
 2.0.13 (unreleased), but we haven't found a workaround for the interim.
 Please share if you find one!

 Thanks,
 Jeff

 On Fri, Feb 27, 2015 at 6:01 PM, Gianluca Borello gianl...@draios.com
 wrote:

 Hello,

 I have a cluster of four nodes running 2.0.12. I added one more node and
 then went on with the cleanup procedure on the other four nodes, but I get
 this error (the same error on each node):

  INFO [CompactionExecutor:10] 2015-02-28 01:55:15,097
 CompactionManager.java (line 619) Cleaned up to
 /raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-tmp-jb-432-Data.db.
  8,253,257 to 8,253,257 (~100% of original) bytes for 5 keys.  Time: 304ms.
  INFO [CompactionExecutor:10] 2015-02-28 01:55:15,100
 CompactionManager.java (line 563) Cleaning up
 SSTableReader(path='/raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-jb-431-Data.db')
 ERROR [CompactionExecutor:10] 2015-02-28 01:55:15,102
 CassandraDaemon.java (line 199) Exception in thread
 Thread[CompactionExecutor:10,1,main]
 java.lang.AssertionError: Memory was freed
 at
 org.apache.cassandra.io.util.Memory.checkPosition(Memory.java:259)
 at org.apache.cassandra.io.util.Memory.getInt(Memory.java:211)
 at
 org.apache.cassandra.io.sstable.IndexSummary.getIndex(IndexSummary.java:79)
 at
 org.apache.cassandra.io.sstable.IndexSummary.getKey(IndexSummary.java:84)
 at
 org.apache.cassandra.io.sstable.IndexSummary.binarySearch(IndexSummary.java:58)
 at
 org.apache.cassandra.io.sstable.SSTableReader.getIndexScanPosition(SSTableReader.java:602)
 at
 org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:947)
 at
 org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:910)
 at
 org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:819)
 at
 org.apache.cassandra.db.ColumnFamilyStore.getExpectedCompactedFileSize(ColumnFamilyStore.java:1088)
 at
 org.apache.cassandra.db.compaction.CompactionManager.doCleanupCompaction(CompactionManager.java:564)
 at
 org.apache.cassandra.db.compaction.CompactionManager.access$400(CompactionManager.java:63)
 at
 org.apache.cassandra.db.compaction.CompactionManager$5.perform(CompactionManager.java:281)
 at
 org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:225)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
  INFO [FlushWriter:1] 2015-02-28 01:55:15,111 Memtable.java (line 398)
 Completed flushing
 /raid0/cassandra/data/draios/mounted_fs_by_agent1/draios-mounted_fs_by_agent1-jb-132895-Data.db
 (2513856 bytes) for commitlog position
 ReplayPosition(segmentId=1425088070445, position=2041)

 This happens with all column families, and they are not particularly big
 if that matters.

 How can I reclaim the free space for which I expanded the cluster in the
 first place?

 Thank you





Error on nodetool cleanup

2015-02-27 Thread Gianluca Borello
Hello,

I have a cluster of four nodes running 2.0.12. I added one more node and
then went on with the cleanup procedure on the other four nodes, but I get
this error (the same error on each node):

 INFO [CompactionExecutor:10] 2015-02-28 01:55:15,097
CompactionManager.java (line 619) Cleaned up to
/raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-tmp-jb-432-Data.db.
 8,253,257 to 8,253,257 (~100% of original) bytes for 5 keys.  Time: 304ms.
 INFO [CompactionExecutor:10] 2015-02-28 01:55:15,100
CompactionManager.java (line 563) Cleaning up
SSTableReader(path='/raid0/cassandra/data/draios/protobuf86400/draios-protobuf86400-jb-431-Data.db')
ERROR [CompactionExecutor:10] 2015-02-28 01:55:15,102 CassandraDaemon.java
(line 199) Exception in thread Thread[CompactionExecutor:10,1,main]
java.lang.AssertionError: Memory was freed
at
org.apache.cassandra.io.util.Memory.checkPosition(Memory.java:259)
at org.apache.cassandra.io.util.Memory.getInt(Memory.java:211)
at
org.apache.cassandra.io.sstable.IndexSummary.getIndex(IndexSummary.java:79)
at
org.apache.cassandra.io.sstable.IndexSummary.getKey(IndexSummary.java:84)
at
org.apache.cassandra.io.sstable.IndexSummary.binarySearch(IndexSummary.java:58)
at
org.apache.cassandra.io.sstable.SSTableReader.getIndexScanPosition(SSTableReader.java:602)
at
org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:947)
at
org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:910)
at
org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:819)
at
org.apache.cassandra.db.ColumnFamilyStore.getExpectedCompactedFileSize(ColumnFamilyStore.java:1088)
at
org.apache.cassandra.db.compaction.CompactionManager.doCleanupCompaction(CompactionManager.java:564)
at
org.apache.cassandra.db.compaction.CompactionManager.access$400(CompactionManager.java:63)
at
org.apache.cassandra.db.compaction.CompactionManager$5.perform(CompactionManager.java:281)
at
org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:225)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 INFO [FlushWriter:1] 2015-02-28 01:55:15,111 Memtable.java (line 398)
Completed flushing
/raid0/cassandra/data/draios/mounted_fs_by_agent1/draios-mounted_fs_by_agent1-jb-132895-Data.db
(2513856 bytes) for commitlog position
ReplayPosition(segmentId=1425088070445, position=2041)

This happens with all column families, and they are not particularly big if
that matters.

How can I reclaim the free space for which I expanded the cluster in the
first place?
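
For context, the command that triggers the error is simply the standard
post-bootstrap cleanup, run on each of the four pre-existing nodes (here
scoped to the draios keyspace):

$ nodetool cleanup draios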

Thank you


Re: Wide rows best practices and GC impact

2014-12-03 Thread Gianluca Borello
Thanks Robert, I really appreciate your help!

I'm still unsure why Cassandra 2.1 seems to perform much better in that same
scenario (even setting the same values of compaction threshold and number
of compactors), but I guess we'll revisit this when we decide to upgrade to
2.1 in production.

On Dec 3, 2014 6:33 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Dec 2, 2014 at 5:01 PM, Gianluca Borello gianl...@draios.com
wrote:

 We mainly store time series-like data, where each data point is a binary
blob of 5-20KB. We use wide rows, and try to put in the same row all the
data that we usually need in a single query (but not more than that). As a
result, our application logic is very simple (since we have to do just one
query to read the data on average) and read/write response times are very
satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:


 100mb is not HYOOOGE but is around the size where large rows can cause
heap pressure.

 You seem to be unclear on the implications of pending compactions,
however.

 Briefly, pending compactions indicate that you have more SSTables than
you should. As compaction both merges row versions and reduces the number
of SSTables, a high number of pending compactions causes problems
associated with both having too many row versions (fragmentation) and a
large number of SSTables (per-SSTable heap/memory (depending on version)
overhead like bloom filters and index samples). In your case, it seems the
problem is probably just the compaction throttle being too low.

 My conjecture is that, given your normal data size and read/write
workload, you are relatively close to GC pre-fail when compaction is
working. When it stops working, you relatively quickly get into a state
where you exhaust heap because you have too many SSTables.

 =Rob
 http://twitter.com/rcolidba
 PS - Given 30GB of RAM on the machine, you could consider investigating
large-heap configurations; rbranson from Instagram has some slides out
there on the topic. What you pay is longer stop-the-world GCs, IOW latency
if you happen to be talking to a replica node when it pauses.



Wide rows best practices and GC impact

2014-12-02 Thread Gianluca Borello
Hi,

We have a cluster (2.0.11) of 6 nodes (RF=3), c3.4xlarge instances, about
50 column families. The Cassandra heap takes 8GB out of the 30GB on every
instance.

We mainly store time series-like data, where each data point is a binary
blob of 5-20KB. We use wide rows, and try to put in the same row all the
data that we usually need in a single query (but not more than that). As a
result, our application logic is very simple (since we have to do just one
query to read the data on average) and read/write response times are very
satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:

SSTables per Read
1 sstables: 3198856
2 sstables: 45

Write Latency (microseconds)
  4 us: 37
  5 us: 1247
  6 us: 9987
  7 us: 31442
  8 us: 66121
 10 us: 400503
 12 us: 1158329
 14 us: 2873934
 17 us: 11843616
 20 us: 24464275
 24 us: 30574717
 29 us: 24351624
 35 us: 16788801
 42 us: 3935374
 50 us: 797781
 60 us: 272160
 72 us: 121819
 86 us: 64641
103 us: 41085
124 us: 33618
149 us: 199463
179 us: 255445
215 us: 38238
258 us: 12300
310 us: 5307
372 us: 3180
446 us: 2443
535 us: 1773
642 us: 1314
770 us: 991
924 us: 748
   1109 us: 606
   1331 us: 465
   1597 us: 433
   1916 us: 453
   2299 us: 484
   2759 us: 983
   3311 us: 976
   3973 us: 338
   4768 us: 312
   5722 us: 237
   6866 us: 198
   8239 us: 163
   9887 us: 138
  11864 us: 115
  14237 us: 231
  17084 us: 550
  20501 us: 603
  24601 us: 635
  29521 us: 875
  35425 us: 731
  42510 us: 497
  51012 us: 476
  61214 us: 347
  73457 us: 331
  88148 us: 273
 105778 us: 143
 126934 us: 92
 152321 us: 47
 182785 us: 16
 219342 us: 5
 263210 us: 2
 315852 us: 2
 379022 us: 1
 454826 us: 1
 545791 us: 1
 654949 us: 0
 785939 us: 0
 943127 us: 1
1131752 us: 1

Read Latency (microseconds)
 20 us: 1
 24 us: 9
 29 us: 18
 35 us: 96
 42 us: 6989
 50 us: 113305
 60 us: 552348
 72 us: 772329
 86 us: 654019
103 us: 578404
124 us: 300364
149 us: 111522
179 us: 37385
215 us: 18353
258 us: 10733
310 us: 7915
372 us: 9406
446 us: 7645
535 us: 2773
642 us: 1323
770 us: 1351
924 us: 953
   1109 us: 857
   1331 us: 1122
   1597 us: 800
   1916 us: 806
   2299 us: 686
   2759 us: 581
   3311 us: 671
   3973 us: 318
   4768 us: 318
   5722 us: 226
   6866 us: 164
   8239 us: 161
   9887 us: 134
  11864 us: 125
  14237 us: 184
  17084 us: 285
  20501 us: 315
  24601 us: 378
  29521 us: 431
  35425 us: 468
  42510 us: 469
  51012 us: 466
  61214 us: 407
  73457 us: 337
  88148 us: 297
 105778 us: 242
 126934 us: 135
 152321 us: 109
 182785 us: 57
 219342 us: 41
 263210 us: 28
 315852 us: 16
 379022 us: 12
 454826 us: 6
 545791 us: 6
 654949 us: 0
 785939 us: 0
 943127 us: 0
1131752 us: 2

Partition Size (bytes)
3311 bytes: 1
3973 bytes: 2
4768 bytes: 0
5722 bytes: 2
6866 bytes: 0
8239 bytes: 0
9887 bytes: 2
   11864 bytes: 1
   14237 bytes: 0
   17084 bytes: 0
   20501 bytes: 0
   24601 bytes: 0
   29521 bytes: 3
   35425 bytes: 0
   42510 bytes: 1
   51012 bytes: 1
   61214 bytes: 1
   73457 bytes: 3
   88148 bytes: 1
  105778 bytes: 5
  126934 bytes: 2
  152321 bytes: 4
  182785 bytes: 65
  219342 bytes: 165
  263210 bytes: 268
  315852 bytes: 201
  379022 bytes: 30
  454826 bytes: 248
  545791 bytes: 16
  654949 bytes: 41
  785939 bytes: 259
  943127 bytes: 547
 1131752 bytes: 243
 1358102 bytes: 176
 1629722 bytes: 59
 1955666 bytes: 37
 2346799 bytes: 41
 2816159 bytes: 78
 3379391 bytes: 243
 4055269 bytes: 122
 4866323 bytes: 209
 5839588 bytes: 220
 7007506 bytes: 266
 8409007 bytes: 77
10090808 bytes: 103
12108970 bytes: 1
14530764 bytes: 2
17436917 bytes: 7
20924300 bytes: 410
25109160 bytes: 76

Cell Count per Partition
3 cells: 5
4 cells: 0
5 cells: 0
6 cells: 2
7 cells: 0
8 cells: 0
   10 cells: 2
   12 cells: 1
   14 cells: 0
   17 cells: 0
   20 cells: 1
   24 cells: 3
   29 cells: 1
   35 cells: 1
   42 cells: 0
   50 cells: 0
   60 cells: 3
   72 cells: 0
   86 cells: 1
  103 cells: 0
  124 cells: 11
  149 cells: 3
  179 cells: 4
  215 cells: 10
  258 cells: 13
  310 cells: 2181
  372 cells: 2
  446 cells: 2
  535 cells: 2
  642 cells: 4
  770 cells: 7
  924 cells: 488
 1109 cells: 3
 1331 cells: 24
 1597 cells: 143
 1916 cells: 332
 2299 cells: 2
 2759 cells: 5
 3311 cells: 483
 3973 cells: 0
 4768 cells: 2
 5722 cells: 1
 6866 cells: 1
 8239 cells: 0
 9887 cells: 2
11864 cells: 244
14237 cells: 1
17084 cells: 248
20501 cells: 1
24601 cells: 1
29521 cells: 1
35425 cells: 2
42510 cells: 1
51012 cells: 2
61214 cells: 237


Read Count: 3202919
Read Latency: 0.16807454013042478 ms.
Write Count: 118568574
Write Latency: 0.026566498615391967 ms.
Pending Tasks: 0
  Table: protobuf_by_agent1
  SSTable count: 49
  SSTables in each level: [1, 11/10, 37, 0, 0, 0, 0, 0, 0]
  Space used (live), bytes: 6934395462
  

How to avoid column family duplication (when query requires multiple restrictions)

2014-09-22 Thread Gianluca Borello
Hi,

I have a column family storing very large blobs that I would not like to
duplicate, if possible.
Here's a simplified version:

CREATE TABLE timeline (
   key text,
   a int,
   b int,
   value blob,
   PRIMARY KEY (key, a, b)
);

On this, I run exactly two types of query. Both of them must have a query
range on 'a', and just one must have 'b' restricted.

First query:

cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3;

This one runs fine.

Second query:

cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3 and
b = 12;
code=2200 [Invalid query] message=PRIMARY KEY column b cannot be
restricted (preceding column ColumnDefinition{name=a,
type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN,
componentIndex=0, indexName=null, indexType=null} is either not restricted
or by a non-EQ relation)

This fails. Even if I create an index:

CREATE INDEX timeline_b ON timeline (b);
cqlsh> SELECT * FROM timeline where key = 'event' and a >= 2 and a <= 3 and
b = 12;
code=2200 [Invalid query] message=Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability, use
ALLOW FILTERING

I solved this problem by duplicating the column family (in timeline_by_a
and timeline_by_b where a and b are in opposite order), but I'm wondering
if there's a better solution, as this tends to grow pretty big.
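
For reference, the duplicated table looks roughly like this (just a
sketch; the point is that 'b' comes before 'a' in the clustering order,
so the second query becomes a straight slice):

CREATE TABLE timeline_by_b (
    key text,
    b int,
    a int,
    value blob,
    PRIMARY KEY (key, b, a)
);

cqlsh> SELECT * FROM timeline_by_b where key = 'event' and b = 12 and
a >= 2 and a <= 3;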

In particular, from my limited understanding of the Cassandra internals,
it seems like even the second query should be fairly efficient, given that
the clustering columns are stored in sorted order on disk, so I don't
understand why ALLOW FILTERING is required.
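
For contrast, if 'a' is restricted by equality instead of a range,
restricting 'b' is accepted without any index or ALLOW FILTERING (just a
sanity check, not a workaround for my case, since I need the range on 'a'):

cqlsh> SELECT * FROM timeline where key = 'event' and a = 2 and b = 12;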

Another alternative I'm considering is keeping a separate column family
that serves as an index and managing it manually in the application.
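
Something like this is what I have in mind (the table name is
hypothetical; it stores only the coordinates, not the blob, and the
application would run a second query against 'timeline' using the 'a'
values it gets back):

CREATE TABLE timeline_b_index (
    key text,
    b int,
    a int,
    PRIMARY KEY (key, b, a)
);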

Thanks.