Re: Compaction Strategy guidance

2014-11-23 Thread Andrei Ivanov
Stephane,

We have a somewhat similar C* load profile, hence some comments
in addition to Nikolai's answer.
1. Fallback to STCS - you can actually disable it.
2. Based on our experience, if you have a lot of data per node, LCS
may work just fine. That is, until the moment you decide to add
another node - chances are that the newly added node will not be able
to compact what it gets from the old nodes. In your case, if you switch
strategy, the same thing may happen. This is all due to the limitations
mentioned by Nikolai.

Andrei


On Sun, Nov 23, 2014 at 8:51 AM, Servando Muñoz G. smg...@gmail.com wrote:
 ABUSE



 I DON'T WANT ANY MORE MAILS, I AM FROM MEXICO

 From: Nikolai Grigoriev [mailto:ngrigor...@gmail.com]
 Sent: Saturday, November 22, 2014 7:13 PM
 To: user@cassandra.apache.org
 Subject: Re: Compaction Strategy guidance
 Importance: High



 Stephane,

 Like everything good, LCS comes at a certain price.

 LCS will put the most load on your I/O system (if you use spindles, you may
 need to be careful about that) and on CPU. Also, LCS (by default) may fall
 back to STCS if it is falling behind (which is very possible with heavy write
 activity), and this will result in higher disk space usage. LCS also has a
 certain limitation I have discovered lately. Sometimes LCS may not be able
 to use all of your node's resources (algorithm limitations), and this reduces
 the overall compaction throughput. This may happen if you have a large
 column family with lots of data per node. STCS won't have this limitation.

 By the way, the primary goal of LCS is to reduce the number of sstables C*
 has to look at to find your data. With LCS functioning properly, this number
 will most likely be between 1 and 3 for most reads.
 But if you do few reads and are not concerned about latency today, LCS will
 most likely only save you some disk space.



 On Sat, Nov 22, 2014 at 6:25 PM, Stephane Legay sle...@looplogic.com
 wrote:

 Hi there,



 use case:



 - Heavy write app, few reads.

 - Lots of updates of rows / columns.

 - Current performance is fine, for both writes and reads.

 - Currently using SizeTieredCompactionStrategy



 We're trying to limit the amount of storage used during compaction. Should
 we switch to LeveledCompactionStrategy?



 Thanks




 --

 Nikolai Grigoriev
 (514) 772-5178


RE: A questiion to adding a new data center

2014-11-23 Thread Lu, Boying
This is what I want to know.

Thanks a lot ☺

From: Mark Reddy [mailto:mark.l.re...@gmail.com]
Sent: 21 November 2014 18:07
To: user@cassandra.apache.org
Subject: Re: A questiion to adding a new data center

Hi Boying,

I'm not sure I fully understand your question here, so some clarification may 
be needed. However, if you are asking what steps need to be performed on the 
current datacenter or on the new datacenter:

Step 1 - Current DC
Step 2 - New DC
Step 3 - Depending on the snitch you may need to make changes on both the 
current and new DCs
Step 4 - Client config
Step 5 - Client config
Step 6 - New DC
Step 7 - New DC
Step 8 - New DC


Mark

On 21 November 2014 03:27, Lu, Boying boying...@emc.com wrote:
Hi, all,

I read the document about how to add a new data center to an existing cluster, 
posted at 
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html
But I have a question: are all of those steps executed only on the newly added 
cluster, or on the existing clusters as well? (Step 7 is to be executed on the 
new cluster according to the document.)

Thanks

Boying




Getting the counters with the highest values

2014-11-23 Thread Robert Wille
I’m working on moving a bunch of counters out of our relational database to 
Cassandra. For the most part, Cassandra is a very nice fit, except for one 
feature on our website. We manage a time series of view counts for each 
document, and display a list of the most popular documents in the last seven 
days. This seems like a pretty strong anti-pattern for Cassandra, but also 
seems like something a lot of people would want to do. If you’re keeping 
counters, it's pretty likely that you’d want to know which ones have the highest 
counts. 

Here’s what I came up with to implement this feature. Create a counter table 
with primary key (doc_id, day) and a single counter. Whenever a document is 
viewed, increment the counter for the document for today and the previous six 
days. Sometime after midnight each day, compile the counters into a table with 
primary key (day, count, doc_id) and no additional columns. For each partition 
in the counter table, I would sum up the counters, delete any counters that are 
over a week old, and put the sum into the second table with day = today. When I 
query the table, I would ask for data where day = yesterday. During the 
compilation process, I would delete old partitions. In theory I’d only need two 
partitions. One that is being built, and one for querying.

I’d be interested to hear critiques on this strategy, as well as hearing how 
other people have implemented a most-popular feature using Cassandra counters.

Robert
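
As a rough illustration of the nightly compilation step described above, here is a minimal in-memory sketch in Python. It is not Cassandra code: a plain dict stands in for the counter table keyed by (doc_id, day), and the returned list stands in for one partition of the second table ordered by count. It also assumes, as a simplification, that each view increments only the counter for the day it happened on. The names compile_most_popular, window_days and top_n are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def compile_most_popular(view_counts, today, window_days=7, top_n=10):
    """Sum per-day counters over the window and rank documents by total.

    view_counts maps (doc_id, day) -> count and stands in for the counter
    table. Entries older than the window are deleted in place, mirroring
    the "delete any counters that are over a week old" step.
    """
    cutoff = today - timedelta(days=window_days)

    # Sum the counters that fall inside the window (cutoff, today].
    totals = defaultdict(int)
    for (doc_id, day), count in view_counts.items():
        if cutoff < day <= today:
            totals[doc_id] += count

    # Drop counters that have aged out of the window.
    expired = [key for key in view_counts if key[1] <= cutoff]
    for key in expired:
        del view_counts[key]

    # Highest count first; doc_id breaks ties so the order is stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(count, doc_id) for doc_id, count in ranked[:top_n]]


if __name__ == "__main__":
    views = {
        ("a", date(2014, 11, 23)): 5,
        ("a", date(2014, 11, 20)): 2,
        ("b", date(2014, 11, 22)): 6,
        ("c", date(2014, 11, 10)): 100,  # older than a week, ignored
    }
    print(compile_most_popular(views, date(2014, 11, 23)))
```

In a real deployment the ranking would be written to the (day, count, doc_id) table rather than returned, and the summing would have to walk the counter table partition by partition.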



Re: Compaction Strategy guidance

2014-11-23 Thread Nikolai Grigoriev
Just to clarify - when I was talking about a large amount of data I
really meant a large amount of data per node in a single CF (table). LCS does
not seem to like it when it gets thousands of sstables (makes 4-5 levels).

When bootstrapping a new node you'd better enable the option from
CASSANDRA-6621 (the one that disables STCS in L0). But it will still be a
mess - I have a node that I bootstrapped ~2 weeks ago. Initially it
had 7.5K pending compactions; now it has almost stabilized at 4.6K and does
not go down. The number of sstables at L0 is over 11K and it is very slowly
building the upper levels. The total number of sstables is 4x the normal amount.
Now I am not entirely sure if this node will ever get back to normal life.
And believe me - this is not because of I/O; I have SSDs everywhere and 16
physical cores. This machine is barely using 1-3 cores most of the time.
The problem is that allowing STCS fallback is not a good option either - it
will quickly result in a few 200GB+ sstables in my configuration, and then
these sstables will never be compacted. Plus, it will require close to 2x
disk space on EVERY disk in my JBOD configuration... this will kill the node
sooner or later. This is all because all sstables end up at L0 after bootstrap
and then the process very slowly moves them to other levels. If you have
write traffic to that CF then the number of sstables at L0 will grow
quickly - like it is happening in my case now.

Once something like https://issues.apache.org/jira/browse/CASSANDRA-8301 is
implemented it may be better.
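
To make the "thousands of sstables (makes 4-5 levels)" remark concrete, here is a small Python sketch of the level arithmetic. It assumes the usual LCS layout in which L1 holds up to 10 sstables and each higher level holds 10 times more than the one below it; the function name and the fanout parameter are illustrative and not part of any Cassandra API.

```python
def lcs_levels_needed(sstable_count, fanout=10):
    """Return how many levels (L1..Ln) are needed to hold sstable_count.

    Assumes level k can hold fanout**k sstables, so capacity grows as
    10, 110, 1110, 11110, ... with the default fanout of 10.
    """
    levels = 0
    capacity = 0
    while capacity < sstable_count:
        levels += 1
        capacity += fanout ** levels
    return levels


if __name__ == "__main__":
    for count in (100, 5000, 11000, 20000):
        print(count, "sstables ->", lcs_levels_needed(count), "levels above L0")
```

Under that assumption, a few thousand sstables already need four levels above L0, which is in line with the 4-5 levels mentioned above once L0 itself is counted.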


-- 
Nikolai Grigoriev
(514) 772-5178


Re: Compaction Strategy guidance

2014-11-23 Thread Jean-Armel Luce
Hi Nikolai,

Thanks for that information.

Please could you clarify a little bit what you call 



Re: Compaction Strategy guidance

2014-11-23 Thread Jean-Armel Luce
Hi Nikolai,

Could you please clarify what you mean by a large amount of data?

How many tables?
How many rows in your largest table?
How many GB in your largest table?
How many GB per node?

Thanks.







Re: Compaction Strategy guidance

2014-11-23 Thread Andrei Ivanov
Jean-Armel,

I have the same problem/state as Nikolai. Here are my stats:
~ 1 table
~ 10B records
~ 2TB/node x 6 nodes

Nikolai,
I'm wondering whether switching to a larger sstable_size_in_mb
(say 4096 or 8192) with LCS may be a solution, even if
not an absolutely permanent one?
As for huge sstables, I already have some that are 400-500GB. The
only way I can think of to compact them in the future is to split them
offline at some point. Does that make sense?

(I'm still doing a test drive and really need to understand how we are
going to handle that in production.)

Andrei.
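
A back-of-the-envelope Python sketch of what a larger sstable_size_in_mb would mean for the figures above. It only divides the data per node by the target sstable size, so it ignores compression, overlap in L0 and everything else a real node does. The 160 MB value is assumed here as the default for comparison, and the keyspace and table names in the generated CQL are placeholders.

```python
import math


def sstables_for(data_mb, sstable_size_mb):
    """Rough sstable count if all data sat in sstables of the target size."""
    return math.ceil(data_mb / sstable_size_mb)


def lcs_alter_statement(keyspace, table, sstable_size_mb):
    """Build the CQL that sets LCS with a given sstable_size_in_mb.

    keyspace and table are placeholders supplied by the caller.
    """
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = "
        f"{{'class': 'LeveledCompactionStrategy', "
        f"'sstable_size_in_mb': {sstable_size_mb}}};"
    )


if __name__ == "__main__":
    data_per_node_mb = 2 * 1024 * 1024  # ~2TB per node, as quoted above
    for size_mb in (160, 4096, 8192):
        print(size_mb, "MB ->", sstables_for(data_per_node_mb, size_mb), "sstables")
    print(lcs_alter_statement("my_keyspace", "my_table", 4096))
```

With these inputs the count drops from roughly thirteen thousand sstables to a few hundred, which is the effect the question is asking about; whether that actually helps compaction keep up is exactly what remains to be tested.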


