Local_serial >> Adding nodes

2017-06-06 Thread vasu gunja
Hi All,

We have a 2 DC setup, each consisting of 20-odd nodes, and we recently
decided to add 6 more nodes to DC1. We are using LWTs, and the application
drivers are configured to use LOCAL_SERIAL.
As we are adding multiple nodes at a time, we used the option
"-Dcassandra.consistent.rangemovement=false"
and added the nodes with a gap of 10 minutes between each.
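
For reference, the flag is passed as a JVM system property at startup; a minimal
sketch of how it can be set, assuming a stock cassandra-env.sh:

    # appended to conf/cassandra-env.sh before starting each new node
    JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"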

We are facing a lot of timeouts: more than 30k transactions timed out over a
period of 8 hours. Has anyone run into the same issue? Are we doing something wrong?



Thanks,
vasu


Reg:- Multi DC Configuration

2017-06-06 Thread @Nandan@
Hi,

I am trying to set up Cassandra 3.9 across multiple DCs.
Currently I have 2 DCs, with 3 and 2 nodes respectively.

DC1 Name :- India
Nodes :- 192.16.0.1 , 192.16.0.2, 192.16.0.3
DC2 Name :- USA
Nodes :- 172.16.0.1 , 172.16.0.2

Please help me understand which files I need to change to configure
multi-DC successfully.

I am using Ubuntu 16.04 Operating System.
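
For reference, the files usually involved are cassandra.yaml (endpoint_snitch) and
cassandra-rackdc.properties on each node, plus the keyspace replication settings.
A minimal sketch assuming GossipingPropertyFileSnitch; the rack names and the
example keyspace/RF values below are made up:

    # cassandra.yaml, on every node
    endpoint_snitch: GossipingPropertyFileSnitch

    # conf/cassandra-rackdc.properties on the India nodes
    dc=India
    rack=rack1

    # conf/cassandra-rackdc.properties on the USA nodes
    dc=USA
    rack=rack1

    -- per keyspace, from cqlsh (example replication factors)
    CREATE KEYSPACE example_ks WITH replication =
        {'class': 'NetworkTopologyStrategy', 'India': 3, 'USA': 2};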

Thanks and Best Regards,
Nandan Priyadarshi


Re: Order by for aggregated values

2017-06-06 Thread Nate McCall
>
>
> My application is a real-time application. It monitors devices in the
> network and displays the top N devices for various parameters averaged over
> a time period. A query may involve anywhere from 10 to 50k devices, and
> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
> seconds.
>
>
>
> My impression was that Spark is aimed at larger scale analytics.
>
>
>
> I am ok with the limitation on “group by”. I am intending to use async
> queries and token-aware load balancing to partition the query and execute
> it in parallel on each node.
>
>
>

This sounds a lot more like a use case for a streaming system (run in
parallel with Cassandra).

Apache Flink might be one avenue to explore - their Cassandra integration
works fine, btw.

A lot of folks are doing similar things with Apache Beam as well as it has
quite an elegant paradigm for the use case you describe, particularly if
you need to combine batching with streaming. (FYI, their "CassandraIO" is
about to be merged in master:
https://github.com/apache/beam/pull/592#issuecomment-306618338).


-- 
-
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Order by for aggregated values

2017-06-06 Thread Jeff Jirsa


On 2017-06-05 19:00 (-0700), "Roger Fischer (CW)"  wrote: 
> Hello,
> 
> is there any intent to support "order by" and "limit" on aggregated values?
> 
> For time series data, top n queries are quite common. Group-by was the first 
> step towards supporting such queries, but ordering by value and limiting the 
> results are also required.
> 

For people interested in reading some related background:

https://issues.apache.org/jira/browse/CASSANDRA-10707 (GROUP BY)
https://issues.apache.org/jira/browse/CASSANDRA-11871 (Time series aggregation)

Distributed sorting/ordering/limits can be hard, but they're not impossible. If 
someone comes up with a way to do it efficiently, I'm sure the project would 
love to see it included. 

In the past, we've had issues where features were like landmines, they perhaps 
worked for a small subset of use cases, and then became sore points for other 
users (features like secondary indexes and old style counters). Since then, a 
lot of committers tend to only want to include features if they know they can 
scale to massive, busy clusters - because we know what hasn't worked in the 
past, and what sort of problems have been caused for innocent users. I hope 
there will eventually be a middle ground where we can be OK with stripping down 
implementations to support imperfect features on real clusters, as long as it 
doesn't cause things to blow up for people. I'm not confident this is such a 
feature that can be reasonably pared down, but perhaps someone will suggest 
a way to do it such that it can be included, even if it's not 100% compatible 
with SQL semantics. 





Re: Partition range incremental repairs

2017-06-06 Thread Jonathan Haddad
I can't recommend *anyone* use incremental repair, as there are some pretty
horrible bugs in it that can cause Merkle trees to wildly mismatch & result
in massive overstreaming.  Check out
https://issues.apache.org/jira/browse/CASSANDRA-9143.

TL;DR: Do not use incremental repair before 4.0.
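
For reference, a rough sketch of the commands being compared in this thread
(flag names as in 2.2-era nodetool, and the keyspace name is a placeholder;
check nodetool help repair for your version):

    # incremental repair (the default from Cassandra 2.2 onwards)
    nodetool repair my_keyspace

    # full repair restricted to the node's primary ranges
    nodetool repair -full -pr my_keyspace

    # full repair of every range the node replicates
    nodetool repair -full my_keyspace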

On Tue, Jun 6, 2017 at 9:54 AM Anuj Wadehra 
wrote:

> Hi Chris,
>
> Can your share following info:
>
> 1. Exact repair commands you use for inc repair and pr repair
>
> 2. Repair time should be measured at cluster level for inc repair. So,
> whats the total time it takes to run repair on all nodes for incremental vs
> pr repairs?
>
> 3. You are repairing one dc DC3. How many DCs are there in total and whats
> the RF for keyspaces? Running pr on a specific dc would not repair entire
> data.
>
> 4. 885 ranges? From where did you get this number? Logs? Can you share the
> number ranges printed in logs for both inc and pr case?
>
>
> Thanks
> Anuj
>
>
> Sent from Yahoo Mail on Android
> 
>
> On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore
>
>  wrote:
> Thank you for the excellent and clear description of the different
> versions of repair Anuj, that has cleared up what I expect to be happening.
>
> The problem now is in our cluster, we are running repairs with options
> (parallelism: parallel, primary range: false, incremental: true, job
> threads: 1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges:
> 885) and when we do our repairs are taking over a day to complete when
> previously when running with the partition range option they were taking
> more like 8-9 hours.
>
> As I understand it, using incremental should have sped this process up as
> all three sets of data on each repair job should be marked as repaired
> however this does not seem to be the case. Any ideas?
>
> Chris
>
> On 6 Jun 2017, at 16:08, Anuj Wadehra 
> wrote:
>
> Hi Chris,
>
> Using pr with incremental repairs does not make sense. Primary range
> repair is an optimization over full repair. If you run full repair on a n
> node cluster with RF=3, you would be repairing each data thrice.
> E.g. in a 5 node cluster with RF=3, a range may exist on node A,B and C .
> When full repair is run on node A, the entire data in that range gets
> synced with replicas on node B and C. Now, when you run full repair on
> nodes B and C, you are wasting resources on repairing data which is already
> repaired.
>
> Primary range repair ensures that when you run repair on a node, it ONLY
> repairs the data which is owned by the node. Thus, no node repairs data
> which is not owned by it and must be repaired by other node. Redundant work
> is eliminated.
>
> Even in pr, each time you run pr on all nodes, you repair 100% of data.
> Why to repair complete data in each cycle?? ..even data which has not even
> changed since the last repair cycle?
>
> This is where Incremental repair comes as an improvement. Once repaired, a
> data would be marked repaired so that the next repair cycle could just
> focus on repairing the delta. Now, lets go back to the example of 5 node
> cluster with RF =3.This time we run incremental repair on all nodes. When
> you repair entire data on node A, all 3 replicas are marked as repaired.
> Even if you run inc repair on all ranges on the second node, you would not
> re-repair the already repaired data. Thus, there is no advantage of
> repairing only the data owned by the node (primary range of the node). You
> can run inc repair on all the data present on a node and Cassandra would
> make sure that when you repair data on other nodes, you only repair
> unrepaired data.
>
> Thanks
> Anuj
>
>
>
> Sent from Yahoo Mail on Android
> 
>
> On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore
>  wrote:
> Hi all,
>
> Wondering if anyone had any thoughts on this? At the moment the long
> running repairs cause us to be running them on two nodes at once for a bit
> of time, which obviously increases the cluster load.
>
> On 2017-05-25 16:18 (+0100), Chris Stokesmore 
> wrote:
> > Hi,>
> >
> > We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been
> running repairs with the -pr option, via a cron job that runs on each node
> once per week.>
> >
> > We changed that as some advice on the Cassandra IRC channel said it
> would cause more anticompaction and  
> http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
> says 'Performing partitioner range repairs by using the -pr option is
> generally considered a good choice for doing manual repairs. However, this
> option cannot be used with incremental repairs (default for Cassandra 2.2
> and later)'
> >
> > Only problem is our -pr repairs were taking about 8 hours, and now the
> non-pr repair are taking 24+ - I guess 

Re: Understanding the limitation to only one non-PK column in MV-PK

2017-06-06 Thread DuyHai Doan
All the explanation for why just 1 non PK column can be used as PK for MV
is here:

https://skillsmatter.com/skillscasts/7446-cassandra-udf-and-materialised-views-in-depth

Skip to 19:18 for the explanation
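
For anyone skimming the thread, the restriction under discussion is the one hit
by view definitions along these lines (a minimal CQL sketch for 3.x; table and
view names are made up):

    CREATE TABLE base (p int PRIMARY KEY, a int, b int);

    -- allowed: exactly one non-PK column (a) added to the view's primary key
    CREATE MATERIALIZED VIEW base_by_a AS
        SELECT * FROM base
        WHERE a IS NOT NULL AND p IS NOT NULL
        PRIMARY KEY (a, p);

    -- rejected: two non-PK columns (a and b) in the view's primary key
    CREATE MATERIALIZED VIEW base_by_a_b AS
        SELECT * FROM base
        WHERE a IS NOT NULL AND b IS NOT NULL AND p IS NOT NULL
        PRIMARY KEY ((a, b), p);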

On Mon, May 8, 2017 at 8:08 PM, Fridtjof Sander <
fridtjof.san...@googlemail.com> wrote:

> Hi,
>
> I'm struggling to understand some problems with respect to materialized
> views.
>
> First, I want to understand the example mentioned in
> https://issues.apache.org/jira/browse/CASSANDRA-9928 explaining how
> multiple non-PK columns in the view PK can lead to unrepairable/orphanized
> entries. I understand that only happens if a node dies that pushed an
> "intermediate" state (the result of only one of several updates affecting
> the same entry) to it's view replica. The case mentioned looks like the
> following: initially all nodes have (p=1, a=1, b=1). Then two concurrent
> updates are send: a=2 and b=2. One node gets b=2, deletes view (a=1, b=1,
> p=1) and inserts (a=1, b=2, p=1), then dies. The others get a=2, which is
> why they delete (a=1, b=1, p=1) and insert (a=2, b=1, p=1). Then (a=1,
> b=2, p=1) will never be deleted.
>
> What I don't understand is, why that can never happen with a single
> column. Consider (p=1, a=1) with two updates a=2 and a=3. One node receives
> a=2, deletes view entry (a=1, p=1) and inserts (a=2, p=1), then dies. The
> others get a=3, delete (a=1, p=1) and insert (a=3, p=1). Now, how is (a=2,
> p=1) removed from the view replica that was connected to the dying node? I
> don't get what's different here.
>
> I would really appreciate if someone could share some insight here!
>
> Fridtjof
>


Re: Order by for aggregated values

2017-06-06 Thread Jonathan Haddad
Unfortunately this feature falls in a category of *incredibly useful*
features that have gotten the -1 over the years because it doesn't scale
like we want it to.  As far as basic aggregations go, it's remarkably
trivial to roll up 100K-1MM items using very little memory, so at first it
seems like an easy problem.

There's a rub though.  Duy Hai is correct, there's a big issue with
pagination.  Paginating through results right now relies on tokens & not
offsets.  Paginating through aggregated data would require some serious
changes to how this works (I think).

It might be possible to generate temporary tables / partitions of the
aggregated results that are stored on disk & replicated to other nodes in
order to make pagination work correctly, but it starts to move into a fuzzy
area if it's even worth it.

For smaller datasets (under a few hundred thousand datapoints), I wouldn't
bother with Spark, it's overkill and imo the wrong tool for the job.  Ed
Capriolo had a suggestion for me that I loved a while ago - grab all the
raw data and operate on it in memory using H2 (for JVM) or Pandas / NumPy
(python).  This at least works with every version and won't require waiting
till Cassandra 5/6/7 is out.  Perform any rollups you might want and cache
them somewhere, perhaps back into a TTL'ed C* partition.
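
For concreteness, a rough sketch of that client-side approach using the DataStax
Python driver and pandas (contact point, keyspace, table and column names are
all made up):

    from cassandra.cluster import Cluster
    import pandas as pd

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('metrics')

    # pull the raw datapoints for the chosen time window (one partition here)
    rows = session.execute(
        "SELECT device_id, value FROM samples WHERE bucket = %s", ('2017-06-06',))

    # roll up in memory: average per device, then take the top 10 by that average
    df = pd.DataFrame([(r.device_id, r.value) for r in rows],
                      columns=['device_id', 'value'])
    top_n = (df.groupby('device_id')['value']
               .mean()
               .sort_values(ascending=False)
               .head(10))
    print(top_n)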

Jon

On Tue, Jun 6, 2017 at 11:39 AM DuyHai Doan  wrote:

> The problem is not that it's not feasible from Cassandra side, it is
>
> The problem is when doing arbitrary ORDER BY, Cassandra needs to resort to
> in-memory sorting of a potentially huge amout of data --> more pressure on
> heap --> impact on cluster stability
>
> Whereas delegating this kind of job to Spark which has appropriate data
> structure to lower heap pressure (Dataframe, project tungsten) is a better
> idea.
>
> "but in the Top N use case, far more data has to be transferred to the
> client when the client has to do the sorting"
>
> --> It is not true if you co-located your Spark worker with Cassandra
> nodes. In this case, Spark reading data out of Cassandra nodes are always
> node-local
>
>
>
> On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) 
> wrote:
>
>> Hi DuyHai,
>>
>>
>>
>> this is in response to the other points in your response.
>>
>>
>>
>> My application is a real-time application. It monitors devices in the
>> network and displays the top N devices for various parameters averaged over
>> a time period. A query may involve anywhere from 10 to 50k devices, and
>> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
>> seconds.
>>
>>
>>
>> My impression was that Spark is aimed at larger scale analytics.
>>
>>
>>
>> I am ok with the limitation on “group by”. I am intending to use async
>> queries and token-aware load balancing to partition the query and execute
>> it in parallel on each node.
>>
>>
>>
>> Thanks…
>>
>>
>>
>> Roger
>>
>>
>>
>>
>>
>> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
>> *Sent:* Tuesday, June 06, 2017 12:31 AM
>> *To:* Roger Fischer (CW) 
>> *Cc:* user@cassandra.apache.org
>> *Subject:* Re: Order by for aggregated values
>>
>>
>>
>> First Group By is only allowed on partition keys and clustering columns,
>> not on arbitrary column. The internal implementation of group by tries to
>> fetch data on clustering order to avoid having to "re-sort" them in memory
>> which would be very expensive
>>
>>
>>
>> Second, group by works best when restricted to a single partition other
>> wise it will force Cassandra to do a range scan so poor performance
>>
>>
>>
>>
>>
>> For all of those reasons I don't expect an "order by" on aggregated
>> values to be available any soon
>>
>>
>>
>> Furthermore, Cassandra is optimised for real-time transactional
>> scenarios, the group by/order by/limit is typically a classical analytics
>> scenario, I would recommend to use the appropriate tool like Spark for that
>>
>>
>>
>>
>>
>> On 6 June 2017 at 04:00, "Roger Fischer (CW)" wrote:
>>
>> Hello,
>>
>>
>>
>> is there any intent to support “order by” and “limit” on aggregated
>> values?
>>
>>
>>
>> For time series data, top n queries are quite common. Group-by was the
>> first step towards supporting such queries, but ordering by value and
>> limiting the results are also required.
>>
>>
>>
>> Thanks…
>>
>>
>>
>> Roger
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Order by for aggregated values

2017-06-06 Thread DuyHai Doan
The problem is not that it's not feasible from the Cassandra side; it is feasible.

The problem is that when doing an arbitrary ORDER BY, Cassandra needs to resort to
in-memory sorting of a potentially huge amount of data --> more pressure on the
heap --> impact on cluster stability

Whereas delegating this kind of job to Spark which has appropriate data
structure to lower heap pressure (Dataframe, project tungsten) is a better
idea.

"but in the Top N use case, far more data has to be transferred to the
client when the client has to do the sorting"

--> That is not true if you co-locate your Spark workers with the Cassandra
nodes. In that case, Spark reads out of Cassandra are always
node-local.
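
As a rough sketch of what such a co-located Spark job could look like with the
spark-cassandra-connector DataFrame API (keyspace, table and column names are
made up; connector packaging and configuration omitted):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, desc

    spark = SparkSession.builder.appName("top-n-devices").getOrCreate()

    # read the raw samples through the connector (node-local when co-located)
    samples = (spark.read
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace="metrics", table="samples")
               .load())

    # average per device over the chosen window, then take the top 10
    top_n = (samples.where("bucket = '2017-06-06'")
             .groupBy("device_id")
             .agg(avg("value").alias("avg_value"))
             .orderBy(desc("avg_value"))
             .limit(10))

    top_n.show()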



On Tue, Jun 6, 2017 at 6:20 PM, Roger Fischer (CW) 
wrote:

> Hi DuyHai,
>
>
>
> this is in response to the other points in your response.
>
>
>
> My application is a real-time application. It monitors devices in the
> network and displays the top N devices for various parameters averaged over
> a time period. A query may involve anywhere from 10 to 50k devices, and
> anywhere from 5 to 2000 intervals. We expect a query to take less than 2
> seconds.
>
>
>
> My impression was that Spark is aimed at larger scale analytics.
>
>
>
> I am ok with the limitation on “group by”. I am intending to use async
> queries and token-aware load balancing to partition the query and execute
> it in parallel on each node.
>
>
>
> Thanks…
>
>
>
> Roger
>
>
>
>
>
> *From:* DuyHai Doan [mailto:doanduy...@gmail.com]
> *Sent:* Tuesday, June 06, 2017 12:31 AM
> *To:* Roger Fischer (CW) 
> *Cc:* user@cassandra.apache.org
> *Subject:* Re: Order by for aggregated values
>
>
>
> First Group By is only allowed on partition keys and clustering columns,
> not on arbitrary column. The internal implementation of group by tries to
> fetch data on clustering order to avoid having to "re-sort" them in memory
> which would be very expensive
>
>
>
> Second, group by works best when restricted to a single partition other
> wise it will force Cassandra to do a range scan so poor performance
>
>
>
>
>
> For all of those reasons I don't expect an "order by" on aggregated values
> to be available any soon
>
>
>
> Furthermore, Cassandra is optimised for real-time transactional scenarios,
> the group by/order by/limit is typically a classical analytics scenario, I
> would recommend to use the appropriate tool like Spark for that
>
>
>
>
>
> On 6 June 2017 at 04:00, "Roger Fischer (CW)" wrote:
>
> Hello,
>
>
>
> is there any intent to support “order by” and “limit” on aggregated values?
>
>
>
> For time series data, top n queries are quite common. Group-by was the
> first step towards supporting such queries, but ordering by value and
> limiting the results are also required.
>
>
>
> Thanks…
>
>
>
> Roger
>
>
>
>
>
>
>
>
>


Re: Partition range incremental repairs

2017-06-06 Thread Anuj Wadehra
Hi Chris,
Can your share following info:
1. Exact repair commands you use for inc repair and pr repair
2. Repair time should be measured at the cluster level for inc repair. So, what's
the total time it takes to run repair on all nodes for incremental vs pr
repairs?
3. You are repairing one DC, DC3. How many DCs are there in total, and what's the
RF for the keyspaces? Running pr on a specific DC would not repair the entire data.
4. 885 ranges? Where did you get this number? Logs? Can you share the
number of ranges printed in the logs for both the inc and pr cases?

Thanks
Anuj

Sent from Yahoo Mail on Android 
 
On Tue, Jun 6, 2017 at 9:33 PM, Chris Stokesmore wrote:
Thank you for the excellent 
and clear description of the different versions of repair Anuj, that has 
cleared up what I expect to be happening.
The problem now is in our cluster, we are running repairs with options 
(parallelism: parallel, primary range: false, incremental: true, job threads: 
1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885) and 
when we do our repairs are taking over a day to complete when previously when 
running with the partition range option they were taking more like 8-9 hours.
As I understand it, using incremental should have sped this process up as all 
three sets of data on each repair job should be marked as repaired however this 
does not seem to be the case. Any ideas?
Chris

On 6 Jun 2017, at 16:08, Anuj Wadehra  wrote:
Hi Chris,
Using pr with incremental repairs does not make sense. Primary range repair is 
an optimization over full repair. If you run full repair on a n node cluster 
with RF=3, you would be repairing each data thrice. E.g. in a 5 node cluster 
with RF=3, a range may exist on node A,B and C . When full repair is run on 
node A, the entire data in that range gets synced with replicas on node B and 
C. Now, when you run full repair on nodes B and C, you are wasting resources on 
repairing data which is already repaired. 
Primary range repair ensures that when you run repair on a node, it ONLY 
repairs the data which is owned by the node. Thus, no node repairs data which 
is not owned by it and must be repaired by other node. Redundant work is 
eliminated. 
Even in pr, each time you run pr on all nodes, you repair 100% of data. Why to 
repair complete data in each cycle?? ..even data which has not even changed 
since the last repair cycle?
This is where Incremental repair comes as an improvement. Once repaired, a data 
would be marked repaired so that the next repair cycle could just focus on 
repairing the delta. Now, lets go back to the example of 5 node cluster with RF 
=3.This time we run incremental repair on all nodes. When you repair entire 
data on node A, all 3 replicas are marked as repaired. Even if you run inc 
repair on all ranges on the second node, you would not re-repair the already 
repaired data. Thus, there is no advantage of repairing only the data owned by 
the node (primary range of the node). You can run inc repair on all the data 
present on a node and Cassandra would make sure that when you repair data on 
other nodes, you only repair unrepaired data.
Thanks
Anuj


Sent from Yahoo Mail on Android 
 
On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore wrote:
Hi all,

Wondering if anyone had any thoughts on this? At the moment the long running 
repairs cause us to be running them on two nodes at once for a bit of time, 
which obviously increases the cluster load.

On 2017-05-25 16:18 (+0100), Chris Stokesmore  wrote: 
> Hi,> 
> 
> We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running 
> repairs with the -pr option, via a cron job that runs on each node once per 
> week.> 
> 
> We changed that as some advice on the Cassandra IRC channel said it would 
> cause more anticompaction and  
> http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
>   says 'Performing partitioner range repairs by using the -pr option is 
> generally considered a good choice for doing manual repairs. However, this 
> option cannot be used with incremental repairs (default for Cassandra 2.2 and 
> later)'
> 
> Only problem is our -pr repairs were taking about 8 hours, and now the non-pr 
> repair are taking 24+ - I guess this makes sense, repairing 1/7 of data 
> increased to 3/7, except I was hoping to see a speed up after the first loop 
> through the cluster as each repair will be marking much more data as 
> repaired, right?> 
> 
> 
> Is running -pr with incremental repairs really that bad? > 
  


  


RE: Order by for aggregated values

2017-06-06 Thread Roger Fischer (CW)
Hi DuyHai,

this is in response to the other points in your response.

My application is a real-time application. It monitors devices in the network 
and displays the top N devices for various parameters averaged over a time 
period. A query may involve anywhere from 10 to 50k devices, and anywhere from 
5 to 2000 intervals. We expect a query to take less than 2 seconds.

My impression was that Spark is aimed at larger scale analytics.

I am ok with the limitation on “group by”. I am intending to use async queries 
and token-aware load balancing to partition the query and execute it in 
parallel on each node.
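
A rough sketch of that pattern with the DataStax Python driver (all names are
placeholders; it assumes a token-aware load balancing policy is configured and
uses prepared statements so the driver can route each request by token):

    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('metrics')
    stmt = session.prepare(
        "SELECT device_id, value FROM samples WHERE metric = ? AND bucket = ?")

    # fire one async query per sub-partition; each request goes to a replica
    # that owns the corresponding partition
    futures = [session.execute_async(stmt, ('cpu_util', b)) for b in range(16)]
    rows = [row for f in futures for row in f.result()]
    # aggregate and sort `rows` client-side to produce the top N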

Thanks…

Roger


From: DuyHai Doan [mailto:doanduy...@gmail.com]
Sent: Tuesday, June 06, 2017 12:31 AM
To: Roger Fischer (CW) 
Cc: user@cassandra.apache.org
Subject: Re: Order by for aggregated values

First Group By is only allowed on partition keys and clustering columns, not on 
arbitrary column. The internal implementation of group by tries to fetch data 
on clustering order to avoid having to "re-sort" them in memory which would be 
very expensive

Second, group by works best when restricted to a single partition other wise it 
will force Cassandra to do a range scan so poor performance


For all of those reasons I don't expect an "order by" on aggregated values to 
be available any soon

Furthermore, Cassandra is optimised for real-time transactional scenarios, the 
group by/order by/limit is typically a classical analytics scenario, I would 
recommend to use the appropriate tool like Spark for that


On 6 June 2017 at 04:00, "Roger Fischer (CW)" wrote:
Hello,

is there any intent to support “order by” and “limit” on aggregated values?

For time series data, top n queries are quite common. Group-by was the first 
step towards supporting such queries, but ordering by value and limiting the 
results are also required.

Thanks…

Roger






RE: Order by for aggregated values

2017-06-06 Thread Roger Fischer (CW)
Hi DuyHai,

thanks for your response.

I understand the reservations about implementing sorting in Cassandra. But I 
think it is analogous to filtering. It may be bad in the general case, but can 
be useful for particular use cases.

If Cassandra does not provide “order-by”, then the ordering has to be done in 
the client (or an intermediate tool like Spark). The cost of ordering will be 
the same, but in the Top N use case, far more data has to be transferred to the 
client when the client has to do the sorting.

So I think, with a qualification “ALLOW ORDERING”, it would be reasonable to 
support “order by” on aggregated values.
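
To make the proposal concrete, the kind of statement being asked for would look
roughly like this (hypothetical syntax; neither ORDER BY on an aggregate nor
ALLOW ORDERING exists in CQL today, and all names are made up):

    SELECT device_id, avg(value) AS avg_value
    FROM samples
    WHERE metric = 'cpu_util' AND bucket = '2017-06-06'
    GROUP BY device_id
    ORDER BY avg_value DESC
    LIMIT 10
    ALLOW ORDERING;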

Thanks…

Roger



From: DuyHai Doan [mailto:doanduy...@gmail.com]
Sent: Tuesday, June 06, 2017 12:31 AM
To: Roger Fischer (CW) 
Cc: user@cassandra.apache.org
Subject: Re: Order by for aggregated values

First Group By is only allowed on partition keys and clustering columns, not on 
arbitrary column. The internal implementation of group by tries to fetch data 
on clustering order to avoid having to "re-sort" them in memory which would be 
very expensive

Second, group by works best when restricted to a single partition other wise it 
will force Cassandra to do a range scan so poor performance


For all of those reasons I don't expect an "order by" on aggregated values to 
be available any soon

Furthermore, Cassandra is optimised for real-time transactional scenarios, the 
group by/order by/limit is typically a classical analytics scenario, I would 
recommend to use the appropriate tool like Spark for that


On 6 June 2017 at 04:00, "Roger Fischer (CW)" wrote:
Hello,

is there any intent to support “order by” and “limit” on aggregated values?

For time series data, top n queries are quite common. Group-by was the first 
step towards supporting such queries, but ordering by value and limiting the 
results are also required.

Thanks…

Roger






Re: Partition range incremental repairs

2017-06-06 Thread Chris Stokesmore
Thank you for the excellent and clear description of the different versions of 
repair Anuj, that has cleared up what I expect to be happening.

The problem now is in our cluster, we are running repairs with options 
(parallelism: parallel, primary range: false, incremental: true, job threads: 
1, ColumnFamilies: [], dataCenters: [DC3], hosts: [], # of ranges: 885) and 
when we do our repairs are taking over a day to complete when previously when 
running with the partition range option they were taking more like 8-9 hours.

As I understand it, using incremental should have sped this process up as all 
three sets of data on each repair job should be marked as repaired however this 
does not seem to be the case. Any ideas?

Chris

> On 6 Jun 2017, at 16:08, Anuj Wadehra  wrote:
> 
> Hi Chris,
> 
> Using pr with incremental repairs does not make sense. Primary range repair 
> is an optimization over full repair. If you run full repair on a n node 
> cluster with RF=3, you would be repairing each data thrice. 
> E.g. in a 5 node cluster with RF=3, a range may exist on node A,B and C . 
> When full repair is run on node A, the entire data in that range gets synced 
> with replicas on node B and C. Now, when you run full repair on nodes B and 
> C, you are wasting resources on repairing data which is already repaired. 
> 
> Primary range repair ensures that when you run repair on a node, it ONLY 
> repairs the data which is owned by the node. Thus, no node repairs data which 
> is not owned by it and must be repaired by other node. Redundant work is 
> eliminated. 
> 
> Even in pr, each time you run pr on all nodes, you repair 100% of data. Why 
> to repair complete data in each cycle?? ..even data which has not even 
> changed since the last repair cycle?
> 
> This is where Incremental repair comes as an improvement. Once repaired, a 
> data would be marked repaired so that the next repair cycle could just focus 
> on repairing the delta. Now, lets go back to the example of 5 node cluster 
> with RF =3.This time we run incremental repair on all nodes. When you repair 
> entire data on node A, all 3 replicas are marked as repaired. Even if you run 
> inc repair on all ranges on the second node, you would not re-repair the 
> already repaired data. Thus, there is no advantage of repairing only the data 
> owned by the node (primary range of the node). You can run inc repair on all 
> the data present on a node and Cassandra would make sure that when you repair 
> data on other nodes, you only repair unrepaired data.
> 
> Thanks
> Anuj
> 
> 
> 
> Sent from Yahoo Mail on Android 
> 
> On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore
>  wrote:
> Hi all,
> 
> Wondering if anyone had any thoughts on this? At the moment the long running 
> repairs cause us to be running them on two nodes at once for a bit of time, 
> which obviously increases the cluster load.
> 
> On 2017-05-25 16:18 (+0100), Chris Stokesmore  > wrote: 
> > Hi,> 
> > 
> > We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running 
> > repairs with the -pr option, via a cron job that runs on each node once per 
> > week.> 
> > 
> > We changed that as some advice on the Cassandra IRC channel said it would 
> > cause more anticompaction and  
> > http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
> >   
> > says
> >  'Performing partitioner range repairs by using the -pr option is generally 
> > considered a good choice for doing manual repairs. However, this option 
> > cannot be used with incremental repairs (default for Cassandra 2.2 and 
> > later)'
> > 
> > Only problem is our -pr repairs were taking about 8 hours, and now the 
> > non-pr repair are taking 24+ - I guess this makes sense, repairing 1/7 of 
> > data increased to 3/7, except I was hoping to see a speed up after the 
> > first loop through the cluster as each repair will be marking much more 
> > data as repaired, right?> 
> > 
> > 
> > Is running -pr with incremental repairs really that bad? > 



Re: Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Thanks Alexander for the help, lots of good info in there.

I'll try to switch back to CMS and see how it fares.


On Tue, Jun 6, 2017, at 05:06 PM, Alexander Dejanovski wrote:
> Hi Vincent,
> 
> it is very clear, thanks for all the info.
> 
> I would not stick with G1 in your case, as it requires much more heap
> to perform correctly (>24GB).> CMS/ParNew should be much more efficient here 
> and I would go with some
> settings I usually apply on big workloads : 16GB heap / 6GB new gen /
> MaxTenuringThreshold = 5> 
> Large partitions are indeed putting pressure on your heap and
> tombstones as well.> One of your queries is particularly caveated : SELECT 
> app,
> platform, slug, partition, user_id, attributes, state, timezone,
> version FROM table WHERE app = ? AND platform = ? AND slug = ? AND
> partition = ? LIMIT ?> Although you're using the LIMIT clause, it will read 
> the whole
> partition, merge it in memory and only then will it apply the LIMIT.
> Check this blog post for more detailed info :
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html>
>  This can lead you to read the whole 450MB and all the tombstones even
> though you're only targeting a few rows in the partition.> Large partitions 
> are also creating heap pressure during compactions,
> which will issue warnings in the logs (look for "large partition").> 
> You should remove the delete/insert logged batch as it will spread
> over multiple partitions, which is bad for many reasons. It gives you
> no real atomicity, but just the guaranty that if one query succeeds,
> then the rest of the queries will eventually succeed (and that could
> possibly take some time, leaving the cluster in an inconsistent state
> in the meantime). Logged batches have a lot of overheads, one of them
> being a write of the queries to the batchlog table, which will be
> replicated to 2 other nodes, and then deleted after the batch has
> completed.> You'd better turn those into async queries with an external retry
> mechanism.> 
> Tuning the GC should help coping with your data modeling issues. 
> 
> For safety reasons, only change the GC settings for one canary
> node, observe and compare its behavior over a full day. If the
> results are satisfying, generalize to the rest of the cluster. You
> need to experience peak load to make sure the new settings are
> fixing your issues.> 
> Cheers,
> 
> 
> 
> On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann
>  wrote:>> __
>> Hi Alexander.
>> 
>> Yeah, the minor GCs I see are usually around 300ms but sometimes
>> jumping to 1s or even more.>> 
>> Hardware specs are:
>>   - 8 core CPUs
>>   - 32 GB of RAM
>>   - 4 SSDs in hardware Raid 0, around 3TB of space per node
>>  
>> GC settings:-Xmx12G -Xms12G -XX:+UseG1GC -
>> XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200 -
>> XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8 -
>> XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled>> 
>> According to the graphs, there are approximately one Young GC every
>> 10s or so, and almost no Full GCs (for example the last one was 2h45
>> after the previous one).>> 
>> Computed from the log files, average Young GC seems to be around
>> 280ms and max is 2.5s.>> Average Full GC seems to be around 4.6s and max is 
>> 5.3s.
>> I only computed this on one node but the problem occurs on every node
>> as far as I can see.>> 
>> I'm open to tuning the GC, I stuck with defaults (that I think I saw
>> in the cassandra conf, I'm not sure).>> 
>> Number of SSTables looks ok, p75 is at 4 (as is the max for that
>> matter). Partitions size is a problem yeah, this particular table
>> from which we read a lot has a max partition size of 450 MB. I've
>> known about this problem for a long time actually, we already did a
>> bunch of work reducing partition size I think a year ago, but this
>> particular table is tricky to change.>> 
>> One thing to note about this table is that we do a ton of DELETEs
>> regularly (that we can't really stop doing except completely
>> redesigning the table), so we have a ton of tombstones too. We have a
>> lot of warnings about the tombstone threshold when we do our selects
>> (things like "Read 2001 live and 2528 tombstone cells"). I suppose
>> this could be a factor ?>> 
>> Each query reads from a single partition key yes, but as said we
>> issue a lot of them at the same time.>> 
>> The table looks like this (simplified):
>> 
>> CREATE TABLE table (
>> app text,
>> platform text,
>> slug text,
>> partition int,
>> user_id text,
>> attributes blob,
>> state int,
>> timezone text,
>> version int,
>> PRIMARY KEY ((app, platform, slug, partition), user_id)
>> ) WITH CLUSTERING ORDER BY (user_id ASC)
>> 
>> And the main queries are:
>> 
>> SELECT app, platform, slug, partition, user_id, attributes,
>> state, timezone, version>> FROM table WHERE app = ? AND platform = ? 
>> AND 

Re: Partition range incremental repairs

2017-06-06 Thread Anuj Wadehra
Hi Chris,
Using pr with incremental repairs does not make sense. Primary range repair is
an optimization over full repair. If you run full repair on an n-node cluster
with RF=3, you would be repairing each piece of data three times. E.g. in a 5 node cluster
with RF=3, a range may exist on nodes A, B and C. When full repair is run on
node A, the entire data in that range gets synced with the replicas on nodes B and
C. Now, when you run full repair on nodes B and C, you are wasting resources on
repairing data which is already repaired.

Primary range repair ensures that when you run repair on a node, it ONLY
repairs the data which is owned by that node. Thus, no node repairs data which
is not owned by it and must be repaired by another node. Redundant work is
eliminated.

Even with pr, each time you run pr on all nodes, you repair 100% of the data. Why
repair the complete data set in each cycle, even data which has not changed
since the last repair cycle?

This is where incremental repair comes in as an improvement. Once repaired, data
is marked as repaired so that the next repair cycle can just focus on
repairing the delta. Now, let's go back to the example of a 5 node cluster with
RF=3. This time we run incremental repair on all nodes. When you repair the entire
data set on node A, all 3 replicas are marked as repaired. Even if you run inc
repair on all ranges on the second node, you would not re-repair the already
repaired data. Thus, there is no advantage to repairing only the data owned by
the node (the primary range of the node). You can run inc repair on all the data
present on a node and Cassandra will make sure that when you repair data on
other nodes, you only repair unrepaired data.
Thanks
Anuj


Sent from Yahoo Mail on Android 
 
On Tue, Jun 6, 2017 at 4:27 PM, Chris Stokesmore wrote:
Hi all,

Wondering if anyone had any thoughts on this? At the moment the long running 
repairs cause us to be running them on two nodes at once for a bit of time, 
which obviously increases the cluster load.

On 2017-05-25 16:18 (+0100), Chris Stokesmore  wrote: 
> Hi,> 
> 
> We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running 
> repairs with the -pr option, via a cron job that runs on each node once per 
> week.> 
> 
> We changed that as some advice on the Cassandra IRC channel said it would 
> cause more anticompaction and  
> http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
>   says 'Performing partitioner range repairs by using the -pr option is 
> generally considered a good choice for doing manual repairs. However, this 
> option cannot be used with incremental repairs (default for Cassandra 2.2 and 
> later)'
> 
> Only problem is our -pr repairs were taking about 8 hours, and now the non-pr 
> repair are taking 24+ - I guess this makes sense, repairing 1/7 of data 
> increased to 3/7, except I was hoping to see a speed up after the first loop 
> through the cluster as each repair will be marking much more data as 
> repaired, right?> 
> 
> 
> Is running -pr with incremental repairs really that bad? > 
  


Re: Regular dropped READ messages

2017-06-06 Thread Alexander Dejanovski
Hi Vincent,

it is very clear, thanks for all the info.

I would not stick with G1 in your case, as it requires much more heap to
perform correctly (>24GB).
CMS/ParNew should be much more efficient here and I would go with some
settings I usually apply on big workloads : 16GB heap / 6GB new gen
/ MaxTenuringThreshold = 5
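
Roughly, in cassandra-env.sh terms, that suggestion translates to something like
the following (a sketch only; the survivor ratio and CMS occupancy flags shown
are the long-standing Cassandra defaults rather than part of the recommendation
above):

    -Xms16G -Xmx16G
    -Xmn6G
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled
    -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=5
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly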

Large partitions are indeed putting pressure on your heap and tombstones as
well.
One of your queries is particularly caveated : SELECT app, platform, slug,
partition, user_id, attributes, state, timezone, version FROM table WHERE
app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?
Although you're using the LIMIT clause, it will read the whole partition,
merge it in memory and only then will it apply the LIMIT. Check this blog
post for more detailed info :
http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
This can lead you to read the whole 450MB and all the tombstones even
though you're only targeting a few rows in the partition.
Large partitions are also creating heap pressure during compactions, which
will issue warnings in the logs (look for "large partition").

You should remove the delete/insert logged batch as it will spread over
multiple partitions, which is bad for many reasons. It gives you no real
atomicity, but just the guarantee that if one query succeeds, then the rest
of the queries will eventually succeed (and that could possibly take some
time, leaving the cluster in an inconsistent state in the meantime). Logged
batches have a lot of overheads, one of them being a write of the queries
to the batchlog table, which will be replicated to 2 other nodes, and then
deleted after the batch has completed.
You'd better turn those into async queries with an external retry mechanism.

Tuning the GC should help coping with your data modeling issues.

For safety reasons, only change the GC settings for one canary node,
observe and compare its behavior over a full day. If the results are
satisfying, generalize to the rest of the cluster. You need to experience
peak load to make sure the new settings are fixing your issues.

Cheers,



On Tue, Jun 6, 2017 at 4:22 PM Vincent Rischmann  wrote:

> Hi Alexander.
>
> Yeah, the minor GCs I see are usually around 300ms but sometimes jumping
> to 1s or even more.
>
> Hardware specs are:
>   - 8 core CPUs
>   - 32 GB of RAM
>   - 4 SSDs in hardware Raid 0, around 3TB of space per node
>
> GC settings:-Xmx12G -Xms12G -XX:+UseG1GC
> -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
> -XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8
> -XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
>
> According to the graphs, there are approximately one Young GC every 10s or
> so, and almost no Full GCs (for example the last one was 2h45 after the
> previous one).
>
> Computed from the log files, average Young GC seems to be around 280ms and
> max is 2.5s.
> Average Full GC seems to be around 4.6s and max is 5.3s.
> I only computed this on one node but the problem occurs on every node as
> far as I can see.
>
> I'm open to tuning the GC, I stuck with defaults (that I think I saw in
> the cassandra conf, I'm not sure).
>
> Number of SSTables looks ok, p75 is at 4 (as is the max for that matter).
> Partitions size is a problem yeah, this particular table from which we read
> a lot has a max partition size of 450 MB. I've known about this problem for
> a long time actually, we already did a bunch of work reducing partition
> size I think a year ago, but this particular table is tricky to change.
>
> One thing to note about this table is that we do a ton of DELETEs
> regularly (that we can't really stop doing except completely redesigning
> the table), so we have a ton of tombstones too. We have a lot of warnings
> about the tombstone threshold when we do our selects (things like "Read
> 2001 live and 2528 tombstone cells"). I suppose this could be a factor ?
>
> Each query reads from a single partition key yes, but as said we issue a
> lot of them at the same time.
>
> The table looks like this (simplified):
>
> CREATE TABLE table (
> app text,
> platform text,
> slug text,
> partition int,
> user_id text,
> attributes blob,
> state int,
> timezone text,
> version int,
> PRIMARY KEY ((app, platform, slug, partition), user_id)
> ) WITH CLUSTERING ORDER BY (user_id ASC)
>
> And the main queries are:
>
> SELECT app, platform, slug, partition, user_id, attributes, state,
> timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition =
> ? LIMIT ?
>
> SELECT app, platform, slug, partition, user_id, attributes, state,
> timezone, version
> FROM table WHERE app = ? AND platform = ? AND slug = ? AND partition =
> ? AND user_id >= ? LIMIT ?
>
> partition is basically an integer that goes from 0 to 15, and we always
> select the 16 partitions in parallel.
>
> Note that we write 

Re: Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Hi Alexander.

Yeah, the minor GCs I see are usually around 300ms but sometimes jumping
to 1s or even more.
Hardware specs are:
  - 8 core CPUs
  - 32 GB of RAM
  - 4 SSDs in hardware Raid 0, around 3TB of space per node
 
GC settings:
-Xmx12G -Xms12G -XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=8
-XX:ConcGCThreads=8 -XX:+ParallelRefProcEnabled
According to the graphs, there is approximately one Young GC every 10s
or so, and almost no Full GCs (for example the last one was 2h45 after
the previous one).
Computed from the log files, average Young GC seems to be around 280ms
and max is 2.5s. Average Full GC seems to be around 4.6s and max is 5.3s.
I only computed this on one node but the problem occurs on every node as
far as I can see.
I'm open to tuning the GC, I stuck with defaults (that I think I saw in
the cassandra conf, I'm not sure).
Number of SSTables looks ok, p75 is at 4 (as is the max for that
matter). Partitions size is a problem yeah, this particular table from
which we read a lot has a max partition size of 450 MB. I've known about
this problem for a long time actually, we already did a bunch of work
reducing partition size I think a year ago, but this particular table is
tricky to change.
One thing to note about this table is that we do a ton of DELETEs
regularly (that we can't really stop doing except completely redesigning
the table), so we have a ton of tombstones too. We have a lot of
warnings about the tombstone threshold when we do our selects (things
like "Read 2001 live and 2528 tombstone cells"). I suppose this could be
a factor ?
Each query reads from a single partition key yes, but as said we issue a
lot of them at the same time.
The table looks like this (simplified):

CREATE TABLE table (
app text,
platform text,
slug text,
partition int,
user_id text,
attributes blob,
state int,
timezone text,
version int,
PRIMARY KEY ((app, platform, slug, partition), user_id)
) WITH CLUSTERING ORDER BY (user_id ASC)

And the main queries are:

SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ? LIMIT ?

SELECT app, platform, slug, partition, user_id, attributes, state, timezone, version
FROM table
WHERE app = ? AND platform = ? AND slug = ? AND partition = ? AND user_id >= ? LIMIT ?
partition is basically an integer that goes from 0 to 15, and we always
select the 16 partitions in parallel.
Note that we write constantly to this table, to update some fields and
insert the user into the new "slug" (a slug is an amalgamation of
different parameters like state, timezone etc. that allows us to
efficiently query all users from a particular "app" with a given "slug".
At least that's the idea; as seen here it causes us some trouble).
And yes, we do use batches to write this data. This is how we process
each user update:
  - SELECT from a "master" slug to get the fields we need
  - from that, compute a list of slugs the user had and a list of slugs
the user should have (for example if he changes timezone we have to
update the slug)
  - delete the user from the slug he shouldn't be in and insert the user
where he should be.
The last part, delete/insert, is done in a logged batch.

I hope it's relatively clear.

On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
> Hi Vincent, 
> 
> dropped messages are indeed common in case of long GC pauses. 
> Having 4s to 6s pauses is not normal and is the sign of an unhealthy
> cluster. Minor GCs are usually faster but you can have long ones too.> 
> If you can share your hardware specs along with your current GC
> settings (CMS or G1, heap size, young gen size) and a distribution of
> GC pauses (rate of minor GCs, average and max duration of GCs) we
> could try to help you tune your heap settings.> You can activate full GC 
> logging which could help in fine tuning
> MaxTenuringThreshold and survivor space sizing.> 
> You should also check for max partition sizes and number of SSTables
> accessed per read. Run nodetool cfstats/cfhistograms on your tables to
> get both. p75 should be less or equal to 4 in number of SSTables  and
> you shouldn't have partitions over... let's say 300 MBs. Partitions >
> 1GB are a critical problem to address.> 
> Other things to consider are : 
> Do you read from a single partition for each query ? 
> Do you use collections that could spread over many SSTables ? 
> Do you use batches for writes (although your problem doesn't seem to
> be write related) ?> Can you share the queries from your scheduled selects 
> and the
> data model ?> 
> Cheers,
> 
> 
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann
>  wrote:>> __
>> Hi,
>> 
>> we have a cluster of 11 nodes running Cassandra 2.2.9 where we
>> regularly get READ messages 

Re: Regular dropped READ messages

2017-06-06 Thread Alexander Dejanovski
Hi Vincent,

dropped messages are indeed common in case of long GC pauses.
Having 4s to 6s pauses is not normal and is the sign of an unhealthy
cluster. Minor GCs are usually faster but you can have long ones too.

If you can share your hardware specs along with your current GC settings
(CMS or G1, heap size, young gen size) and a distribution of GC pauses
(rate of minor GCs, average and max duration of GCs) we could try to help
you tune your heap settings.
You can activate full GC logging which could help in fine tuning
MaxTenuringThreshold and survivor space sizing.

You should also check for max partition sizes and number of SSTables
accessed per read. Run nodetool cfstats/cfhistograms on your tables to get
both. p75 should be less or equal to 4 in number of SSTables  and you
shouldn't have partitions over... let's say 300 MBs. Partitions > 1GB are a
critical problem to address.

Other things to consider are :
Do you read from a single partition for each query ?
Do you use collections that could spread over many SSTables ?
Do you use batches for writes (although your problem doesn't seem to be
write related) ?
Can you share the queries from your scheduled selects and the data model ?

Cheers,


On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann  wrote:

> Hi,
>
> we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
> get READ messages dropped:
>
> > READ messages were dropped in last 5000 ms: 974 for internal timeout and
> 0 for cross node timeout
>
> Looking at the logs, some are logged at the same time as Old Gen GCs.
> These GCs all take around 4 to 6s to run. To me, it's "normal" that these
> could cause reads to be dropped.
> However, we also have reads dropped without Old Gen GCs occurring, only
> Young Gen.
>
> I'm wondering if anyone has a good way of determining what the _root_
> cause could be. Up until now, the only way we managed to decrease load on
> our cluster was by guessing some stuff, trying it out and being lucky,
> essentially. I'd love a way to make sure what the problem is before
> tackling it. Doing schema changes is not a problem, but changing stuff
> blindly is not super efficient :)
>
> What I do see in the logs, is that these happen almost exclusively when we
> do a lot of SELECT.  The time logged almost always correspond to times
> where our schedules SELECTs are happening. That narrows the scope a little,
> but still.
>
> Anyway, I'd appreciate any information about troubleshooting this scenario.
> Thanks.
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Regular dropped READ messages

2017-06-06 Thread Vincent Rischmann
Hi,

we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
get READ messages dropped:
> READ messages were dropped in last 5000 ms: 974 for internal timeout
> and 0 for cross node timeout
Looking at the logs, some are logged at the same time as Old Gen GCs.
These GCs all take around 4 to 6s to run. To me, it's "normal" that
these could cause reads to be dropped. However, we also have reads dropped
without Old Gen GCs occurring, only
Young Gen.
I'm wondering if anyone has a good way of determining what the _root_
cause could be. Up until now, the only way we managed to decrease load
on our cluster was by guessing some stuff, trying it out and being
lucky, essentially. I'd love a way to make sure what the problem is
before tackling it. Doing schema changes is not a problem, but changing
stuff blindly is not super efficient :)
What I do see in the logs, is that these happen almost exclusively when
we do a lot of SELECT.  The time logged almost always correspond to
times where our schedules SELECTs are happening. That narrows the scope
a little, but still.
Anyway, I'd appreciate any information about troubleshooting this
scenario.
Thanks.


Re: Partition range incremental repairs

2017-06-06 Thread Chris Stokesmore
Hi all,

Wondering if anyone had any thoughts on this? At the moment the long running 
repairs cause us to be running them on two nodes at once for a bit of time, 
which obviously increases the cluster load.

On 2017-05-25 16:18 (+0100), Chris Stokesmore  wrote: 
> Hi,> 
> 
> We are running a 7 node Cassandra 2.2.8 cluster, RF=3, and had been running 
> repairs with the -pr option, via a cron job that runs on each node once per 
> week.> 
> 
> We changed that as some advice on the Cassandra IRC channel said it would 
> cause more anticompaction and  
> http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
>   says 'Performing partitioner range repairs by using the -pr option is 
> generally considered a good choice for doing manual repairs. However, this 
> option cannot be used with incremental repairs (default for Cassandra 2.2 and 
> later)'
> 
> Only problem is our -pr repairs were taking about 8 hours, and now the non-pr 
> repair are taking 24+ - I guess this makes sense, repairing 1/7 of data 
> increased to 3/7, except I was hoping to see a speed up after the first loop 
> through the cluster as each repair will be marking much more data as 
> repaired, right?> 
> 
> 
> Is running -pr with incremental repairs really that bad? > 



Re: Order by for aggregated values

2017-06-06 Thread DuyHai Doan
First Group By is only allowed on partition keys and clustering columns,
not on arbitrary column. The internal implementation of group by tries to
fetch data on clustering order to avoid having to "re-sort" them in memory
which would be very expensive

Second, group by works best when restricted to a single partition other
wise it will force Cassandra to do a range scan so poor performance


For all of those reasons I don't expect an "order by" on aggregated values
to be available anytime soon

Furthermore, Cassandra is optimised for real-time transactional scenarios;
group by/order by/limit is typically a classical analytics scenario, and I
would recommend using an appropriate tool like Spark for that


On 6 June 2017 at 04:00, "Roger Fischer (CW)" wrote:

Hello,



is there any intent to support “order by” and “limit” on aggregated values?



For time series data, top n queries are quite common. Group-by was the
first step towards supporting such queries, but ordering by value and
limiting the results are also required.



Thanks…



Roger