Re: Secondary Indexes On Partitioned Time Series Data Question

2013-08-02 Thread Gareth Collins
OK, thanks for the information.

Gareth

On Thu, Aug 1, 2013 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote:
 On Thu, Aug 1, 2013 at 12:49 PM, Gareth Collins gareth.o.coll...@gmail.com
 wrote:

 Would this be correct? Just making sure I understand how to best use
 secondary indexes in Cassandra with time series data.


 In general unless you ABSOLUTELY NEED the one unique feature of built-in
 Secondary Indexes (atomic update of base row and index) you should just use
 a normal column family for secondary index cases.

 =Rob


Secondary Indexes On Partitioned Time Series Data Question

2013-08-01 Thread Gareth Collins
Hello,

Say I have time series data for a table like this:

CREATE TABLE mytimeseries (
    pk_part1        text,
    partition       bigint,    -- e.g. partition per day or per hour
    pk_part2        text,      -- part of the partition key so I can split write load
    message_id      timeuuid,
    secondary_key1  text,
    secondary_key2  text,
    -- ... more columns ...
    PRIMARY KEY ((pk_part1, partition, pk_part2), message_id)
);

Most of the time I will need to do queries with
pk_part1/partition/pk_part2/message_id range. So this is what I
optimize for.
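
As a rough illustration of the "partition per day or per hour" column, here is a minimal Python sketch of how a client might derive the bucket value from a timestamp, and which buckets a message_id time range would touch. The helper names (partition_bucket, buckets_for_range) and the choice of "whole buckets since the Unix epoch" are assumptions made for this example; the thread does not say how the bigint is actually computed.

```python
from datetime import datetime, timezone
from typing import List

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def partition_bucket(ts: datetime, bucket_seconds: int = SECONDS_PER_DAY) -> int:
    """Return the number of whole buckets between the Unix epoch and ts.

    Hypothetical helper: assumes the 'partition' column stores this count.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    return int(ts.timestamp()) // bucket_seconds


def buckets_for_range(start: datetime, end: datetime,
                      bucket_seconds: int = SECONDS_PER_DAY) -> List[int]:
    """List every bucket that the time range [start, end] touches.

    A range query that spans several buckets needs one query per bucket,
    because 'partition' is part of the partition key.
    """
    first = partition_bucket(start, bucket_seconds)
    last = partition_bucket(end, bucket_seconds)
    return list(range(first, last + 1))
```

A range that crosses midnight UTC would therefore map to two daily buckets, and the client would issue one query per bucket and merge the results.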

Sometimes, however, I will need to do queries with
pk_part1/partition/message_id range and some combination of
secondary_key1 (95% of the time there is a one-to-one relationship
with pk_part1) or secondary_key2 (for each secondary_key2 there will
be many pk_part2 values).

In this time series scenario, to make efficient use of
secondary_key1/secondary_key2 as Cassandra secondary indexes for these
queries, I assume that secondary_key1/secondary_key2 would really need
to be composites combined into one column (in SQL I would create
multi-column indexes)? i.e.:

secondary_key1 -> pk_part1 + partition + real_secondary_key1
secondary_key2 -> pk_part2 + partition + real_secondary_key2

Would this be correct? Just making sure I understand how to best use
secondary indexes in Cassandra with time series data.

thanks in advance,
Gareth
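
The composite idea in the question above can be sketched in Python as follows. This is only an illustration of packing the three parts into one indexed text value; the separator character and the function names are assumptions for this example, not anything prescribed by Cassandra or by the thread.

```python
from typing import Tuple

# Hypothetical separator (ASCII unit separator). The sketch assumes it
# never appears inside any of the parts, and checks that on encode.
SEP = "\x1f"


def composite_key(prefix: str, partition: int, secondary: str) -> str:
    """Pack (prefix, partition, real secondary key) into one text value."""
    parts = (prefix, str(partition), secondary)
    for part in parts:
        if SEP in part:
            raise ValueError("separator found inside a key part")
    return SEP.join(parts)


def split_composite_key(value: str) -> Tuple[str, int, str]:
    """Unpack a value produced by composite_key."""
    prefix, partition, secondary = value.split(SEP)
    return prefix, int(partition), secondary
```

With a value built this way, an equality lookup on the single indexed column already pins down the prefix and the time bucket, which is the effect a multi-column index would give in SQL.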


Re: Secondary Indexes On Partitioned Time Series Data Question

2013-08-01 Thread Robert Coli
On Thu, Aug 1, 2013 at 12:49 PM, Gareth Collins
gareth.o.coll...@gmail.com wrote:

 Would this be correct? Just making sure I understand how to best use
 secondary indexes in Cassandra with time series data.


In general unless you ABSOLUTELY NEED the one unique feature of built-in
Secondary Indexes (atomic update of base row and index) you should just use
a normal column family for secondary index cases.

=Rob


Re: Secondary Indexes On Partitioned Time Series Data Question

2013-08-01 Thread Shahab Yunus
Hi Robert,

Can you shed some more light on (or point to some other resource about)
why you think built-in Secondary Indexes should not be used lightly or
without much consideration? Thanks.

Regards,
Shahab


On Thu, Aug 1, 2013 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Aug 1, 2013 at 12:49 PM, Gareth Collins 
 gareth.o.coll...@gmail.com wrote:

 Would this be correct? Just making sure I understand how to best use
 secondary indexes in Cassandra with time series data.


 In general unless you ABSOLUTELY NEED the one unique feature of built-in
 Secondary Indexes (atomic update of base row and index) you should just use
 a normal column family for secondary index cases.

 =Rob



Re: Secondary Indexes On Partitioned Time Series Data Question

2013-08-01 Thread Robert Coli
On Thu, Aug 1, 2013 at 2:34 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Can you shed some more light on (or point to some other resource about)
 why you think built-in Secondary Indexes should not be used lightly or
 without much consideration? Thanks.


1) Secondary indexes are modeled more or less like a manual pseudo
Secondary Index CF would be.
2) Except they are more opaque than doing it yourself. For example, you
cannot see information on them in nodetool cfstats.
3) And there has been a steady trickle of bugs relating to their
implementation, in many cases resulting in them not returning the data they
should. [1]
4) These bugs would not apply to a manual pseudo Secondary Index CF.
5) And the only benefits you get are the marginal convenience of querying
the secondary index instead of a second CF, and atomic synchronized update.
6) Which most people do not actually need.

tl;dr : unless you need the atomic update property, just use a manual
pseudo secondary index CF

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4785 ,
https://issues.apache.org/jira/browse/CASSANDRA-5540 ,
https://issues.apache.org/jira/browse/CASSANDRA-2897 , etc.
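
To make the "manual pseudo secondary index CF" idea concrete, here is a small in-memory Python model. It is a sketch under stated assumptions, not Cassandra driver code: the class name ManualIndexStore, the index layout keyed by (pk_part1, partition, secondary_key1), and the use of plain integers in place of timeuuid message ids are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BaseKey = Tuple[str, int, str]    # (pk_part1, partition, pk_part2)
IndexKey = Tuple[str, int, str]   # (pk_part1, partition, secondary_key1)


class ManualIndexStore:
    """In-memory stand-in for a base CF plus a hand-maintained index CF."""

    def __init__(self) -> None:
        # base "CF": partition key -> {message_id: row}
        self.base: Dict[BaseKey, Dict[int, dict]] = defaultdict(dict)
        # index "CF": index partition key -> {message_id: pk_part2}
        self.index: Dict[IndexKey, Dict[int, str]] = defaultdict(dict)

    def write(self, pk_part1: str, partition: int, pk_part2: str,
              message_id: int, secondary_key1: str, payload: dict) -> None:
        # Write 1: the base row.
        row = {"secondary_key1": secondary_key1, **payload}
        self.base[(pk_part1, partition, pk_part2)][message_id] = row
        # Write 2: the index row. This is a separate write, so unlike a
        # built-in secondary index it is NOT atomic with write 1.
        self.index[(pk_part1, partition, secondary_key1)][message_id] = pk_part2

    def find_by_secondary_key1(self, pk_part1: str, partition: int,
                               secondary_key1: str,
                               start_id: int, end_id: int) -> List[dict]:
        """Read the index partition, then fetch each matching base row."""
        hits = self.index.get((pk_part1, partition, secondary_key1), {})
        rows = []
        for message_id in sorted(hits):
            if start_id <= message_id <= end_id:
                base_key = (pk_part1, partition, hits[message_id])
                row = self.base.get(base_key, {}).get(message_id)
                if row is not None:   # skip index entries with no base row
                    rows.append(row)
        return rows
```

The two separate writes in write() are the trade-off discussed above: the index is fully visible and under your control, but keeping it in step with the base data is the application's job.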


Re: Secondary Indexes On Partitioned Time Series Data Question

2013-08-01 Thread Shahab Yunus
Thanks a lot.

Regards,
Shahab


On Thu, Aug 1, 2013 at 8:32 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Aug 1, 2013 at 2:34 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Can you shed some more light on (or point to some other resource about)
 why you think built-in Secondary Indexes should not be used lightly or
 without much consideration? Thanks.


 1) Secondary indexes are modeled more or less like a manual pseudo
 Secondary Index CF would be.
 2) Except they are more opaque than doing it yourself. For example, you
 cannot see information on them in nodetool cfstats.
 3) And there has been a steady trickle of bugs relating to their
 implementation, in many cases resulting in them not returning the data they
 should. [1]
 4) These bugs would not apply to a manual pseudo Secondary Index CF.
 5) And the only benefits you get are the marginal convenience of querying
 the secondary index instead of a second CF, and atomic synchronized update.
 6) Which most people do not actually need.

 tl;dr : unless you need the atomic update property, just use a manual
 pseudo secondary index CF

 =Rob

 [1] https://issues.apache.org/jira/browse/CASSANDRA-4785 ,
 https://issues.apache.org/jira/browse/CASSANDRA-5540 ,
 https://issues.apache.org/jira/browse/CASSANDRA-2897 , etc.