Re: Secondary Indexes On Partitioned Time Series Data Question
OK, thanks for the information.

Gareth
Secondary Indexes On Partitioned Time Series Data Question
Hello,

Say I have time series data for a table like this:

    CREATE TABLE mytimeseries (
        pk_part1 text,
        partition bigint,       -- e.g. partition per day or per hour
        pk_part2 text,          -- this is part of the partition key so I can split write load
        message_id timeuuid,
        secondary_key1 text,
        secondary_key2 text,
        . more columns .
        PRIMARY KEY ((pk_part1, partition, pk_part2), message_id));

Most of the time I will need to do queries with pk_part1/partition/pk_part2/message_id range. So this is what I optimize for.

Sometimes, however, I will need to do queries with pk_part1/partition/message_id range and some combination of secondary_key1 (95% of the time there is a one-to-one relationship with pk_part1) or secondary_key2 (for each secondary_key2 there will be many pk_part2 values).

In this time series scenario, to efficiently make use of secondary_key1/secondary_key2 as Cassandra secondary indexes for these queries, I assume that secondary_key1/secondary_key2 would really need to be composites combined into one column (in SQL I would create multi-column indexes)? i.e.:

    secondary_key_1 -> pk_part1 + partition_key + real_secondary_key_1
    secondary_key_2 -> pk_part2 + partition_key + real_secondary_key_2

Would this be correct? Just making sure I understand how to best use secondary indexes in Cassandra with time series data.

thanks in advance,
Gareth
Re: Secondary Indexes On Partitioned Time Series Data Question
On Thu, Aug 1, 2013 at 12:49 PM, Gareth Collins gareth.o.coll...@gmail.com wrote:

> Would this be correct? Just making sure I understand how to best use
> secondary indexes in Cassandra with time series data.

In general, unless you ABSOLUTELY NEED the one unique feature of built-in Secondary Indexes (atomic update of base row and index), you should just use a normal column family for secondary index cases.

=Rob
Re: Secondary Indexes On Partitioned Time Series Data Question
Hi Robert,

Can you shed some more light (or point towards some other resource) on why you think built-in Secondary Indexes should not be used easily or without much consideration? Thanks.

Regards,
Shahab
Re: Secondary Indexes On Partitioned Time Series Data Question
On Thu, Aug 1, 2013 at 2:34 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

> Can you shed some more light (or point towards some other resource) on
> why you think built-in Secondary Indexes should not be used easily or
> without much consideration? Thanks.

1) Secondary indexes are more or less modeled like a manual pseudo Secondary Index CF would be.
2) Except they are more opaque than doing it yourself. For example, you cannot see information on them in nodetool cfstats.
3) And there has been a steady trickle of bugs relating to their implementation, in many cases resulting in them not returning the data they should. [1]
4) These bugs would not apply to a manual pseudo Secondary Index CF.
5) And the only benefits you get are the marginal convenience of querying the secondary index instead of a second CF, and atomic synchronized update.
6) Which most people do not actually need.

tl;dr: unless you need the atomic update property, just use a manual pseudo secondary index CF.

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4785 ,
    https://issues.apache.org/jira/browse/CASSANDRA-5540 ,
    https://issues.apache.org/jira/browse/CASSANDRA-2897 , etc.
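
For readers following along, here is a minimal sketch of what a "manual pseudo secondary index CF" might look like for the mytimeseries table from the original question. Nothing below was posted in the thread: the table name mytimeseries_by_key1, its key layout, and the literal values in the query are assumptions chosen for illustration. The idea is that the application writes one row here for every row it writes to the base table, keyed the way the occasional query filters.

```cql
-- Hypothetical lookup table maintained by the application (an assumption,
-- not a schema proposed in the thread). It is partitioned by the columns
-- the occasional query filters on: pk_part1 + partition + secondary_key1.
CREATE TABLE mytimeseries_by_key1 (
    pk_part1       text,
    partition      bigint,     -- same time bucket as the base table
    secondary_key1 text,
    message_id     timeuuid,
    pk_part2       text,       -- points back to the base row's partition
    PRIMARY KEY ((pk_part1, partition, secondary_key1), message_id, pk_part2)
);

-- Range query over message_id within one bucket for one secondary_key1 value.
-- All literal values are placeholders.
SELECT message_id, pk_part2
FROM mytimeseries_by_key1
WHERE pk_part1 = 'sensor-a'
  AND partition = 15918
  AND secondary_key1 = 'key-1'
  AND message_id >= minTimeuuid('2013-08-01 00:00+0000')
  AND message_id <  minTimeuuid('2013-08-02 00:00+0000');
```

Each result gives the pk_part2 and message_id needed to read the full row from mytimeseries. The write to this table and the write to the base table are separate operations, so they are not updated atomically by default; that is the trade-off against built-in indexes described in the list above. A similar table keyed on secondary_key2 would cover the other lookup.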
Re: Secondary Indexes On Partitioned Time Series Data Question
Thanks a lot.

Regards,
Shahab