Re: Getting the most-recent version from time-series data

Tupshin Harper Tue, 25 Feb 2014 19:37:23 -0800

I failed to address the matter of not knowing the families in advance.

I can't really recommend any solution to that other than storing the list
of families in another structure that is readily queryable. I don't know
how many families you have in mind, but if it is in the millions or more,
you might consider constructing another table such as:
CREATE TABLE families (
  key int,
  family text,
  PRIMARY KEY (key, family)
);

Store your families there, with a knowable set of keys (I suggest something
like the last 3 digits of the md5 hash of the family). You could then
retrieve your families in nicely sized batches, one key at a time:
SELECT family FROM families WHERE key=0;
and then do the fan-out selects that I described previously.
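
To make the bucketing and the fan-out concrete, here is a minimal Python
sketch. It makes several assumptions that are not part of the suggestion
above: "digits" is read as hex digits of the md5 digest (giving 4096 possible
keys), and family_bucket, all_buckets, latest_versions, and the run_query
callable are hypothetical names. run_query stands in for whatever your
driver uses to execute one statement, so the sketch runs without a cluster.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

HEX_DIGITS = 3  # assumption: "last 3 digits" means the last 3 hex digits


def family_bucket(family):
    """Map a family name to an int key for the families table."""
    digest = hashlib.md5(family.encode("utf-8")).hexdigest()
    return int(digest[-HEX_DIGITS:], 16)


def all_buckets():
    """Every possible key, so the families table can be scanned in batches."""
    return range(16 ** HEX_DIGITS)


def latest_versions(run_query, key, families, n=1, max_workers=8):
    """Issue one query per family in parallel and collect the results.

    run_query(cql, params) is a hypothetical callable that executes a single
    statement and returns its rows; it is not a real driver API.
    """
    cql = ("SELECT * FROM time_series_stuff "
           "WHERE key=%s AND family=%s LIMIT %s")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            family: pool.submit(run_query, cql, (key, family, n))
            for family in families
        }
        # result() blocks until that family's query has finished
        return {family: fut.result() for family, fut in futures.items()}
```

On insert you would write (family_bucket(name), name) into the families
table. On read you would walk all_buckets(), select the families for each
key, and pass them to latest_versions. With a real token-aware driver you
would most likely replace the thread pool with the driver's own asynchronous
execution and a prepared statement.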

-Tupshin


On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tups...@tupshin.com> wrote:

> Hi Clint,
>
> What you are describing could actually be accomplished with the Thrift API
> and a multiget_slice with a slicerange having a count of 1. Initially I was
> thinking that this was an important feature gap between Thrift and CQL, and
> was going to suggest that it should be implemented (possible syntax is in
> https://issues.apache.org/jira/browse/CASSANDRA-6167 which is almost a
> superset of this feature).
>
> But then I was convinced by some colleagues that, with a modern CQL driver
> that is token aware, you are actually better off (in terms of latency,
> throughput, and reliability) doing each query separately on the client.
>
> The reasoning is that if you did this with a single query, it would
> necessarily be sent to a coordinator that wouldn't own most of the data
> that you are looking for. That coordinator would then need to fan out the
> read to all the nodes owning the partitions you are looking for.
>
> Far better to just do it directly on the client. The token aware client
> will send each request for a row straight to a node that owns it. With a
> separate connection open to each node, this is done in parallel from the
> get-go. Fewer hops. Less load on the coordinator. No bottlenecks. And with
> a prepared statement, very little additional overhead to the client,
> server, or network.
>
> -Tupshin
>
>
> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Let's say that I have a table that looks like the following:
>>
>> CREATE TABLE time_series_stuff (
>>   key text,
>>   family text,
>>   version int,
>>   val text,
>>   PRIMARY KEY (key, family, version)
>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>>   bloom_filter_fp_chance=0.010000 AND
>>   caching='KEYS_ONLY' AND
>>   comment='' AND
>>   dclocal_read_repair_chance=0.000000 AND
>>   gc_grace_seconds=864000 AND
>>   index_interval=128 AND
>>   read_repair_chance=0.100000 AND
>>   replicate_on_write='true' AND
>>   populate_io_cache_on_flush='false' AND
>>   default_time_to_live=0 AND
>>   speculative_retry='99.0PERCENTILE' AND
>>   memtable_flush_period_in_ms=0 AND
>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>   compression={'sstable_compression': 'LZ4Compressor'};
>>
>> cqlsh:fiddle> select * from time_series_stuff ;
>>
>>  key    | family  | version | val
>> --------+---------+---------+--------
>>  monday | revenue |       3 | $$$$$$
>>  monday | revenue |       2 |    $$$
>>  monday | revenue |       1 |     $$
>>  monday | revenue |       0 |      $
>>  monday | traffic |       2 | medium
>>  monday | traffic |       1 |  light
>>  monday | traffic |       0 |  heavy
>>
>> (7 rows)
>>
>> Now let's say that I'd like to perform a query that gets me the most
>> recent N versions of "revenue" and "traffic."
>>
>> Is there a CQL query to do this?  Let's say that N=1.  Then I know that I
>> can do:
>>
>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>> family='revenue' limit 1;
>>
>>  key    | family  | version | val
>> --------+---------+---------+--------
>>  monday | revenue |       3 | $$$$$$
>>
>> (1 rows)
>>
>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>> family='traffic' limit 1;
>>
>>  key    | family  | version | val
>> --------+---------+---------+--------
>>  monday | traffic |       2 | medium
>>
>> (1 rows)
>>
>> But what if I have lots of "families" and I want to get the most recent N
>> versions of all of them in a single CQL statement?  Is that possible?
>> Unfortunately I am working on something where the family names and the
>> number of most-recent versions are not known a priori (I am porting some
>> code that was designed for HBase).
>>
>> Best regards,
>> Clint
>>
>
>
