And one last clarification. Where I said "stored procedure" earlier, I meant "prepared statement". Sorry for the confusion. Too much typing while tired.
-Tupshin

On Tue, Feb 25, 2014 at 10:36 PM, Tupshin Harper <tups...@tupshin.com> wrote:

> I failed to address the matter of not knowing the families in advance.
>
> I can't really recommend any solution to that other than storing the list
> of families in another structure that is readily queryable. I don't know
> how many families you are thinking of, but if it is in the millions or
> more, you might consider constructing another table such as:
>
> CREATE TABLE families (
>   key int,
>   family text,
>   PRIMARY KEY (key, family)
> );
>
> Store your families there, with a knowable set of keys (I suggest
> something like the last 3 digits of the md5 hash of the family). Then
> you could retrieve your families in nice-sized batches:
>
> SELECT family FROM families WHERE key=0;
>
> and then do the fan-out selects that I described previously.
>
> -Tupshin
>
> On Tue, Feb 25, 2014 at 10:15 PM, Tupshin Harper <tups...@tupshin.com> wrote:
>
>> Hi Clint,
>>
>> What you are describing could actually be accomplished with the Thrift
>> API and a multiget_slice with a slicerange having a count of 1. Initially I
>> was thinking that this was an important feature gap between Thrift and CQL,
>> and was going to suggest that it should be implemented (possible syntax is
>> in https://issues.apache.org/jira/browse/CASSANDRA-6167 which is almost
>> a superset of this feature).
>>
>> But then I was convinced by some colleagues that, with a modern CQL
>> driver that is token aware, you are actually better off (in terms of
>> latency, throughput, and reliability) doing each query separately on
>> the client.
>>
>> The reasoning is that if you did this with a single query, it would
>> necessarily be sent to a coordinator that wouldn't own most of the data
>> that you are looking for. That coordinator would then need to fan out the
>> read to all the nodes owning the partitions you are looking for.
>>
>> Far better to just do it directly on the client.
>> The token aware client will send each request for a row straight to a
>> node that owns it. With a separate connection open to each node, this is
>> done in parallel from the get-go. Fewer hops. Less load on the
>> coordinator. No bottlenecks. And with a stored procedure, very very
>> little additional overhead to the client, server, or network.
>>
>> -Tupshin
>>
>> On Tue, Feb 25, 2014 at 7:48 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Let's say that I have a table that looks like the following:
>>>
>>> CREATE TABLE time_series_stuff (
>>>   key text,
>>>   family text,
>>>   version int,
>>>   val text,
>>>   PRIMARY KEY (key, family, version)
>>> ) WITH CLUSTERING ORDER BY (family ASC, version DESC) AND
>>>   bloom_filter_fp_chance=0.010000 AND
>>>   caching='KEYS_ONLY' AND
>>>   comment='' AND
>>>   dclocal_read_repair_chance=0.000000 AND
>>>   gc_grace_seconds=864000 AND
>>>   index_interval=128 AND
>>>   read_repair_chance=0.100000 AND
>>>   replicate_on_write='true' AND
>>>   populate_io_cache_on_flush='false' AND
>>>   default_time_to_live=0 AND
>>>   speculative_retry='99.0PERCENTILE' AND
>>>   memtable_flush_period_in_ms=0 AND
>>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>>   compression={'sstable_compression': 'LZ4Compressor'};
>>>
>>> cqlsh:fiddle> select * from time_series_stuff ;
>>>
>>>  key    | family  | version | val
>>> --------+---------+---------+--------
>>>  monday | revenue |       3 | $$$$$$
>>>  monday | revenue |       2 | $$$
>>>  monday | revenue |       1 | $$
>>>  monday | revenue |       0 | $
>>>  monday | traffic |       2 | medium
>>>  monday | traffic |       1 | light
>>>  monday | traffic |       0 | heavy
>>>
>>> (7 rows)
>>>
>>> Now let's say that I'd like to perform a query that gets me the most
>>> recent N versions of "revenue" and "traffic."
>>>
>>> Is there a CQL query to do this? Let's say that N=1.
>>> Then I know that I can do:
>>>
>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>> family='revenue' limit 1;
>>>
>>>  key    | family  | version | val
>>> --------+---------+---------+--------
>>>  monday | revenue |       3 | $$$$$$
>>>
>>> (1 rows)
>>>
>>> cqlsh:fiddle> select * from time_series_stuff where key='monday' and
>>> family='traffic' limit 1;
>>>
>>>  key    | family  | version | val
>>> --------+---------+---------+--------
>>>  monday | traffic |       2 | medium
>>>
>>> (1 rows)
>>>
>>> But what if I have lots of "families" and I want to get the most recent
>>> N versions of all of them in a single CQL statement. Is that possible?
>>> Unfortunately I am working on something where the family names and the
>>> number of most-recent versions are not known a priori (I am porting some
>>> code that was designed for HBase).
>>>
>>> Best regards,
>>> Clint
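
For illustration, the bucketing and fan-out suggested in this thread might be sketched as follows. This is a minimal sketch under stated assumptions, not a tested implementation: it assumes "the last 3 digits of the md5 hash" means the last three hexadecimal digits (giving 4096 buckets), and the names family_bucket, fan_out_params, and the three statement constants are hypothetical. The CQL strings are meant to be prepared once with whatever driver you use; the sketch deliberately avoids any driver API, and it assumes your Cassandra version accepts a bind marker in LIMIT.

```python
import hashlib

# Assumption: "last 3 digits of the md5 hash" = last three hex digits.
NUM_BUCKETS = 16 ** 3  # 4096 possible bucket keys, 0..4095


def family_bucket(family: str) -> int:
    """Map a family name to its integer bucket key in the families table."""
    digest = hashlib.md5(family.encode("utf-8")).hexdigest()
    return int(digest[-3:], 16)


# CQL text intended to be prepared once per session and bound many times.
INSERT_FAMILY = "INSERT INTO families (key, family) VALUES (?, ?)"
SELECT_FAMILIES = "SELECT family FROM families WHERE key = ?"
# Assumption: the server accepts a bind marker for LIMIT.
SELECT_LATEST = (
    "SELECT * FROM time_series_stuff "
    "WHERE key = ? AND family = ? LIMIT ?"
)


def fan_out_params(row_key, families, n):
    """Build one bind-parameter tuple per family for SELECT_LATEST."""
    return [(row_key, family, n) for family in families]


if __name__ == "__main__":
    # Parameters for registering each family under its bucket.
    for fam in ("revenue", "traffic"):
        print(INSERT_FAMILY, (family_bucket(fam), fam))
    # Parameters for the per-family "latest N versions" reads.
    for params in fan_out_params("monday", ["revenue", "traffic"], 1):
        print(SELECT_LATEST, params)
```

Each tuple from fan_out_params would be bound to the prepared SELECT_LATEST statement and executed asynchronously, so that a token-aware driver can route every read straight to a replica that owns the partition.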