Re: How to maintain the N-most-recent versions of a value?

2014-07-18 Thread Benedict Elliott Smith
If the versions can be guaranteed to be adjacent (i.e. if the latest
version is V, the prior version is V-1), you could issue a delete at the
same time as an insert, for version V-N-(buffer), where buffer >= 0.
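
A minimal sketch of that idea, assuming versions are consecutive integers starting at 0 and the table is the `foo` table from the question. The helper name `write_with_trim` and the (statement, parameters) tuple shape are hypothetical; the function only builds CQL text and bind values, and how they are sent (for example as one batch) is left to the client.

```python
from typing import List, Tuple

# A CQL statement with "?" bind markers, plus the values to bind.
Statement = Tuple[str, tuple]


def write_with_trim(rowkey: str, family: str, qualifier: str,
                    version: int, value: bytes,
                    n: int, buffer: int = 0) -> List[Statement]:
    """Build the INSERT for `version` and, if applicable, the DELETE
    for version - n - buffer (the V-N-(buffer) rule)."""
    if n < 1 or buffer < 0:
        raise ValueError("n must be >= 1 and buffer must be >= 0")

    key = (rowkey, family, qualifier)
    statements: List[Statement] = [(
        "INSERT INTO foo (rowkey, family, qualifier, version, value) "
        "VALUES (?, ?, ?, ?, ?)",
        key + (version, value),
    )]

    expired = version - n - buffer
    if expired >= 0:  # assumption: the first version ever written is 0
        statements.append((
            "DELETE FROM foo WHERE rowkey = ? AND family = ? "
            "AND qualifier = ? AND version = ?",
            key + (expired,),
        ))
    return statements
```

With n = 3 and buffer = 0, writing version 10 also deletes version 7, leaving versions 8, 9 and 10. This only holds if no version number is ever skipped, which is the hard part.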

In general guaranteeing that is probably hard, so this seems like something
that would be nice to have C* manage for you. Unfortunately we don't have
anything on the roadmap to help with this. A custom compaction strategy
might do the trick, or permitting some filter during compaction that can
omit/tombstone certain records based on the input data. This latter option
probably wouldn't be too hard to implement, although it might not offer any
guarantees about expiring records in order without incurring extra
compaction cost (you could reasonably easily guarantee the most recent N
are present, but the cleaning up of older records might happen haphazardly,
in no particular order, and without any promptness guarantees, if you want
to do it cheaply). Feel free to file a ticket, or submit a patch!
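
To make the compaction-filter idea concrete, here is a toy model in Python. It is not Cassandra's compaction API (no such hook exists, as noted above); `keep_most_recent` and the record shape are assumptions made purely for illustration. It filters only the records that happen to be in one compaction.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]        # (rowkey, family, qualifier)
Record = Tuple[Key, int, bytes]   # (key, version, value)


def keep_most_recent(records: Iterable[Record], n: int) -> List[Record]:
    """Keep, per key, the n highest-versioned records among those seen
    in this compaction; everything else is dropped."""
    by_key: Dict[Key, List[Record]] = defaultdict(list)
    for record in records:
        by_key[record[0]].append(record)

    kept: List[Record] = []
    for key in sorted(by_key):
        newest_first = sorted(by_key[key], key=lambda r: r[1], reverse=True)
        kept.extend(newest_first[:n])
    return kept
```

A record that is among the N most recent overall is also among the N most recent of any subset that contains it, so this filter never drops one of them. An older record, however, survives until it is compacted together with at least N newer ones, which is why the cleanup has no ordering or promptness guarantee.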


On Fri, Jul 18, 2014 at 1:32 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi everyone,

 I am trying to design a schema that will keep the N-most-recent
 versions of a value.  Currently my table looks like the following:

 CREATE TABLE foo (
 rowkey text,
 family text,
 qualifier text,
 version long,
 value blob,
 PRIMARY KEY (rowkey, family, qualifier, version))
 WITH CLUSTER ORDER BY (rowkey ASC, family ASC, qualifier ASC, version
 DESC));

 Is there any standard design pattern for updating such a layout such
 that I keep the N-most-recent (version, value) pairs for every unique
 (rowkey, family, qualifier)?  I can't think of any way to do this
 without doing a read-modify-write.  The best thing I can think of is
 to use TTL to approximate the desired behavior (which will work if I
 know how often we are writing new data to the table).  I could also
 use LIMIT N in my queries to limit myself to only N items, but that
 does not address any of the storage-size issues.

 In case anyone is curious, this question is related to some work that
 I am doing translating a system built on HBase (which provides this
 keep the N-most-recent-version-of-a-cell behavior) to Cassandra
 while providing the user with as-similar-as-possible an interface.

 Best regards,
 Clint



Re: How to maintain the N-most-recent versions of a value?

2014-07-18 Thread Laing, Michael
The cql you provided is invalid. You probably meant something like:

CREATE TABLE foo (
 rowkey text,
 family text,
 qualifier text,
 version int,
 value blob,
 PRIMARY KEY ((rowkey, family, qualifier), version))
 WITH CLUSTERING ORDER BY (version DESC);


We use TTLs and LIMIT for structures like these, paying attention to the
construction of the partition key so that partition sizes are reasonable.
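
A rough sketch of the TTL-plus-LIMIT approach, assuming writes to a given (rowkey, family, qualifier) arrive at a roughly known interval. The helper `ttl_seconds` and the constant names are hypothetical; the statements target the corrected `foo` table above and use ordinary CQL bind markers.

```python
import math


def ttl_seconds(n: int, write_interval_seconds: float) -> int:
    """TTL after which about n newer versions should have been written,
    assuming one write every `write_interval_seconds`."""
    if n < 1 or write_interval_seconds <= 0:
        raise ValueError("n must be >= 1 and the interval must be positive")
    return math.ceil(n * write_interval_seconds)


# Each version is written with the TTL computed above.
INSERT_WITH_TTL = (
    "INSERT INTO foo (rowkey, family, qualifier, version, value) "
    "VALUES (?, ?, ?, ?, ?) USING TTL ?"
)

# Rows come back newest first because of CLUSTERING ORDER BY (version DESC),
# so LIMIT n caps the read at the n most recent versions.
SELECT_LATEST = (
    "SELECT version, value FROM foo "
    "WHERE rowkey = ? AND family = ? AND qualifier = ? LIMIT ?"
)
```

For example, keeping about 5 versions of a value that is rewritten once a minute gives a TTL of 300 seconds. If writes slow down, fewer than N versions stay live; if they speed up, more do, and LIMIT hides the surplus from readers until it expires.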

If the blob might be large, store it somewhere else. We use S3 but you
could also put it in another C* table.

In 2.1 the row cache may help as it will store N rows per recently accessed
partition, starting at the beginning of the partition.

ml


Re: How to maintain the N-most-recent versions of a value?

2014-07-18 Thread Paulo Ricardo Motta Gomes
You might be interested in the following ticket:
https://issues.apache.org/jira/browse/CASSANDRA-3929

There's a patch available that was not integrated because it's not possible
to guarantee exactly N values will be kept, and there are some other
problems with deletions, but it may be useful depending on your usage
characteristics.


-- 
Paulo Motta

Chaordic | Platform
http://www.chaordic.com.br/
+55 48 3232.3200


How to maintain the N-most-recent versions of a value?

2014-07-17 Thread Clint Kelly
Hi everyone,

I am trying to design a schema that will keep the N-most-recent
versions of a value.  Currently my table looks like the following:

CREATE TABLE foo (
rowkey text,
family text,
qualifier text,
version long,
value blob,
PRIMARY KEY (rowkey, family, qualifier, version))
WITH CLUSTER ORDER BY (rowkey ASC, family ASC, qualifier ASC, version DESC));

Is there any standard design pattern for updating such a layout such
that I keep the N-most-recent (version, value) pairs for every unique
(rowkey, family, qualifier)?  I can't think of any way to do this
without doing a read-modify-write.  The best thing I can think of is
to use TTL to approximate the desired behavior (which will work if I
know how often we are writing new data to the table).  I could also
use LIMIT N in my queries to limit myself to only N items, but that
does not address any of the storage-size issues.

In case anyone is curious, this question is related to some work that
I am doing translating a system built on HBase (which provides this
keep the N-most-recent-version-of-a-cell behavior) to Cassandra
while providing the user with as-similar-as-possible an interface.

Best regards,
Clint


Re: How to maintain the N-most-recent versions of a value?

2014-07-17 Thread Chris Lohfink
I would say that would work, but since you are already familiar with the
storage model from HBase and are trying to emulate it, you may want to look
into the Thrift interface. It is a little more similar to the HBase interface
(though not as friendly to use, and you can't use the very useful new client
libraries from DataStax) and accesses storage more directly, which is similar
to HBase's. You have your column family foo, then just use a composite column
to store family, qualifier, and version in the column name, with the value of
the column being the value. The row key is your row key.
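
To picture that layout, here is a small in-memory model in Python. It is only an illustration of the shape of the data, not the Thrift API: the type aliases and `latest_versions` are hypothetical names, and a plain dict stands in for one wide row.

```python
from typing import Dict, List, Tuple

# Composite column name: (family, qualifier, version).
ColumnName = Tuple[str, str, int]
# One wide row: composite column name -> column value.
Row = Dict[ColumnName, bytes]


def latest_versions(row: Row, family: str, qualifier: str,
                    n: int) -> List[Tuple[int, bytes]]:
    """Return up to n (version, value) pairs for one cell, newest first."""
    matching = [
        (name[2], value)
        for name, value in row.items()
        if name[0] == family and name[1] == qualifier
    ]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return matching[:n]
```

The column family would then be a mapping from row key to such a row, which is close to how HBase presents a cell's versions.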

---
Chris Lohfink




Re: How to maintain the N-most-recent versions of a value?

2014-07-17 Thread DuyHai Doan
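In C* 2.1, the new row cache implementation can keep the first N rows of
each cached partition in memory (with a descending clustering order on
version, those are the N most recent versions). It might be of interest to you: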
http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1

