Re: How to maintain the N-most-recent versions of a value?
If the versions can be guaranteed to be adjacent (i.e. if the latest version is V, the prior version is V-1), you could issue a delete at the same time as each insert, targeting version V-N-(buffer), where buffer >= 0.

In general, guaranteeing that is probably hard, so this seems like something that would be nice to have C* manage for you. Unfortunately we don't have anything on the roadmap to help with this. A custom compaction strategy might do the trick, or permitting some filter during compaction that can omit/tombstone certain records based on the input data. This latter option probably wouldn't be too hard to implement, although it might not offer any guarantees about expiring records in order without incurring extra compaction cost: you could reasonably easily guarantee the most recent N are present, but if you want to do it cheaply, the cleanup of older records might happen haphazardly, in no particular order, and without any promptness guarantees.

Feel free to file a ticket, or submit a patch!

On Fri, Jul 18, 2014 at 1:32 AM, Clint Kelly clint.ke...@gmail.com wrote:
Hi everyone, I am trying to design a schema that will keep the N-most-recent versions of a value. [...]
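The delete-alongside-insert idea from the reply above can be sketched in CQL as follows. This is only an illustration under stated assumptions: a valid variant of the table keyed by ((rowkey, family, qualifier), version), N = 3, buffer = 0, strictly consecutive integer versions, and made-up key values.

```cql
-- Assumed table: one partition per (rowkey, family, qualifier), newest version first.
CREATE TABLE IF NOT EXISTS foo (
    rowkey    text,
    family    text,
    qualifier text,
    version   int,
    value     blob,
    PRIMARY KEY ((rowkey, family, qualifier), version)
) WITH CLUSTERING ORDER BY (version DESC);

-- Writing version V = 10 with N = 3 and buffer = 0 deletes version
-- V - N - buffer = 10 - 3 - 0 = 7, so versions 8, 9 and 10 remain.
-- Both statements target the same partition.
BEGIN BATCH
  INSERT INTO foo (rowkey, family, qualifier, version, value)
  VALUES ('row1', 'fam', 'qual', 10, 0xCAFE);
  DELETE FROM foo
  WHERE rowkey = 'row1' AND family = 'fam' AND qualifier = 'qual' AND version = 7;
APPLY BATCH;
```

Note that each delete writes a tombstone, so the space is only reclaimed once compaction purges it, and a skipped or non-consecutive version would leave an older row behind; a buffer greater than zero trades extra retained versions for some slack.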
Re: How to maintain the N-most-recent versions of a value?
The CQL you provided is invalid. You probably meant something like:

    CREATE TABLE foo (
        rowkey text,
        family text,
        qualifier text,
        version int,
        value blob,
        PRIMARY KEY ((rowkey, family, qualifier), version)
    ) WITH CLUSTERING ORDER BY (version DESC);

We use TTLs and LIMIT for structures like these, paying attention to the construction of the partition key so that partition sizes are reasonable.

If the blob might be large, store it somewhere else. We use S3, but you could also put it in another C* table.

In 2.1 the row cache may help, as it will store N rows per recently accessed partition, starting at the beginning of the partition.

ml

On Fri, Jul 18, 2014 at 6:30 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote:
If the versions can be guaranteed to be adjacent (i.e. if the latest version is V, the prior version is V-1) you could issue a delete at the same time as an insert for V-N-(buffer) [...]
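A minimal sketch of the TTL-plus-LIMIT approach, run against the corrected table definition given in the message above. The write rate (about one new version per hour), N = 5, and the key values are assumptions chosen purely for illustration.

```cql
-- With roughly one write per hour and N = 5, a TTL of 5 * 3600 = 18000 seconds
-- keeps approximately the five newest versions alive.
INSERT INTO foo (rowkey, family, qualifier, version, value)
VALUES ('row1', 'fam', 'qual', 42, 0xCAFE)
USING TTL 18000;

-- Read at most N versions; version DESC clustering returns the newest first.
SELECT version, value
FROM foo
WHERE rowkey = 'row1' AND family = 'fam' AND qualifier = 'qual'
LIMIT 5;
```

The TTL only approximates "keep N": if writes arrive faster than assumed, more than N versions stay on disk (LIMIT still caps what is read), and if writes stop for longer than the TTL, every version expires.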
Re: How to maintain the N-most-recent versions of a value?
You might be interested in the following ticket: https://issues.apache.org/jira/browse/CASSANDRA-3929

There's a patch available that was not integrated because it's not possible to guarantee that exactly N values will be kept, and there are some other problems with deletions, but it may be useful depending on your usage characteristics.

On Fri, Jul 18, 2014 at 7:58 AM, Laing, Michael michael.la...@nytimes.com wrote:
The CQL you provided is invalid. You probably meant something like: [...]

--
Paulo Motta
Chaordic | Platform
http://www.chaordic.com.br/
+55 48 3232.3200
How to maintain the N-most-recent versions of a value?
Hi everyone,

I am trying to design a schema that will keep the N-most-recent versions of a value. Currently my table looks like the following:

    CREATE TABLE foo (
        rowkey text,
        family text,
        qualifier text,
        version long,
        value blob,
        PRIMARY KEY (rowkey, family, qualifier, version))
    WITH CLUSTER ORDER BY (rowkey ASC, family ASC, qualifier ASC, version DESC));

Is there any standard design pattern for updating such a layout such that I keep the N-most-recent (version, value) pairs for every unique (rowkey, family, qualifier)? I can't think of any way to do this without doing a read-modify-write. The best thing I can think of is to use TTL to approximate the desired behavior (which will work if I know how often we are writing new data to the table). I could also use LIMIT N in my queries to limit myself to only N items, but that does not address any of the storage-size issues.

In case anyone is curious, this question is related to some work that I am doing translating a system built on HBase (which provides this "keep the N most recent versions of a cell" behavior) to Cassandra while providing the user with an interface that is as similar as possible.

Best regards,
Clint
Re: How to maintain the N-most-recent versions of a value?
I would say that would work, but since you are already familiar with the storage model from HBase and are trying to emulate it, you may want to look into the Thrift interfaces. They are a little more similar to the HBase interface (though not as friendly to use, and you can't use the very useful new client libraries from DataStax) and access storage more directly, which is similar to HBase's.

You have your column family foo, then just use a composite column to store family, qualifier, and version in the column name, with the value of the column being the value. The row key is your row key.

---
Chris Lohfink

On Jul 17, 2014, at 6:32 PM, Clint Kelly clint.ke...@gmail.com wrote:
Hi everyone, I am trying to design a schema that will keep the N-most-recent versions of a value. [...]
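For readers who know CQL better than Thrift, the composite-column layout described in the reply above corresponds roughly to a compact-storage table. The sketch below assumes a C* 2.x-era cluster; the table name foo_compact is hypothetical, and bigint stands in for the "long" of the original post, which is not a CQL type.

```cql
-- rowkey is the Thrift row key; (family, qualifier, version) form the composite
-- column name; value is the column value. Later releases moved away from Thrift.
CREATE TABLE foo_compact (
    rowkey    text,
    family    text,
    qualifier text,
    version   bigint,
    value     blob,
    PRIMARY KEY (rowkey, family, qualifier, version)
) WITH COMPACT STORAGE
  AND CLUSTERING ORDER BY (family ASC, qualifier ASC, version DESC);
```

In this layout all families and qualifiers of a row share one partition, so partition size grows with the number of cells per row as well as with the number of versions kept.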
Re: How to maintain the N-most-recent versions of a value?
In C* 2.1, the new row cache implementation can keep the first N rows of each cached partition in memory, which with version DESC clustering are the N most recent versions. It might be of interest to you: http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1

On Fri, Jul 18, 2014 at 3:39 AM, Chris Lohfink clohf...@blackbirdit.com wrote:
I would say that would work, but since already familiar with storage model from hbase and trying to emulate it may want to look into thrift interfaces. [...]
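As a rough illustration of the row-cache suggestion above, the per-table setting in 2.1 looks something like the following. The value 5 is an assumed N, the table is assumed to cluster on version DESC, and the exact option syntax may differ between versions, so check the documentation for your release.

```cql
-- Cache the first 5 rows of each partition; with version DESC clustering
-- those are the 5 newest versions.
ALTER TABLE foo
WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : '5' };
```

The row cache also has to be given a non-zero size in cassandra.yaml (row_cache_size_in_mb) before this has any effect, and it only speeds up reads; it does nothing to reduce the number of versions stored on disk.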