Re: Merging cells in compaction / compression?
Btw, I'm not trying to say that what you're asking for is a bad idea, or that it shouldn't or can't be done. If you're asking for a new feature, you should file a JIRA with all the details you provided above; just keep in mind it will be a while before it ends up in a stable version. The advice on this mailing list will usually gravitate towards solving your problem with the tools that are available today, as "wait a year or so" is usually unacceptable.

https://issues.apache.org/jira/browse/cassandra/

On Fri, Aug 5, 2016 at 8:10 AM Jonathan Haddad <j...@jonhaddad.com> wrote:
> [snip]
Re: Merging cells in compaction / compression?
From: Jonathan Haddad <j...@jonhaddad.com>
Sent: Friday, August 5, 2016 8:10 AM

I think Duy Hai was suggesting Spark Streaming, which gives you the tools to build exactly what you asked for: a custom compression system for packing batches of values for a partition into an optimized byte array.

On Fri, Aug 5, 2016 at 7:46 AM Michael Burman <mibur...@redhat.com> wrote:
> [snip]
Re: Merging cells in compaction / compression?
From: Michael Burman <mibur...@redhat.com>
Sent: Friday, August 5, 2016 7:46 AM

Hi,

For storing time series data, disk usage is quite a significant factor - time series applications generate a lot of data (and of course the newest data is the most important). Given that even DateTiered compaction was designed with these peculiarities of time series data in mind, wouldn't it make sense to also improve the storage efficiency? One of the key improvements in Cassandra 3.x was the new storage engine - but it is still far from being efficient with time series data.

Efficient compression methods for both floating points & integers have a lot of research behind them and can be applied to time series data. I wish to apply these methods to improve storage efficiency - and performance*

* In my experience, storing blocks of data and decompressing them on the client side, instead of letting Cassandra read more rows, improves performance by several times. The query patterns for time series data usually request a range of data (instead of a single datapoint).

And I wasn't comparing Cassandra & Hadoop, but the combination of Spark+Cassandra+distributed-scheduler+other stuff vs. a Hadoop installation. At that point they are quite comparable in many cases, with the latter being easier to manage in the end. I don't want either for a simple time series storage solution, as I have no need for components other than data storage.

- Micke

- Original Message -
From: "Jonathan Haddad" <j...@jonhaddad.com>
To: user@cassandra.apache.org
Sent: Friday, August 5, 2016 5:22:58 PM
Subject: Re: Merging cells in compaction / compression?

[snip]
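The float-compression research alluded to above includes XOR-based schemes such as the one in Facebook's Gorilla paper. A minimal standalone sketch of the core idea (illustrative Python, not code from this thread or from Cassandra; a real implementation would additionally bit-pack the mostly-zero deltas):

```python
import struct

def xor_deltas(values):
    """Return the raw bits of the first double plus XOR deltas between
    consecutive doubles. Successive samples of a slowly changing metric
    differ in few bits, so the deltas are mostly zero and pack well."""
    bits = [struct.unpack('<Q', struct.pack('<d', v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

def undo_xor_deltas(deltas):
    """Invert xor_deltas losslessly."""
    bits = [deltas[0]]
    for d in deltas[1:]:
        bits.append(bits[-1] ^ d)
    return [struct.unpack('<d', struct.pack('<Q', b))[0] for b in bits]

series = [20.5, 20.5, 20.5, 20.75, 21.0]
deltas = xor_deltas(series)
assert undo_xor_deltas(deltas) == series  # lossless round trip
assert deltas[1] == 0                     # identical samples XOR to zero
```

Repeated or slowly drifting values yield long runs of zero bits in the deltas, which is exactly what a general-purpose byte compressor like LZ4 cannot exploit when the doubles are scattered across separate cells.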
Re: Merging cells in compaction / compression?
From: Jonathan Haddad <j...@jonhaddad.com>
Sent: Friday, August 5, 2016 5:22:58 PM

Hadoop and Cassandra have very different use cases. If the ability to write a custom compression system is the primary factor in how you choose your database, I suspect you may run into some trouble.

Jon

On Fri, Aug 5, 2016 at 6:14 AM Michael Burman <mibur...@redhat.com> wrote:
> [snip]
Re: Merging cells in compaction / compression?
From: Michael Burman <mibur...@redhat.com>
Sent: Friday, August 5, 2016 6:14 AM

Hi,

Spark is an example of something I really don't want: it's resource heavy, it involves copying data, and it means managing yet another distributed system. On top of that I would also need a distributed system just to schedule the Spark jobs.

That sounds like a nightmare to go through to implement a compression method. Might as well run Hadoop.

- Micke

- Original Message -
From: "DuyHai Doan" <doanduy...@gmail.com>
To: user@cassandra.apache.org
Sent: Thursday, August 4, 2016 11:26:09 PM
Subject: Re: Merging cells in compaction / compression?

[snip]
Re: Merging cells in compaction / compression?
From: DuyHai Doan <doanduy...@gmail.com>
Sent: Thursday, August 4, 2016 11:26:09 PM

Looks like you're asking for some sort of ETL on your C* data. Why not use Spark to compress those data into blobs, and a user-defined function to explode them when reading?

On Thu, Aug 4, 2016 at 10:08 PM, Michael Burman <mibur...@redhat.com> wrote:
> [snip]
Re: Merging cells in compaction / compression?
From: Michael Burman <mibur...@redhat.com>
Sent: Thursday, August 4, 2016 10:08 PM

Hi,

No, I don't want to lose precision (if that's what you meant by re-aggregating). But if you meant just storing the points in a larger bucket (which I could decompress either on the client side or the server side), then yes. To clarify, it could be like:

04082016T230215.1234, value
04082016T230225.4321, value
04082016T230235.2563, value
04082016T230245.1145, value
04082016T230255.0204, value

->

04082016T230200 -> blob (that has all the points for this minute stored - no data is lost to aggregated avgs or sums or anything).

That's acceptable. Of course the prettiest solution would be to keep this hidden from the client, so that decompression yields the original rows (like with the byte[] compressors), but the above is acceptable for my use case.

- Micke

- Original Message -
From: "Eric Stevens" <migh...@gmail.com>
To: user@cassandra.apache.org
Sent: Thursday, August 4, 2016 10:26:30 PM
Subject: Re: Merging cells in compaction / compression?

[snip]
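The minute-bucket blob described above can be built losslessly on the client side. A minimal sketch (illustrative Python; the function names and the millisecond wire layout are assumptions for the example, not anything from the thread or from a Cassandra driver):

```python
import struct

def pack_minute(points):
    """points: list of (epoch_millis, float) within one minute.
    Returns (bucket_key, blob): the minute boundary plus a blob holding a
    count, then per point a 2-byte offset-in-minute and the full double,
    so no precision is lost."""
    bucket = (points[0][0] // 60000) * 60000
    blob = struct.pack('<I', len(points))
    for ts, v in points:
        blob += struct.pack('<Hd', ts - bucket, v)
    return bucket, blob

def unpack_minute(bucket, blob):
    """Invert pack_minute, recovering the original (timestamp, value) rows."""
    (n,) = struct.unpack_from('<I', blob, 0)
    out, off = [], 4
    for _ in range(n):
        delta, v = struct.unpack_from('<Hd', blob, off)
        out.append((bucket + delta, v))
        off += struct.calcsize('<Hd')
    return out

pts = [(1470351735123, 1.1234), (1470351745432, 4.4321)]
bucket, blob = pack_minute(pts)
assert unpack_minute(bucket, blob) == pts  # no precision lost
```

One row per minute keyed by the bucket boundary replaces sixty rows per partition, which is the per-cell overhead the thread is trying to eliminate; a smarter codec could be substituted inside the blob without changing the schema.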
Re: Merging cells in compaction / compression?
From: Eric Stevens <migh...@gmail.com>
Sent: Thursday, August 4, 2016 10:26:30 PM

When you say merge cells, do you mean re-aggregating the data into coarser time buckets?

On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com> wrote:
> [snip]
Merging cells in compaction / compression?
From: Michael Burman <mibur...@redhat.com>
Sent: Thursday, August 4, 2016 5:59 AM

Hi,

Considering the following example structure:

CREATE TABLE data (
    metric text,
    value double,
    time timestamp,
    PRIMARY KEY ((metric), time)
) WITH CLUSTERING ORDER BY (time DESC)

The natural insert order is metric, value, timestamp tuples, for example one metric/value pair per second. That means creating more and more cells in the same partition, which creates a large amount of overhead and reduces the compression ratio of LZ4 & Deflate (LZ4 reaches ~0.26 and Deflate ~0.10 ratios in some of the examples I've run). Now, to improve the compression ratio, how could I merge the cells on the actual Cassandra node? I looked at ICompressor, but it provides only byte-level compression.

Could I do this in the compaction phase, by extending DateTieredCompaction for example? It has SSTableReader/Writer facilities and it seems to be able to see the rows. I'm fine with the fact that a repair run might have to do some conflict resolution, as the final merged rows would be quite "small" (50kB) in size. The naive approach is of course to fetch all the rows from Cassandra, merge them on the client, and send them back to Cassandra, but this seems very wasteful and has its own problems. Compared to table-level LZ4 I was able to reduce the required size to 1/20th (context-aware compression is sometimes just so much better), so there are real benefits to this approach, even if I would probably violate multiple design decisions.

One approach is of course to write to another storage first and, once the blocks are ready, write them to Cassandra. But that again seems idiotic (I know some people are using Kafka in front of Cassandra for example, but that means maintaining yet another distributed solution and defeats the benefit of Cassandra's easy management & scalability).

Has anyone done something similar? Even planned it? If I need to extend something in Cassandra I can accept that approach as well - but as I'm not that familiar with the Cassandra source code, I could use some hints.

- Micke
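The per-cell overhead argument can be illustrated with a standalone sketch: compressing the same per-second series row by row versus as one delta-encoded block. This is illustrative Python over synthetic data, not the benchmark behind the ~0.26/~0.10 ratios quoted above:

```python
import struct
import zlib

ts0 = 1470300000
# One hour of per-second samples with a slowly repeating value pattern.
rows = [(ts0 + i, 20.0 + (i % 5) * 0.25) for i in range(3600)]

# "Per-row": each (timestamp, value) serialized and compressed on its own,
# standing in for many small independently stored cells.
per_row = sum(len(zlib.compress(struct.pack('<qd', t, v))) for t, v in rows)

# "Block": timestamps delta-encoded (all deltas are 1 second),
# values packed contiguously, the whole buffer compressed once.
deltas = [rows[0][0]] + [b[0] - a[0] for a, b in zip(rows, rows[1:])]
block = struct.pack('<%dq' % len(deltas), *deltas) + \
        struct.pack('<%dd' % len(rows), *(v for _, v in rows))
as_block = len(zlib.compress(block))

assert as_block < per_row  # the context-aware block wins by a wide margin
```

The block variant wins because the regularity (constant timestamp deltas, recurring values) only becomes visible to the compressor once the points sit next to each other, which is the same effect the 1/20th figure above comes from.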