As a related example, a columnar database I know uses RLE compression on value streams because it's transparent: operations can be performed directly on the compressed representation. In many use cases this makes calculating statistical features of a compressed column faster than it would be on the uncompressed data.
Since we have unpredictable column offsets, column-specific lookups would benefit from similar transparency considerations.

Evan

On Sun, Jul 26, 2009 at 3:03 PM, Stu Hood<[email protected]> wrote:
>> blobs through LZW or similar. Aggregation operations benefit in
>> particular because you often never even need to decompress the
>> rows.
> This is an interesting consideration. Hopefully a suitably flexible
> implementation of pluggable codecs would be able to allow for 'keys only'
> compression transparently. Your aggregation example could be very optimized
> in this case.
>
> PS: When I envision SSTable compression, I'm definitely thinking of
> block-level compression, so each codec would need to implement seek() at a
> block level. The SSTable index would then point at the first compressed
> block containing a key.
>
>
> -----Original Message-----
> From: "Evan Weaver" <[email protected]>
> Sent: Sunday, July 26, 2009 5:23pm
> To: [email protected]
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> Re. Jonathan: I haven't run across a row-oriented use case where
> symbolizing merely the first 1000 column names seen would not work.
>
> Re. Stu: If generalized compression can cover this case, that should be
> fine...burn some CPU for a more straightforward implementation.
>
> However, it's often very useful in databases to have transparent
> compression (that is, operations can be performed on the data even in
> its compressed state). So I would advocate not merely passing the row
> blobs through LZW or similar. Aggregation operations benefit in
> particular because you often never even need to decompress the
> rows.
>
> This isn't relevant to current Cassandra, but could be a boon to
> in-database stored procedures and the like.
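Stu's PS describes a block-level scheme: data is written in independently compressed blocks, and a sparse index maps the first key of each block to its location, so a lookup decompresses exactly one block. A hedged sketch of that idea (the names `write_blocks`, `lookup`, and `BLOCK_SIZE` are invented for illustration; Cassandra's actual SSTable format differs):

```python
# Hypothetical sketch of block-level SSTable compression: each block is
# compressed independently, and a sparse index records the first key of
# every block. seek() happens at block granularity: find the block via
# the index, then decompress only that block.
import bisect
import zlib

BLOCK_SIZE = 4  # entries per block; unrealistically small, for illustration

def write_blocks(sorted_items):
    """Return (compressed_blocks, index) where index[i] is block i's first key."""
    blocks, index = [], []
    for i in range(0, len(sorted_items), BLOCK_SIZE):
        chunk = sorted_items[i:i + BLOCK_SIZE]
        index.append(chunk[0][0])
        payload = "\n".join(f"{k}\t{v}" for k, v in chunk).encode()
        blocks.append(zlib.compress(payload))
    return blocks, index

def lookup(blocks, index, key):
    # Last block whose first key is <= key is the only one that can hold it.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for line in zlib.decompress(blocks[i]).decode().splitlines():
        k, v = line.split("\t")
        if k == key:
            return v
    return None

items = sorted((f"key{n:03d}", f"val{n}") for n in range(10))
blocks, index = write_blocks(items)
assert lookup(blocks, index, "key007") == "val7"
assert lookup(blocks, index, "key999") is None
```

The index stays small (one entry per block rather than per key), and decompression cost per read is bounded by the block size rather than the row size.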
>
> Evan
>
> On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood<[email protected]> wrote:
>> Also, long term, I think it is safe to assume that we will be adding
>> compression for ColumnFamilies, which should have similar positive effects
>> on cache-ability without too much application-specific optimization.
>>
>>
>> -----Original Message-----
>> From: "Jonathan Ellis" <[email protected]>
>> Sent: Sunday, July 26, 2009 4:46pm
>> To: [email protected]
>> Subject: Re: Symbolizing column names for storage and cache efficiency
>>
>> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver<[email protected]> wrote:
>>> Would it be possible to add symbolized column names in a
>>> forward-compatible way? Maybe scoped per sstable, with the registries
>>> always kept in memory.
>>
>> Maybe. But it's not obvious to me how to do this in general.
>>
>> The problem is the sparse nature of the column set. We can't encode
>> _all_ the columns this way, or in the degenerate case we OOM just
>> trying to keep the mapping in memory. Similarly, we can't encode just
>> the top N column names, since figuring out the top N requires keeping
>> each name in memory during the counting process. (Besides slowing
>> down compaction -- instead of just deserializing columns where there
>> are keys in common in the merged fragments, we have to deserialize
>> all.)
>>
>> ISTM that all we can do is encode the _first_ N column names we see,
>> which may be useful if the column name set is small for a given CF.
>>
>> -Jonathan
>>
>>
>
>
> --
> Evan Weaver

--
Evan Weaver
