As a related example, a columnar database I know uses RLE compression on value streams because it's transparent: operations can be performed directly on the compressed representation. In many use cases this makes calculating statistical features of a compressed column faster than it would be on the uncompressed data.
Since we have unpredictable column offsets, column-specific lookups would benefit from similar transparency considerations.

Evan

On Sun, Jul 26, 2009 at 3:03 PM, Stu Hood<[email protected]> wrote:
>> blobs through LZW or similar. Aggregation operations benefit in
>> particular because you often never even need to decompress the
>> rows.
> This is an interesting consideration. Hopefully a suitably flexible
> implementation of pluggable codecs would be able to allow for 'keys only'
> compression transparently. Your aggregation example could be very optimized
> in this case.
>
> PS: When I envision SSTable compression, I'm definitely thinking of
> block-level compression, so each codec would need to implement seek() at a
> block level. The SSTable index would then point at the first compressed
> block containing a key.
>
>
> -----Original Message-----
> From: "Evan Weaver" <[email protected]>
> Sent: Sunday, July 26, 2009 5:23pm
> To: [email protected]
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> Re. Jonathan: I haven't run across a row-oriented use case where
> symbolizing merely the first 1000 column names seen would not work.
>
> Re. Stu: If generalized compression can cover this case, that should be
> fine...burn some CPU for a more straightforward implementation.
>
> However, it's often very useful in databases to have transparent
> compression (that is, operations can be performed on the data even in
> its compressed state). So I would advocate not merely passing the row
> blobs through LZW or similar. Aggregation operations benefit in
> particular because you often never even need to decompress the
> rows.
>
> This isn't relevant to current Cassandra, but could be a boon to
> in-database stored procedures and the like.
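Stu's PS describes a block-level scheme: data is written in independently compressed blocks, and a sparse index maps the first key of each block to its location, so a lookup decompresses exactly one block. A hedged sketch of that idea (the names `write_blocks`, `lookup`, and `BLOCK_SIZE` are invented for illustration; Cassandra's actual SSTable format differs):

```python
# Hypothetical sketch of block-level SSTable compression: each block is
# compressed independently, and a sparse index records the first key of
# every block. seek() happens at block granularity: find the block via
# the index, then decompress only that block.
import bisect
import zlib

BLOCK_SIZE = 4  # entries per block; unrealistically small, for illustration

def write_blocks(sorted_items):
    """Return (compressed_blocks, index) where index[i] is block i's first key."""
    blocks, index = [], []
    for i in range(0, len(sorted_items), BLOCK_SIZE):
        chunk = sorted_items[i:i + BLOCK_SIZE]
        index.append(chunk[0][0])
        payload = "\n".join(f"{k}\t{v}" for k, v in chunk).encode()
        blocks.append(zlib.compress(payload))
    return blocks, index

def lookup(blocks, index, key):
    # Last block whose first key is <= key is the only one that can hold it.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for line in zlib.decompress(blocks[i]).decode().splitlines():
        k, v = line.split("\t")
        if k == key:
            return v
    return None

items = sorted((f"key{n:03d}", f"val{n}") for n in range(10))
blocks, index = write_blocks(items)
assert lookup(blocks, index, "key007") == "val7"
assert lookup(blocks, index, "key999") is None
```

The index stays small (one entry per block rather than per key), and decompression cost per read is bounded by the block size rather than the row size.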
>
> Evan
>
> On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood<[email protected]> wrote:
>> Also, long term, I think it is safe to assume that we will be adding
>> compression for ColumnFamilies, which should have similar positive effects
>> on cache-ability without too much application-specific optimization.
>>
>>
>> -----Original Message-----
>> From: "Jonathan Ellis" <[email protected]>
>> Sent: Sunday, July 26, 2009 4:46pm
>> To: [email protected]
>> Subject: Re: Symbolizing column names for storage and cache efficiency
>>
>> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver<[email protected]> wrote:
>>> Would it be possible to add symbolized column names in a
>>> forward-compatible way? Maybe scoped per sstable, with the registries
>>> always kept in memory.
>>
>> Maybe. But it's not obvious to me how to do this in general.
>>
>> The problem is the sparse nature of the column set. We can't encode
>> _all_ the columns this way, or in the degenerate case we OOM just
>> trying to keep the mapping in memory. Similarly, we can't encode just
>> the top N column names, since figuring out the top N requires keeping
>> each name in memory during the counting process. (Besides slowing
>> down compaction -- instead of just deserializing columns where there
>> are keys in common in the merged fragments, we have to deserialize
>> all.)
>>
>> ISTM that all we can do is encode the _first_ N column names we see,
>> which may be useful if the column name set is small for a given CF.
>>
>> -Jonathan
>>
>>
>
>
> --
> Evan Weaver

--
Evan Weaver
