Re. Jonathan, I haven't run across a row-oriented use case where symbolizing merely the first 1000 column names seen would not work.
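
To make the first-N idea concrete, here is a rough sketch in Python. It is not Cassandra code: the class and method names are made up for illustration, and I am assuming one registry per sstable that is kept in memory. Names seen after the registry fills up are simply stored raw.

```python
class ColumnNameRegistry:
    """Hypothetical registry: symbolizes the first `capacity` distinct names seen."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._ids = {}    # column name -> small integer id
        self._names = []  # integer id -> column name

    def encode(self, name):
        """Return ("sym", id) if the name is symbolized, else ("raw", name)."""
        sym = self._ids.get(name)
        if sym is None and len(self._names) < self.capacity:
            # Still room: register this name the first time we see it.
            sym = len(self._names)
            self._ids[name] = sym
            self._names.append(name)
        if sym is None:
            # Registry is full and the name is unknown: fall back to the raw name.
            return ("raw", name)
        return ("sym", sym)

    def decode(self, token):
        """Invert encode(): look up symbolized ids, pass raw names through."""
        kind, value = token
        return self._names[value] if kind == "sym" else value


# Tiny capacity so the fallback path is exercised.
registry = ColumnNameRegistry(capacity=2)
row = ["user_id", "created_at", "user_id", "bio"]
encoded = [registry.encode(name) for name in row]
decoded = [registry.decode(token) for token in encoded]
```

Memory is bounded by the capacity regardless of how sparse the column set is, and no counting pass is needed, which is the point of taking the first N rather than the top N.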
Re. Stu, if generalized compression can cover this case, that should be fine: burn some CPU for a more straightforward implementation. However, it's often very useful in databases to have transparent compression (that is, operations can be performed on the data even in its compressed state), so I would advocate not merely passing the row blobs through LZW or similar. Aggregation operations benefit in particular, because often you never need to decompress the rows at all. This isn't relevant with current Cassandra, but could be a boon to in-database stored procedures and the like.

Evan

On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood<[email protected]> wrote:
> Also, long term, I think it is safe to assume that we will be adding
> compression for ColumnFamilies, which should have similar positive effects on
> cache-ability without too much application specific optimization.
>
>
> -----Original Message-----
> From: "Jonathan Ellis" <[email protected]>
> Sent: Sunday, July 26, 2009 4:46pm
> To: [email protected]
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver<[email protected]> wrote:
>> Would it be possible to add symbolized column names in a
>> forward-compatible way? Maybe scoped per sstable, with the registries
>> always kept in memory.
>
> Maybe. But it's not obvious to me how to do this in general.
>
> The problem is the sparse nature of the column set. We can't encode
> _all_ the columns this way, or in the degenerate case we OOM just
> trying to keep the mapping in memory. Similarly, we can't encode just
> the top N column names, since figuring out the top N requires keeping
> each name in memory during the counting process. (Besides slowing
> down compaction -- instead of just deserializing columns where there
> are keys in common in the merged fragments, we have to deserialize
> all.)
>
> ISTM that all we can do is encode the _first_ N column names we see,
> which may be useful if the column name set is small for a given CF.
>
> -Jonathan
>
>

--
Evan Weaver
