[
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033467#comment-13033467
]
Terje Marthinussen commented on CASSANDRA-47:
---------------------------------------------
Just curious whether any active work is being done, or planned for the near future, on compressing
larger data blocks, or whether it is all suspended waiting for a new sstable design?
Having played with compression of just supercolumns for a while, I am a bit
tempted to test out compression of larger blocks of data. At least row level
compression seems reasonably easy to do.
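To make the row/block-level idea concrete, something along these lines is roughly what I have in
mind, just using java.util.zip and nothing Cassandra-specific (the class and method names are made
up for illustration):

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative helper only: compresses an already-serialized row (or any
// block of bytes) in one shot. Not actual Cassandra code.
public class BlockCompressor
{
    public static byte[] compress(byte[] serializedRow)
    {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(serializedRow);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream(serializedRow.length);
        byte[] buffer = new byte[4096];
        while (!deflater.finished())
            out.write(buffer, 0, deflater.deflate(buffer));
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] compressed) throws Exception
    {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);

        ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 2);
        byte[] buffer = new byte[4096];
        while (!inflater.finished())
            out.write(buffer, 0, inflater.inflate(buffer));
        inflater.end();
        return out.toByteArray();
    }
}
{code}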
Some experiences so far which may be useful:
- Compression on sstables may actually help with memory pressure, but with my current
implementation, non-batched update throughput may drop 50%. I am not 100% sure why.
- Flushing of (compressed) memtables and compactions are clear potential bottlenecks.
The obvious troublemaker here is that you keep recompressing the same data on every
flush and compaction.
For really high-pressure workloads, I think it would be useful to only compress sstables once they
pass a certain size, to reduce the amount of recompression occurring on memtable flushes and when
compacting small sstables (which are generally not a big disk problem anyway); a rough sketch of
what I mean is below.
This is a bit awkward when doing things the way I do in the supercolumns, as I believe the
supercolumn does not know anything about the data it is part of (except that, recently, the
deserializer has that info through "inner"). It would anyway probably be cleaner to let the
data structures/methods using the SC decide when to compress, rather than the SC itself.
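Roughly what I mean by the size threshold, reusing the BlockCompressor sketch above (the 64 KB
cut-off is just a placeholder, not a tuned value):

{code:java}
// Illustrative policy only: skip compression for small blocks so small
// memtable flushes and minor compactions are not recompressed over and over.
public class CompressionPolicy
{
    private static final int MIN_COMPRESS_SIZE = 64 * 1024; // placeholder threshold

    public static byte[] maybeCompress(byte[] serialized)
    {
        if (serialized.length < MIN_COMPRESS_SIZE)
            return serialized; // written uncompressed
        return BlockCompressor.compress(serialized);
    }
}
{code}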
- Working at the SC level, there seems to be some 10-15% extra compression on this specific data
if column names that are highly repetitive within SCs can be extracted into some metadata
structure, so that only references to them are stored in the column names. That is, the final
data goes from about 40% compression to 50% compression.
I don't think the effect of this will be equally big with larger blocks, but I
suspect there should be some effect.
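The name-dictionary idea, very roughly (per block; this is not the actual serializer, just an
illustration):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative per-block dictionary: highly repetitive column names are stored
// once, and each column only carries a small integer reference to them.
public class NameDictionary
{
    private final Map<String, Integer> ids = new HashMap<String, Integer>();
    private final List<String> names = new ArrayList<String>();

    // used while serializing column names
    public int idFor(String columnName)
    {
        Integer id = ids.get(columnName);
        if (id == null)
        {
            id = names.size();
            ids.put(columnName, id);
            names.add(columnName);
        }
        return id;
    }

    // used while deserializing
    public String nameFor(int id)
    {
        return names.get(id);
    }

    // the dictionary itself would be written once as block metadata
    public List<String> entries()
    {
        return names;
    }
}
{code}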
- Total size reduction of the sstables I have in this project is currently in the 60-65% range.
It is mainly beneficial for those that have supercolumns with at least a handful of columns
(at least 400-600 bytes of serialized column data per SC).
- Reducing the metadata on columns by building a dictionary of timestamps, as well as using
variable-length name/value length fields (instead of fixed short/int), cuts down another 10% in
my test (I have only simulated this with a very quick "10 minute" hack on the serializer).
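The variable-length length fields are basically a standard varint; something like this (again,
not the actual serializer code):

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Illustrative variable-length encoding: small lengths (the common case) take
// one byte instead of a fixed short/int. The same small integers could also
// index into a per-block timestamp dictionary.
public class VIntCoding
{
    public static void writeUnsignedVInt(long value, DataOutput out) throws IOException
    {
        while ((value & ~0x7FL) != 0)
        {
            out.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((int) value);
    }

    public static long readUnsignedVInt(DataInput in) throws IOException
    {
        long value = 0;
        int shift = 0;
        while (true)
        {
            byte b = in.readByte();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return value;
            shift += 7;
        }
    }
}
{code}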
- We may want to look at how we can reuse whole compressed rows on compactions if, for instance,
the other sstables you compact with do not have the same data.
- We may want a new cache for the uncompressed disk chunks. In my supercolumn compression case,
I keep a cache of the compressed data so I can write it back without recompression if it has not
been modified. This also makes calls to get the serialized size cheaper (no need to compress both
to find the serialized size and to actually serialize).
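Conceptually, such a cache could look something like this (the key type, eviction and names are
made up for illustration, not what I actually have):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative cache of compressed blocks, keyed by whatever identifies the
// supercolumn/row on disk. If the block has not been modified since it was
// read, the compressed bytes can be reused directly on write, and the
// serialized size is just the cached length.
public class CompressedBlockCache
{
    private static final int MAX_ENTRIES = 1024; // placeholder

    private final Map<String, byte[]> cache =
        new LinkedHashMap<String, byte[]>(16, 0.75f, true)
        {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest)
            {
                return size() > MAX_ENTRIES; // simple LRU eviction
            }
        };

    public synchronized void put(String key, byte[] compressed)
    {
        cache.put(key, compressed);
    }

    // returns null if we have no unmodified compressed copy
    public synchronized byte[] get(String key)
    {
        return cache.get(key);
    }

    // call when the block is modified, so stale compressed bytes are not reused
    public synchronized void invalidate(String key)
    {
        cache.remove(key);
    }

    // serialized size without recompressing; -1 if not cached
    public synchronized int serializedSize(String key)
    {
        byte[] compressed = cache.get(key);
        return compressed == null ? -1 : compressed.length;
    }
}
{code}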
If people are interested in adding any of the above to current Cassandra, I will try to find time
to bring some of this up to a quality where it could be used by the general public.
If not, I will wait for the new sstables to get a bit more ready and see if I can contribute
there instead.
> SSTable compression
> -------------------
>
> Key: CASSANDRA-47
> URL: https://issues.apache.org/jira/browse/CASSANDRA-47
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Priority: Minor
> Labels: compression
> Fix For: 1.0
>
>
> We should be able to do SSTable compression which would trade CPU for I/O
> (almost always a good trade).