[
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067762#comment-13067762
]
Sylvain Lebresne commented on CASSANDRA-47:
-------------------------------------------
Ok, I think I like the idea of keeping the index of chunk sizes, mostly because
it avoids having to change the index (and generally means that less of the code
has to be aware that we use compression underneath), and also because it is
more compact. A small detail though: I would store the chunk offsets instead of
the chunk sizes, the reason being that offsets are more resilient to corruption
(typically, with chunk sizes, if the first entry is corrupted you're screwed;
with offsets, you only have one or two chunks that are unreadable). And in
memory, I would probably just store those offsets as a long[], which would be
much more compact (my guesstimate is on the order of 6x smaller than a list of
pairs with compressed pointers), and computing a chunk size from consecutive
offsets is a trivial (and fast) computation.
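To make that concrete, here is a minimal sketch of the in-memory index I have
in mind; the names are mine, not from the patch:

    public class ChunkIndex
    {
        private final long[] offsets;        // offsets[i] = start of chunk i in the compressed file
        private final long compressedLength; // total length of the compressed file

        public ChunkIndex(long[] offsets, long compressedLength)
        {
            this.offsets = offsets;
            this.compressedLength = compressedLength;
        }

        // The compressed size of chunk i is just the distance to the next
        // offset (or to the end of the file for the last chunk): no sizes
        // are stored at all.
        public long chunkSize(int i)
        {
            long next = (i + 1 < offsets.length) ? offsets[i + 1] : compressedLength;
            return next - offsets[i];
        }
    }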
I would prefer putting this index and the header into a separate component (a
-Compression component?). Admittedly this is in good part personal preference
(I like that the -Data file contains only data) and symmetry with what we have
so far, but it would also avoid having to wait until we close the file to
write that metadata, which is nice.
Talking about the header, the control-bytes detection is not correct: since we
haven't written such bytes so far, there is no guarantee that an existing data
file won't start with the bytes 'C' then 'D' (having or not having a
-Compression component could serve this purpose, though).
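For instance, something along these lines (the naming convention here is an
assumption, not what the patch does):

    import java.io.File;

    public final class CompressionDetection
    {
        // An old, uncompressed -Data file can start with any bytes, so magic
        // bytes are ambiguous; the presence of the companion component is not.
        public static boolean isCompressed(String dataFilePath)
        {
            return new File(dataFilePath.replace("-Data", "-Compression")).exists();
        }
    }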
In that component, there are also at least a few other pieces of information I
would want to add:
* a version number at the beginning (it's always useful)
* the compression algorithm
* the chunk size
Even if a first version of the patch doesn't allow configuring those, it's
likely we'll change that soon, and it's just a string and an int to add, so we
had better plan ahead.
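Concretely, the serialized component could start with something like the
following; all names and the exact layout are assumptions for illustration:

    import java.io.DataOutput;
    import java.io.IOException;

    public final class CompressionMetadata
    {
        public static void write(DataOutput out, String algorithm, int chunkSize, long[] offsets) throws IOException
        {
            out.writeInt(1);              // version number at the beginning
            out.writeUTF(algorithm);      // compression algorithm, e.g. "Snappy"
            out.writeInt(chunkSize);      // uncompressed chunk size
            out.writeInt(offsets.length); // number of chunks
            for (long offset : offsets)   // the chunk offsets index
                out.writeLong(offset);
        }
    }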
As I said in a previous comment, we also really need to make compression
optional. And actually, I think this patch has quite a bit of code duplication
with BRAF. After all, CompressedDataFile is just a BRAF with a fixed buffer
size and a mechanism to translate pre-compression file positions into
compressed file positions (roughly). So I'm pretty sure it should be possible
to have CompressedDataFile extend BRAF with minimal refactoring (of BRAF, that
is). It would also lift, for free, the limitation of not having read-write
compressed files (not that we use them, but ...).
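The translation itself is just arithmetic over the fixed chunk size plus one
lookup in the offsets index; a hedged sketch (names assumed, not from the
patch):

    public final class PositionTranslator
    {
        private final int chunkSize;  // fixed uncompressed chunk size
        private final long[] offsets; // compressed file offset of each chunk

        public PositionTranslator(int chunkSize, long[] offsets)
        {
            this.chunkSize = chunkSize;
            this.offsets = offsets;
        }

        // Where in the compressed file the chunk holding this position starts.
        public long compressedOffset(long uncompressedPosition)
        {
            return offsets[(int) (uncompressedPosition / chunkSize)];
        }

        // Where to seek within the chunk once it has been decompressed into
        // the BRAF buffer.
        public int offsetInChunk(long uncompressedPosition)
        {
            return (int) (uncompressedPosition % chunkSize);
        }
    }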
> SSTable compression
> -------------------
>
> Key: CASSANDRA-47
> URL: https://issues.apache.org/jira/browse/CASSANDRA-47
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Pavel Yaskevich
> Labels: compression
> Fix For: 1.0
>
> Attachments: CASSANDRA-47-v2.patch, CASSANDRA-47.patch,
> snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O
> (almost always a good trade).
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira