[
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071139#comment-13071139
]
Sylvain Lebresne commented on CASSANDRA-47:
-------------------------------------------
I think we're on the right track. A bunch of comments though:
* I would put the compression option at the CF level. I think it will be
particularly useful at first, when people want to test compression before
trusting it, to be able to switch a few less important CFs to compression
first. In the longer term, compression probably has a small cost for some
workloads (especially considering we don't yet have mmapped compressed files),
so it'd also be useful for that. And it's not really more effort on our side
anyway. Nit: I would also call the option more concisely 'compress_sstables'.
* I would move the Metadata class out of CRAR (and call it
CompressionMetadata). There are two reasons:
** It is useful outside of CRAR: one example is already
CompressedSegmentedFile. Another will be when/if we add a
MMappedCompressedSegmentFile.
** I would also move all the code related to the metadata there. For
instance, we can have a CompressionMetadata.Writer holding the writing code
that CSW would use. That would improve encapsulation I think (a rough sketch
of what I mean follows at the end of this list).
* About the Metadata, we should really use a long[] instead of List<Long> for
the offsets; the memory size difference will be significant. Also, since this
is a fairly performance sensitive path, I would create a small Chunk class with
a long and an int field to replace the Pair<Long, Integer> returned by
chunkFor, to avoid the boxing/unboxing (also sketched below).
* Let's add the 'compression algorithm' in the compressionInfo component. It's
fine to hard-code it to "Snappy" on write and ignore the value on read for now.
* In CFS.scrubDataDirectory, it seems that if there is a compressionInfo
component but no data file, we do not purge the orphaned files. I believe this
is wrong.
* CRAR.transfer() should honor FileStreamTask.CHUNK_SIZE (as a max at least)
for its buffer. In particular, allowing a 2GB buffer as we do now (and that's
not even an unrealistic size) is not a good idea.
* CompressedSequentialWriter does not honor skipIOCache during flush. I
actually think that flush() is a bit problematic for CSW in that it should only
be called when the buffer is full (or it is the end). That is obviously true of
reBuffer and sync too, since they use flush. But flush() and sync() are public.
Maybe we could rename flush and sync (to, say, flushInternal and syncInternal)
and make the public methods throw an unsupported exception. In any case, I
think it would be cleaner if the flush of CSW called the one of SW (this would
solve the skipIOCache problem in particular). We can move the buffer write into
a protected method in SW and only override that in CSW, and leave the flush of
SW untouched otherwise (see the sketch after this list).
* Not totally related to this patch, but it seems that in SW, when you call
reBuffer, there is one call to resetBuffer in flush() and one after it. The one
in reBuffer seems useless (which would really make reBuffer just an alias for
flush()).
* In CSW, why not reuse the buffer used for the compressed data?
* I think the truncate methods are not correct for CSW. The offset given is a
position in the uncompressed data, so it will not truncate where it should.
For truncateAndClose(), we should probably not have it public anyway: in the
only place it is used, it's really close() that you want, and we can do "the
right thing" in close for CSW. For truncate, it should probably throw an
unsupported exception.
For resetAndTruncate, I think we have a harder problem. What the method
should do is: seek back to the chunk containing the wanted destination (which
implies keeping the chunk offsets in memory), load that chunk into the buffer,
and position bufferOffset correctly (a rough sketch of those steps follows at
the end of this list).
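Before the nits, here are rough sketches of a few of the points above. First,
the CompressionMetadata one; all names and fields below are an illustration
only, not what the patch currently has:
{code:java}
// Sketch only: a standalone CompressionMetadata holding what readers need,
// with the serialization code that currently lives in CSW moved to a nested Writer.
import java.io.DataOutput;
import java.io.IOException;

public class CompressionMetadata
{
    public final String algorithm;     // hard-coded to "Snappy" for now
    public final int chunkLength;      // uncompressed chunk size
    public final long dataLength;      // total uncompressed data length
    public final long[] chunkOffsets;  // offset of each compressed chunk in the data file

    public CompressionMetadata(String algorithm, int chunkLength, long dataLength, long[] chunkOffsets)
    {
        this.algorithm = algorithm;
        this.chunkLength = chunkLength;
        this.dataLength = dataLength;
        this.chunkOffsets = chunkOffsets;
    }

    // CSW would only deal with this Writer instead of serializing the metadata itself
    public static class Writer
    {
        private final DataOutput out;

        public Writer(DataOutput out)
        {
            this.out = out;
        }

        public void writeHeader(String algorithm, int chunkLength, long dataLength) throws IOException
        {
            out.writeUTF(algorithm);
            out.writeInt(chunkLength);
            out.writeLong(dataLength);
        }

        public void addChunkOffset(long offset) throws IOException
        {
            out.writeLong(offset);
        }
    }
}
{code}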
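Same idea for the Chunk/chunkFor point (again the names are purely
illustrative, and in practice this would just live in the CompressionMetadata
above rather than in a separate class):
{code:java}
// Sketch: an unboxed Chunk plus a chunkFor that returns it, using a long[] for
// the offsets instead of List<Long> and Pair<Long, Integer>.
public class ChunkIndex
{
    public static final class Chunk
    {
        public final long offset;  // position of the compressed chunk in the data file
        public final int length;   // length of the compressed chunk on disk

        public Chunk(long offset, int length)
        {
            this.offset = offset;
            this.length = length;
        }
    }

    private final int chunkLength;           // uncompressed chunk size
    private final long[] chunkOffsets;       // one entry per chunk
    private final long compressedFileLength; // to compute the length of the last chunk

    public ChunkIndex(int chunkLength, long[] chunkOffsets, long compressedFileLength)
    {
        this.chunkLength = chunkLength;
        this.chunkOffsets = chunkOffsets;
        this.compressedFileLength = compressedFileLength;
    }

    public Chunk chunkFor(long uncompressedPosition)
    {
        int i = (int) (uncompressedPosition / chunkLength);
        long begin = chunkOffsets[i];
        long end = i + 1 < chunkOffsets.length ? chunkOffsets[i + 1] : compressedFileLength;
        return new Chunk(begin, (int) (end - begin));
    }
}
{code}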
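For the SW/CSW flush() point, the refactoring I'm suggesting would look roughly
like this; everything is heavily simplified, the field names are made up for
the example and compress() is a stand-in for the Snappy call:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch: flush() stays in SW (with the skipIOCache handling etc.) and only the
// actual buffer write moves to a protected method that CSW overrides.
class SequentialWriterSketch
{
    protected final RandomAccessFile out;
    protected final byte[] buffer = new byte[64 * 1024];
    protected int validBufferBytes;

    SequentialWriterSketch(RandomAccessFile out)
    {
        this.out = out;
    }

    public void flush() throws IOException
    {
        // skipIOCache handling, bookkeeping, etc. would stay here, shared by both writers
        flushData();
        validBufferBytes = 0;
    }

    // the only part that differs between the plain and the compressed writer
    protected void flushData() throws IOException
    {
        out.write(buffer, 0, validBufferBytes);
    }
}

class CompressedSequentialWriterSketch extends SequentialWriterSketch
{
    CompressedSequentialWriterSketch(RandomAccessFile out)
    {
        super(out);
    }

    @Override
    protected void flushData() throws IOException
    {
        // compress buffer[0..validBufferBytes) and write the compressed chunk,
        // recording its offset for the compression metadata
        byte[] compressed = compress(buffer, validBufferBytes);
        out.write(compressed, 0, compressed.length);
    }

    private byte[] compress(byte[] data, int length)
    {
        return Arrays.copyOf(data, length); // placeholder, would be the Snappy call
    }
}
{code}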
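And lastly, a very rough sketch of the resetAndTruncate steps. Everything is
passed as a parameter only to keep the example self-contained (in CSW these
would be fields), and Decompressor is again a stand-in for the Snappy call:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

final class ResetAndTruncateSketch
{
    interface Decompressor
    {
        // decompresses 'compressed' into 'into' and returns the uncompressed length
        int decompress(byte[] compressed, byte[] into) throws IOException;
    }

    // Returns the new bufferOffset; the caller would set the current buffer
    // position to (uncompressedTarget - bufferOffset) so the next append
    // overwrites everything past the target.
    static long resetAndTruncate(RandomAccessFile file,
                                 long[] chunkOffsets,   // kept in memory, one entry per chunk
                                 int chunkLength,       // uncompressed chunk size
                                 long compressedLength, // current length of the compressed file
                                 byte[] buffer,         // the uncompressed buffer of CSW
                                 Decompressor decompressor,
                                 long uncompressedTarget) throws IOException
    {
        // 1. find the chunk containing the wanted destination
        int index = (int) (uncompressedTarget / chunkLength);
        long begin = chunkOffsets[index];
        long end = index + 1 < chunkOffsets.length ? chunkOffsets[index + 1] : compressedLength;

        // 2. reload and decompress that chunk into the buffer
        file.seek(begin);
        byte[] compressed = new byte[(int) (end - begin)];
        file.readFully(compressed);
        decompressor.decompress(compressed, buffer);

        // 3. truncate the file at the start of that chunk: its content now lives
        //    in the buffer and will be rewritten on the next flush
        file.setLength(begin);
        return (long) index * chunkLength;
    }
}
{code}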
And to end, a few nits:
* In SegmentedFile, I would prefer adding a 'compressed' flag to getBuilder()
instead of having getCompressedBuilder, if only because we should probably add
an mmapped compressed mode at some point (see the sketch after this list).
* There's an unused added import in CompactionManager, and a wrongly placed
import in SSTW.
* In CRAR, I would use metadata.chunkLength instead of buffer.length in most
places (decompressChunk mainly). I know it is the same value right now, but
it's really chunkLength that we want here, so it feels clearer (nothing
prevents us from using a buffer larger than chunkLength, not that it would be
useful).
* In SSTR and SSTW, we can use the isCompressed SSTable flag instead of 'if
(components.contains(Component.COMPRESSION_INFO))'.
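For the getBuilder() nit, something along those lines is what I had in mind
(the builder class names for the compressed modes are only placeholders, not
necessarily what the patch uses):
{code:java}
// Sketch: one entry point with a 'compressed' flag instead of a separate
// getCompressedBuilder(), leaving room for an mmapped compressed mode later.
public static SegmentedFile.Builder getBuilder(Config.DiskAccessMode mode, boolean compressed)
{
    if (compressed)
        return new CompressedSegmentedFile.Builder(); // an mmapped compressed builder could slot in here later
    return mode == Config.DiskAccessMode.mmap
         ? new MmappedSegmentedFile.Builder()
         : new BufferedSegmentedFile.Builder();
}
{code}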
> SSTable compression
> -------------------
>
> Key: CASSANDRA-47
> URL: https://issues.apache.org/jira/browse/CASSANDRA-47
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Pavel Yaskevich
> Labels: compression
> Fix For: 1.0
>
> Attachments: CASSANDRA-47-v2.patch, CASSANDRA-47-v3-rebased.patch,
> CASSANDRA-47-v3.patch, CASSANDRA-47.patch, snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O
> (almost always a good trade).