[ 
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062204#comment-13062204
 ] 

Pavel Yaskevich edited comment on CASSANDRA-47 at 7/8/11 9:46 PM:
------------------------------------------------------------------

Patch introduces CompressedDataFile with Input/Output classes. Snappy is used 
for compression/decompression because it showed better speeds than ning in 
tests. Files are split into 4-byte + 64KB chunks, where the 4 bytes hold the 
compressed chunk size. Note that the current SSTable file format is preserved 
and no modifications were made to the index, statistics, or filter components. 
Both Input and Output classes extend RandomAccessFile so random I/O works as 
expected.
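A minimal sketch of that chunk framing, assuming a simple length-prefixed layout; class and method names here are hypothetical, and java.util.zip's Deflater/Inflater stand in for Snappy so the example is self-contained:

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical sketch of the "4 bytes + 64KB chunk" framing described above.
// Deflater/Inflater stand in for Snappy (snappy-java is not on a default classpath).
public class ChunkFraming {
    static final int CHUNK_SIZE = 64 * 1024; // 64KB of uncompressed data per chunk

    // Compress one chunk and prefix it with its compressed size as a 4-byte int.
    static void writeChunk(DataOutputStream out, byte[] chunk) throws IOException {
        Deflater deflater = new Deflater();
        deflater.setInput(chunk);
        deflater.finish();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[4096];
        while (!deflater.finished())
            buf.write(tmp, 0, deflater.deflate(tmp));
        deflater.end();
        out.writeInt(buf.size()); // 4 bytes: compressed chunk size
        buf.writeTo(out);         // followed by the compressed payload
    }

    // Read the 4-byte size, then decompress exactly that many bytes.
    static byte[] readChunk(DataInputStream in) throws IOException {
        byte[] compressed = new byte[in.readInt()];
        in.readFully(compressed);
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] chunk = new byte[CHUNK_SIZE];
        int n = 0;
        try {
            while (!inflater.finished())
                n += inflater.inflate(chunk, n, chunk.length - n);
        } catch (DataFormatException e) {
            throw new IOException(e);
        }
        inflater.end();
        return java.util.Arrays.copyOf(chunk, n);
    }
}
```

Because each chunk carries its own compressed size, chunks can be read sequentially without any external index, which is what lets the on-disk SSTable format stay unchanged.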

All SSTable files are opened using CompressedDataFile.Input. On startup, when 
SSTableReader.open is called, it first checks whether the data file is already 
compressed and compresses it if not, so users won't have problems after they 
upgrade.

The file reserves 8 bytes at its header for the "real data size", so other 
components of the system that use SSTables, and the SSTables themselves, have 
no idea that the data file is compressed.
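A sketch of that 8-byte header, assuming it is simply a long written at offset 0 of the file (the class and method names are hypothetical, not the patch's actual API):

```java
import java.io.*;

// Hypothetical sketch: reserve 8 bytes at the start of the file for the
// uncompressed ("real") data size, so callers can learn the logical length
// without knowing the file is compressed.
public class HeaderSketch {
    static void writeHeader(RandomAccessFile file, long realDataSize) throws IOException {
        file.seek(0);
        file.writeLong(realDataSize); // 8-byte header: real (uncompressed) size
    }

    static long readRealDataSize(RandomAccessFile file) throws IOException {
        file.seek(0);
        return file.readLong();
    }
}
```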

Streaming of the data file sends decompressed chunks for convenience of 
maintaining the transfer; the receiving party compresses all data before 
writing to the backing file (see CompressedDataFile.transfer(...) and the 
CompressedFileReceiver class).
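A receiver-side sketch of that flow, assuming raw (decompressed) bytes arrive off the wire and are recompressed before hitting disk; the class name is hypothetical and DeflaterOutputStream stands in for the patch's Snappy-based writer:

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical sketch mirroring the CompressedFileReceiver behaviour described
// above: decompressed chunks come in over the stream, and everything is
// recompressed before being written to the backing file.
public class ReceiverSketch {
    static void receive(InputStream wire, OutputStream backingFile) throws IOException {
        DeflaterOutputStream compressed = new DeflaterOutputStream(backingFile);
        byte[] buf = new byte[64 * 1024]; // read up to one 64KB chunk at a time
        int n;
        while ((n = wire.read(buf)) != -1)
            compressed.write(buf, 0, n);  // recompress before writing to disk
        compressed.finish();
    }
}
```

Sending decompressed chunks keeps the transfer logic simple: the sender doesn't need to re-chunk or align compressed blocks, and the receiver is free to compress with its own settings.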

Tests show a dramatic performance increase when reading 1 million rows created 
with 1024-byte random values. The current code takes well over 1000 secs to 
read, but with this patch only 175 secs. Using a 64KB buffer, a 1.7GB file 
could be compressed into 110MB (data added using ./bin/stress -n 1000000 -S 
1024 -r, where the -r option generates random values).

Writes perform a bit better, roughly 5-10% faster. 

> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>              Labels: compression
>             Fix For: 1.0
>
>         Attachments: CASSANDRA-47.patch, snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O 
> (almost always a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
