[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13071172#comment-13071172 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/26/11 4:04 PM:
---
We need to decide whether to do this per CF or at the global level. I don't think that mmap of the compressed file is a good idea, because we won't be able to avoid buffer copies anyway, as we do with uncompressed data (see MappedFileDataInput). Agree with the other arguments not related to mmap mode.

bq. Let's add the 'compression algorithm' in the compressionInfo component. It's fine to hard set it to Snappy for writes and ignore the value on read for now.

We will need that field to be fixed size, or size + value, because just writing a string at the header could potentially be dangerous.

bq. In SSTR and SSTW, we can use the isCompressed SSTable flag instead of 'if (components.contains(Component.COMPRESSION_INFO))'.

I will remove one use of it in the SSTW, but in the SSTR it is used in the static method, where we don't have the isCompressed flag.
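The "size + value" layout suggested above can be sketched in a few lines. This is an illustrative snippet only, not the patch's actual serialization code; the class and method names are hypothetical. A length prefix lets a reader validate or skip the field instead of trusting a bare string at the header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a length-prefixed 'compression algorithm' field.
class CompressionAlgorithmField
{
    static byte[] write(String algorithm)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] name = algorithm.getBytes(StandardCharsets.UTF_8);
            out.writeShort(name.length); // 2-byte size prefix, then the value
            out.write(name);
            out.flush();
            return bytes.toByteArray();
        }
        catch (IOException e)
        {
            throw new RuntimeException(e); // cannot happen for in-memory streams
        }
    }

    static String read(byte[] field)
    {
        try
        {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(field));
            byte[] name = new byte[in.readUnsignedShort()]; // read size, then exactly that many bytes
            in.readFully(name);
            return new String(name, StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }
}
```

A bounded read like this fails cleanly on a truncated or garbage header, which is the danger the comment is pointing at.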
SSTable compression
---

Key: CASSANDRA-47
URL: https://issues.apache.org/jira/browse/CASSANDRA-47
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Jonathan Ellis
Assignee: Pavel Yaskevich
Labels: compression
Fix For: 1.0
Attachments: CASSANDRA-47-v2.patch, CASSANDRA-47-v3-rebased.patch, CASSANDRA-47-v3.patch, CASSANDRA-47.patch, snappy-java-1.0.3-rc4.jar

We should be able to do SSTable compression which would trade CPU for I/O (almost always a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067781#comment-13067781 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/19/11 3:20 PM:
---
bq. A small detail though is that I would store the chunk offsets instead of the chunk sizes, the reason being that it's more resilient to corruption (typically, with chunk sizes, if the first entry is corrupted you're screwed; with offsets, you only have one or two chunks that are unreadable).

+1 if we go with a separate file. I'm thinking that if we go with a separate file, I will use the same strategy as I did in v1 - store the chunk size at the beginning of the chunk and re-read it, instead of keeping it in memory (lowers memory usage for larger files).

bq. After all, CompressedDataFile is just a BRAF with a fixed buffer size, and a mechanism to translate pre-compaction file position to compressed file position (roughly). So I'm pretty sure it should be possible to have CompressedDataFile extend BRAF with minimum refactoring (of BRAF that is). It would also lift for free the limitation of not having read-write compressed files (not that we use them but ...).

To extend BRAF we will need to split it into Input/Output classes, which will imply refactoring of the skip-cache functionality and other parts of that class. I'd rather create a separate issue to do that after compression is committed, instead of putting all eggs in one basket.

+1 on everything else.
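The resilience argument for offsets over sizes can be made concrete with a small sketch (hypothetical helper, not code from the patch): with per-chunk sizes, locating a chunk requires a running sum, so one corrupt entry shifts every later chunk, while absolute offsets localize the damage to one or two chunks.

```java
// Hypothetical sketch: locating chunk N in the compressed file
// under the two candidate index layouts.
class ChunkIndex
{
    // With sizes, chunk i's position is the sum of sizes 0..i-1:
    // one corrupt size entry corrupts the position of every later chunk.
    static long positionFromSizes(int[] sizes, int chunk)
    {
        long position = 0;
        for (int i = 0; i < chunk; i++)
            position += sizes[i];
        return position;
    }

    // With absolute offsets, each chunk is located independently:
    // a corrupt entry loses only the chunks adjacent to it.
    static long positionFromOffsets(long[] offsets, int chunk)
    {
        return offsets[chunk];
    }
}
```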
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064005#comment-13064005 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/12/11 5:09 PM:
---
Thanks for your report! This will be fixed in the next patch.
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062204#comment-13062204 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/8/11 9:46 PM:
---
The patch introduces CompressedDataFile with Input/Output classes. Snappy is used for compression/decompression because it showed better speeds in tests compared to ning. Files are split into 4-byte + 64KB chunks, where the 4 bytes hold the compressed chunk size; note that the current SSTable file format is preserved and no modifications were made to the index, statistics or filter components. Both Input and Output classes extend RandomAccessFile, so random I/O works as expected. All SSTable files are opened using CompressedDataFile.Input. On startup, when SSTableReader.open gets called, it first checks whether the data file is already compressed, and compresses it if it was not, so users won't have a problem after they update. At the header of the file it reserves 8 bytes for the real data size, so other components of the system that use SSTables, and SSTables themselves, have no idea that the data file is compressed. Streaming of the data file sends decompressed chunks for convenience of maintaining the transfer, and the receiving party compresses all data before writing it to the backing file (see CompressedDataFile.transfer(...) and the CompressedFileReceiver class).

Tests are showing a dramatic performance increase when reading 1 million rows created with 1024-byte random values. Current code takes 1000 secs to read, but with the current patch only 175 secs. Using a 64KB buffer, a 1.7GB file could be compressed into 110MB (data added using ./bin/stress -n 100 -S 1024 -r, where the -r option generates random values). Writes perform a bit better, like 5-10%.
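The chunk layout described above (a 4-byte compressed length followed by up to 64KB of data) implies a simple translation from an uncompressed file position to a chunk. A minimal sketch, assuming fixed 64KB uncompressed chunks; the names are illustrative, not CompressedDataFile's actual API:

```java
// Hypothetical sketch: translate an uncompressed ("logical") position
// into a chunk number and an offset within that chunk's decompressed data.
class ChunkedPosition
{
    static final int CHUNK_SIZE = 64 * 1024; // uncompressed bytes per chunk

    // which chunk holds this uncompressed position
    static int chunkIndex(long uncompressedPosition)
    {
        return (int) (uncompressedPosition / CHUNK_SIZE);
    }

    // offset of the position inside the chunk, once decompressed
    static int offsetInChunk(long uncompressedPosition)
    {
        return (int) (uncompressedPosition % CHUNK_SIZE);
    }
}
```

A seek then means: find the chunk, read its 4-byte compressed length, decompress it, and position the buffer at the in-chunk offset - which is why the index can keep referring to uncompressed positions.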
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062228#comment-13062228 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/8/11 10:21 PM:
---
bq. The -r flag generates random keys: unless you modified stress.java, the values will be the same for every row.

Oh, sorry! I meant -V, not -r. Also used various cardinality 50-250 in the tests.
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062204#comment-13062204 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/8/11 10:24 PM:
---
The patch introduces CompressedDataFile with Input/Output classes. Snappy is used for compression/decompression because it showed better speeds in tests compared to ning. Files are split into 4-byte + 64KB chunks, where the 4 bytes hold the compressed chunk size; note that the current SSTable file format is preserved and no modifications were made to the index, statistics or filter components. Both Input and Output classes extend RandomAccessFile, so random I/O works as expected. All SSTable files are opened using CompressedDataFile.Input. On startup, when SSTableReader.open gets called, it first checks whether the data file is already compressed, and compresses it if it was not, so users won't have a problem after they update. At the header of the file it reserves 8 bytes for the real data size, so other components of the system that use SSTables, and SSTables themselves, have no idea that the data file is compressed. Streaming of the data file sends decompressed chunks for convenience of maintaining the transfer, and the receiving party compresses all data before writing it to the backing file (see CompressedDataFile.transfer(...) and the CompressedFileReceiver class).

Tests are showing a dramatic performance increase when reading 1 million rows created with 1024-byte random values. Current code takes 1000 secs to read, but with the current patch only 175 secs. Using a 64KB buffer, a 1.7GB file could be compressed into 110MB (data added using ./bin/stress -n 100 -S 1024 -V, where the -V option generates average-size values and different cardinality, from 50 (default) to 250). Writes perform a bit better, like 5-10%.
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062240#comment-13062240 ] Pavel Yaskevich edited comment on CASSANDRA-47 at 7/9/11 12:14 AM:
---
It just refers to uncompressed locations; I didn't see a need to change that.
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033467#comment-13033467 ] Terje Marthinussen edited comment on CASSANDRA-47 at 5/14/11 5:45 AM:
---
Just curious if any active work is done or planned in the near future on compressing larger data blocks, or is it all suspended waiting for a new sstable design? Having played with compression of just supercolumns for a while, I am a bit tempted to test out compression of larger blocks of data. At least row-level compression seems reasonably easy to do.

Some experiences so far which may be useful:

- Compression on sstables may actually be helpful on memory pressure, but with my current implementation, non-batched update throughput may drop 50%. I am not 100% sure why, actually.
- Flushing of (compressed) memtables and compactions are clear potential bottlenecks. The obvious trouble makers here is the fact that you keep
- For really high-pressure work, I think it would be useful to only compress tables once they pass a certain size, to reduce the amount of recompression occurring on memtable flushes and when compacting small sstables (which is generally not a big disk problem anyway). This is a bit awkward when doing things like I do in the super columns, as I believe the supercolumn does not know anything about the data it is part of (except that recently, the deserializer has that info through inner. It would anyway probably be cleaner to let the data structures/methods using the SC decide when to compress and not
- Working on an SC level, there seems to be some 10-15% extra compression on this specific data if column names that are highly repetitive in SCs can be extracted into some metadata structure, so you only store references to these in the column names. That is, the final data goes from about 40% compression to 50% compression. I don't think the effect of this will be equally big with larger blocks, but I suspect there should be some effect.
- Total size reduction of the sstables, when using a dictionary for column names as well as timestamps and variable-length length fields, is currently in the 60-65% range. It is however mainly beneficial for those that have supercolumns with at least a handful of columns (400-600 bytes of serialized column data per SC at least).
- Reducing the metadata on columns by building a dictionary of timestamps, as well as variable-length name/value length data (instead of fixed short/int), cuts down another 10% in my test (I have just done a very quick simulation of this with a very quick 10-minute hack on the serializer).
- We may want to look at how we can reuse whole compressed rows on compactions, for instance if the other tables you compact with do not have the same data.
- We may want a new cache on the uncompressed disk chunks. In my supercolumn compression case, I have a cache for the compressed data so I can write that back without recompression if not modified. This also makes calls to get the serialized size cheaper (no need to compress both to find the serialized size and to actually serialize).

If people are interested in adding any of the above to current cassandra, I will try to get time to bring some of this up to a quality where it could be used by the general public. If not, I will wait for new sstables to get a bit more ready and see if I can contribute there instead.
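The column-name dictionary idea above can be sketched roughly as follows. This is a hypothetical illustration of the approach, not Terje's implementation: repeated names are stored once in a side table, and each column serializes only a small integer reference.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: dictionary-encode repetitive column names so each
// column stores a small integer id instead of the full name bytes.
class NameDictionary
{
    private final List<String> names = new ArrayList<>();
    private final Map<String, Integer> ids = new HashMap<>();

    // intern a name, returning its stable id (assigned in first-seen order)
    int idFor(String name)
    {
        Integer id = ids.get(name);
        if (id == null)
        {
            id = names.size();
            names.add(name);
            ids.put(name, id);
        }
        return id;
    }

    // reverse lookup used when deserializing
    String nameFor(int id)
    {
        return names.get(id);
    }
}
```

With a handful of columns per supercolumn sharing the same few names, each repeated name collapses to one varint-sized reference, which is where the extra 10-15% the comment reports would come from.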
[jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13010197#comment-13010197 ] Brandon Williams edited comment on CASSANDRA-47 at 3/23/11 4:16 PM:
---
I think this idea hits the sweet spot where we currently stand. Compression is a *huge* win for us, and not having to rewrite the entire format simplifies the complexity greatly.