Adrien Grand created LUCENE-4226:
------------------------------------

             Summary: Efficient compression of small to medium stored fields
                 Key: LUCENE-4226
                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/index
            Reporter: Adrien Grand
            Priority: Trivial


I've been doing some experiments with stored fields lately. It is very common 
for an index with stored fields enabled to have most of its space used by the 
.fdt index file. To prevent this .fdt file from growing too much, one option is 
to compress stored fields. Although compression works rather well for large 
fields, this is not the case for small fields and the compression ratio can be 
very close to 100%, even with efficient compression algorithms.

In order to improve the compression ratio for small fields, I've written a 
{{StoredFieldsFormat}} that compresses several documents in a single chunk of 
data. To see how it behaves in terms of document deserialization speed and 
compression ratio, I've run several tests with different index compression 
strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
were indexed and stored):
 - no compression,
 - docs compressed with deflate (compression level = 1),
 - docs compressed with deflate (compression level = 9),
 - docs compressed with Snappy,
 - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
chunks of 6 docs,
 - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
chunks of 6 docs,
 - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
docs.
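
To illustrate why grouping documents helps, here is a minimal sketch that 
compares the deflate output size when each document is compressed on its own 
with the size when several documents are concatenated into a chunk first. It is 
not the code of the {{StoredFieldsFormat}} described above: the class name 
{{ChunkCompressionSketch}}, its method names and the synthetic documents are 
hypothetical, and it only uses {{java.util.zip.Deflater}} from the JDK. The 
sizes it prints are not comparable to the Wikipedia figures below.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class ChunkCompressionSketch {

    /** Returns the number of bytes produced by deflating {@code data}. */
    public static int deflatedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Doc-level compression: every document is deflated independently. */
    public static int perDocSize(String[] docs, int level) {
        int total = 0;
        for (String doc : docs) {
            total += deflatedSize(doc.getBytes(StandardCharsets.UTF_8), level);
        }
        return total;
    }

    /** Chunk-level compression: chunkSize documents are concatenated, then deflated together. */
    public static int chunkedSize(String[] docs, int chunkSize, int level) {
        int total = 0;
        for (int start = 0; start < docs.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, docs.length);
            ByteArrayOutputStream chunk = new ByteArrayOutputStream();
            for (int i = start; i < end; i++) {
                byte[] bytes = docs[i].getBytes(StandardCharsets.UTF_8);
                chunk.write(bytes, 0, bytes.length);
            }
            total += deflatedSize(chunk.toByteArray(), level);
        }
        return total;
    }

    public static void main(String[] args) {
        // Synthetic small documents (an assumption, not the Wikipedia data set).
        String[] docs = new String[60];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = "title: Article " + i
                    + "; text: a short stored field about Lucene stored fields";
        }
        System.out.println("per-doc bytes: " + perDocSize(docs, 1));
        System.out.println("chunked bytes: " + chunkedSize(docs, 6, 1));
    }
}
```

Small documents carry little internal redundancy and each one pays the fixed 
deflate stream overhead, whereas a chunk lets the compressor reuse matches 
across documents. The price is that reading a single document requires 
decompressing the chunk that contains it.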

For those who don't know Snappy, it is a compression algorithm from Google which 
does not achieve very high compression ratios, but compresses and decompresses 
data very quickly.

{noformat}
Format           Compression ratio     IndexReader.document time
————————————————————————————————————————————————————————————————
uncompressed     100%                  100%
doc/deflate 1     59%                  616%
doc/deflate 9     58%                  595%
doc/snappy        80%                  129%
index/deflate 1   49%                  966%
index/deflate 9   46%                  938%
index/snappy      65%                  264%
{noformat}

(doc = doc-level compression, index = index-level compression)

I find it interesting because it makes it possible to trade speed for space 
(with deflate, the .fdt file shrinks by a factor of 2, much better than with 
doc-level compression). Another interesting thing is that {{index/snappy}} is 
almost as compact as {{doc/deflate}} while being more than 2x faster at 
retrieving documents from disk.

These tests have been done on a hot OS cache, which is the worst case for 
compressed fields (one can expect better results for formats that have a high 
compression ratio since they probably require fewer read/write operations from 
disk).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
