[ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4226:
---------------------------------

    Attachment: CompressionBenchmark.java
                LUCENE-4226.patch

New patch as well as the code I used to benchmark.

Documents are still compressed into chunks, but I removed the ability to select 
the compression algorithm on a per-field basis, both to keep the patch simpler 
and to allow cross-field compression.

I also added an index in front of the compressed data, encoded with packed 
ints, so that decompressors can stop as soon as enough data has been 
decompressed.
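
To illustrate the idea, here is a minimal, hypothetical sketch of such a chunk 
layout (not the patch's actual format: a plain int[] stands in for the 
packed-ints index, and the JDK's deflate stands in for the pluggable 
compression):

{noformat}
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical chunk layout: per-document lengths up front, then one
 *  compressed blob holding the concatenated documents. */
public class ChunkSketch {

  /** Compresses a chunk and fills lengths[] with per-document byte sizes. */
  static byte[] compressChunk(byte[][] docs, int[] lengths) throws Exception {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    for (int d = 0; d < docs.length; d++) {
      lengths[d] = docs[d].length; // stored as packed ints in the real patch
      raw.write(docs[d]);
    }
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(raw.toByteArray());
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  /** Decompresses only until document docId is fully available, then stops. */
  static byte[] readDoc(byte[] compressed, int[] lengths, int docId) throws Exception {
    int start = 0;
    for (int d = 0; d < docId; d++) {
      start += lengths[d];
    }
    int end = start + lengths[docId];
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] uncompressed = new byte[end]; // never decompress past 'end'
    int off = 0;
    while (off < end) {
      off += inflater.inflate(uncompressed, off, end - off);
    }
    inflater.end();
    byte[] doc = new byte[lengths[docId]];
    System.arraycopy(uncompressed, start, doc, 0, doc.length);
    return doc;
  }

  public static void main(String[] args) throws Exception {
    byte[][] docs = { "first doc".getBytes(), "second doc".getBytes() };
    int[] lengths = new int[docs.length];
    byte[] chunk = compressChunk(docs, lengths);
    System.out.println(new String(readDoc(chunk, lengths, 0))); // "first doc"
  }
}
{noformat}

The point is in readDoc: since the index gives the per-document lengths up 
front, inflation can stop at the end of the requested document instead of 
decompressing the whole chunk.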

The JDK only ships one moderately fast compression algorithm (deflate), but for 
this use case we are probably more interested in algorithms with fast 
compression and decompression, such as LZ4 (http://code.google.com/p/lz4/) or 
Snappy (http://code.google.com/p/snappy/). Since lucene-core has no 
dependencies, I ported LZ4 to Java (included in the patch, see 
o.a.l.util.compress).

LZ4 has a very fast decompressor and two compression modes:
 - fast scan, which looks for the last offset in the stream that shares at 
least 4 bytes with the current position (using a hash table) and emits a 
reference to it (sketched after this list),
 - high compression, which examines the last 256 offsets in the stream that 
share at least 4 bytes, takes the one with the longest common sequence, and 
then trades off between overlapping matches in order to improve the 
compression ratio.
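
For the curious, here is a rough, self-contained sketch of the fast-scan match 
search (a hypothetical illustration, not the o.a.l.util.compress port; the 
hash multiplier and table size mimic LZ4's defaults, and the 64KB offset limit 
of the real format is ignored):

{noformat}
import java.util.Arrays;

/** Hypothetical sketch of LZ4's fast-scan mode: hash the next 4 bytes,
 *  look up the last position that hashed identically, and use it as the
 *  single match candidate. */
public class Lz4ScanSketch {

  static final int HASH_BITS = 12; // 4096-entry hash table

  /** Multiplicative hash of the 4 bytes starting at position i. */
  static int hash(byte[] buf, int i) {
    int v = (buf[i] & 0xFF) | ((buf[i + 1] & 0xFF) << 8)
          | ((buf[i + 2] & 0xFF) << 16) | ((buf[i + 3] & 0xFF) << 24);
    return (v * -1640531535) >>> (32 - HASH_BITS); // 2654435761 as a signed int
  }

  /** Prints the (literals, match) structure found by a single greedy pass. */
  public static void main(String[] args) {
    byte[] buf = "abcdefgh_abcdefgh_abcdefgh".getBytes();
    int[] lastPos = new int[1 << HASH_BITS];
    Arrays.fill(lastPos, -1);
    int i = 0, anchor = 0;
    while (i + 4 <= buf.length) {
      int h = hash(buf, i);
      int candidate = lastPos[h];
      lastPos[h] = i; // remember only the last offset per bucket
      if (candidate >= 0
          && buf[candidate] == buf[i] && buf[candidate + 1] == buf[i + 1]
          && buf[candidate + 2] == buf[i + 2] && buf[candidate + 3] == buf[i + 3]) {
        // At least 4 common bytes: extend the match as far as possible.
        int len = 4;
        while (i + len < buf.length && buf[candidate + len] == buf[i + len]) {
          len++;
        }
        System.out.println("literals [" + anchor + ", " + i + "), match offset="
            + (i - candidate) + " len=" + len);
        i += len;
        anchor = i;
      } else {
        i++; // no match here: this byte will be emitted as a literal
      }
    }
    System.out.println("trailing literals [" + anchor + ", " + buf.length + ")");
  }
}
{noformat}

The high-compression mode differs mainly in the match search: instead of 
remembering a single position per hash bucket, it considers up to 256 previous 
positions, picks the longest match, and then arbitrates between overlapping 
candidates.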

(In case you are curious about LZ4, I benchmarked it against other compression 
algorithms in http://blog.jpountz.net/post/28092106032/wow-lz4-is-fast; 
unfortunately the high-compression Java impl is not included in that 
benchmark.)

I ran a benchmark similar to the one for my first patch, but this time I only 
compressed and stored the 1kb text field (the title field is too small to be a 
fair test of document-level compression with deflate). Here are the results:

{noformat}
Format           Chunk size  Compression ratio     IndexReader.document time
————————————————————————————————————————————————————————————————————————————
uncompressed                               100%                         100%
doc/deflate 1                               58%                         579%
doc/deflate 9                               57%                         577%
index/deflate 1          4K                 50%                        1057%
index/deflate 9          4K                 48%                        1037%
index/lz4 scan           4K                 70%                         329%
index/lz4 hc             4K                 66%                         321%
index/deflate 1           1                 60%                         457%
index/deflate 9           1                 59%                         454%
index/lz4 scan            1                 81%                         171%
index/lz4 hc              1                 79%                         176%
{noformat}

NOTE: chunk size = 1 means that each chunk contained a single document (a 
compress+flush is triggered every time the cumulative byte size of buffered 
documents is >= the chunk size).
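
To make the flush rule concrete, a hypothetical buffering writer could look 
like this (all names are made up for illustration):

{noformat}
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the flush rule: documents are buffered
 *  until their cumulative byte size reaches the chunk size. With
 *  chunkSize = 1, every document triggers a flush on its own. */
public class FlushRuleSketch {
  final int chunkSize;
  final List<byte[]> pending = new ArrayList<>();
  int pendingBytes;

  FlushRuleSketch(int chunkSize) {
    this.chunkSize = chunkSize;
  }

  void addDocument(byte[] doc) {
    pending.add(doc);
    pendingBytes += doc.length;
    if (pendingBytes >= chunkSize) { // the compress+flush threshold
      System.out.println("flush: " + pending.size() + " doc(s), "
          + pendingBytes + " bytes");
      pending.clear();
      pendingBytes = 0;
    }
  }

  public static void main(String[] args) {
    FlushRuleSketch writer = new FlushRuleSketch(1);
    writer.addDocument(new byte[100]); // flushes a 1-doc chunk immediately
    writer.addDocument(new byte[100]); // flushes another 1-doc chunk
  }
}
{noformat}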

NOTE: these numbers were computed with the whole index fitting in the I/O 
cache. Performance should shift further in favor of the compressing formats as 
soon as the index no longer fits in the I/O cache.

There are still a few nocommits in the patch, but it should be easy to get rid 
of them. I'd be very happy to have some feedback. :-)
                
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Priority: Trivial
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java, 
> LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common 
> for an index with stored fields enabled to have most of its space used by the 
> .fdt index file. To prevent this .fdt file from growing too much, one option 
> is to compress stored fields. Although compression works rather well for 
> large fields, this is not the case for small fields and the compression ratio 
> can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a 
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of 
> data. To see how it behaves in terms of document deserialization speed and 
> compression ratio, I've run several tests with different index compression 
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
> were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google 
> which does not achieve very high compression ratios, but compresses and 
> decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ————————————————————————————————————————————————————————————————
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it allows trading speed for space (with 
> deflate, the .fdt file shrinks by a factor of 2, much better than with 
> doc-level compression). Another interesting point is that {{index/snappy}} is 
> almost as compact as {{doc/deflate}} while being more than 2x faster at 
> retrieving documents from disk.
> These tests were run with a hot OS cache, which is the worst case for 
> compressed fields (one can expect better results for formats that have a high 
> compression ratio since they probably require fewer read/write operations 
> from disk).
