[ 
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4226:
---------------------------------

    Attachment: LUCENE-4226.patch

Thanks for your kind words, David!

Here is a new version of the patch. I've thought a lot about whether or not to
let users configure per-field compression, but I think we should just try to
provide something simple that improves the compression ratio by allowing
cross-field and cross-document compression. People who have very specific
needs can still implement their own {{StoredFieldsFormat}}.

Moreover, I've had a discussion with Robert, who argued that we should limit the
number of classes that are exposed as an SPI because they add complexity (for
example, Solr needs to reload SPI registers every time it adds a core lib
directory to the classpath). So I tried to make it simpler: there is no more
{{CompressionCodec}}, and people can choose between 3 different compression
modes:
 - FAST, which uses LZ4's fast compressors and decompressors (for indices that
have a high update rate),
 - HIGH_COMPRESSION, which uses deflate (for people who want the smallest
possible stored fields, i.e. the lowest compressed/original size ratio, no
matter what the performance penalty is),
 - FAST_UNCOMPRESSION, which spends more time compressing using LZ4's
compress_HC method but still has very fast decompression (for indices that
have a reasonable update rate and need good read performance).

I also added a test case and applied Dawid's advice to replace the default 
{{skipBytes}} implementation with a bulk-write into a write-only buffer.
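
To illustrate the {{skipBytes}} idea, here is a minimal sketch, not the code from the patch. It assumes a plain java.io.InputStream as a stand-in for Lucene's input abstraction; the class name SkipBytesSketch and the SKIP_BUFFER field are hypothetical. Instead of discarding bytes one at a time, it bulk-reads them into a scratch buffer whose contents are never read back.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class SkipBytesSketch {

    // Scratch buffer that is only ever written to and never read back.
    // Because its contents are irrelevant, it can be shared by all callers.
    private static final byte[] SKIP_BUFFER = new byte[1024];

    /** Skips exactly numBytes bytes by bulk-reading them into the scratch buffer. */
    public static void skipBytes(InputStream in, long numBytes) throws IOException {
        if (numBytes < 0) {
            throw new IllegalArgumentException("numBytes must be >= 0, got " + numBytes);
        }
        long remaining = numBytes;
        while (remaining > 0) {
            int step = (int) Math.min(SKIP_BUFFER.length, remaining);
            int read = in.read(SKIP_BUFFER, 0, step);
            if (read < 0) {
                throw new EOFException("Reached end of stream with " + remaining + " bytes left to skip");
            }
            remaining -= read;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];
        data[4000] = 42; // marker byte we expect to land on after skipping
        InputStream in = new ByteArrayInputStream(data);
        skipBytes(in, 4000);
        System.out.println(in.read()); // prints 42
    }
}
```

The buffer size of 1024 bytes is an arbitrary choice for this sketch.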

Here is a new benchmark that shows how this new codec can help compress stored 
fields. This time, I indexed some access.log files generated by Apache HTTP 
server. A document consists of a line from the log file and is typically 
between 100 and 300 bytes. Because every line contains the date of the request, 
its path and the user-agent of the client, there is a lot of redundancy across 
documents.

{noformat}
Format            Chunk size  Compression ratio     IndexReader.document time
—————————————————————————————————————————————————————————————————————————————
uncompressed                               100%                         100%
doc/deflate 1                               90%                        1557%
doc/deflate 9                               90%                        1539%
index/FAST               512                50%                         197%
index/HIGH_COMPRESSION   512                44%                        1545%
index/FAST_UNCOMPRESSION 512                50%                         198%
{noformat}

Because documents are very small, document-level compression doesn't work well 
and only makes the .fdt file 10% smaller while loading documents from disk is 
more than 15 times slower on a hot OS cache.

However, with this kind of highly redundant input, {{CompressionMode.FAST}}
looks very interesting as it divides the size of the .fdt file by 2 and only
makes IndexReader.document twice as slow.
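
The effect of cross-document compression on small, redundant documents can be reproduced outside of Lucene. The following is a rough sketch that only uses java.util.zip.Deflater from the JDK; it is not the codec from the patch. The class name ChunkCompressionSketch, the helper names and the synthetic log lines are all made up for illustration, and chunks are counted in documents rather than bytes to keep the code short.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class ChunkCompressionSketch {

    /** Returns the number of bytes produced by deflating the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Document-level compression: every document is compressed on its own. */
    static int sizePerDocument(List<String> docs) {
        int total = 0;
        for (String doc : docs) {
            total += deflatedSize(doc.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Chunk-level compression: several documents share one compressed block. */
    static int sizePerChunk(List<String> docs, int docsPerChunk) {
        int total = 0;
        for (int start = 0; start < docs.size(); start += docsPerChunk) {
            int end = Math.min(docs.size(), start + docsPerChunk);
            String chunk = String.join("\n", docs.subList(start, end));
            total += deflatedSize(chunk.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** Synthetic access-log-like lines (invented data, not the benchmark input). */
    static List<String> syntheticLogLines(int count) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            lines.add("192.168.0." + (i % 50) + " - - [10/Oct/2012:13:55:"
                + String.format("%02d", i % 60) + " +0000] \"GET /page/" + i
                + ".html HTTP/1.1\" 200 " + (1000 + i)
                + " \"-\" \"Mozilla/5.0 (X11; Linux x86_64)\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> docs = syntheticLogLines(1000);
        int raw = 0;
        for (String doc : docs) {
            raw += doc.getBytes(StandardCharsets.UTF_8).length;
        }
        System.out.println("raw bytes:            " + raw);
        System.out.println("per-document deflate: " + sizePerDocument(docs));
        System.out.println("16-document chunks:   " + sizePerChunk(docs, 16));
    }
}
```

On input like this, the chunked total should come out smaller than the per-document total, because each chunk lets the compressor reuse strings shared by neighbouring lines. The price is that reading one document requires decompressing part or all of its chunk.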
                
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Priority: Trivial
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java, 
> LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch, 
> SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common 
> for an index with stored fields enabled to have most of its space used by the 
> .fdt index file. To prevent this .fdt file from growing too much, one option 
> is to compress stored fields. Although compression works rather well for 
> large fields, this is not the case for small fields and the compression ratio 
> can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a 
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of 
> data. To see how it behaves in terms of document deserialization speed and 
> compression ratio, I've run several tests with different index compression 
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text 
> were indexed and stored):
>  - no compression,
>  - docs compressed with deflate (compression level = 1),
>  - docs compressed with deflate (compression level = 9),
>  - docs compressed with Snappy,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and 
> chunks of 6 docs,
>  - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6 
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google
> which does not compress as well as deflate (its compression ratios are
> high), but compresses and decompresses data very quickly.
> {noformat}
> Format           Compression ratio     IndexReader.document time
> ————————————————————————————————————————————————————————————————
> uncompressed     100%                  100%
> doc/deflate 1     59%                  616%
> doc/deflate 9     58%                  595%
> doc/snappy        80%                  129%
> index/deflate 1   49%                  966%
> index/deflate 9   46%                  938%
> index/snappy      65%                  264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it allows trading speed for space (with
> deflate, the .fdt file shrinks by a factor of 2, much better than with
> doc-level compression). One other interesting thing is that {{index/snappy}}
> is almost as compact as {{doc/deflate}} while it is more than 2x faster at
> retrieving documents from disk.
> These tests have been done on a hot OS cache, which is the worst case for 
> compressed fields (one can expect better results for formats that have a high 
> compression ratio since they probably require fewer read/write operations 
> from disk).

