[
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451867#comment-13451867
]
Adrien Grand commented on LUCENE-4226:
--------------------------------------
Otis shared a link to this issue on Twitter
https://twitter.com/otisg/status/244996292743405571 and some people seem to
wonder how it compares to ElasticSearch's block compression.
ElasticSearch's block compression uses a similar idea: data is compressed into
blocks (with fixed sizes that are independent from document sizes). It is based
on a CompressedIndexInput/CompressedIndexOutput: Upon closing,
CompressedIndexOutput writes a metadata table at the end of the wrapped output
that contains the start offset of every compressed block. Upon creation, a
CompressedIndexInput first loads this metadata table into memory and can then
use it whenever it needs to seek. This is probably the best way to compress
small docs with Lucene 3.x.
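The seek mechanism such a metadata table enables can be sketched as follows. This is an illustrative stand-in, not ElasticSearch's actual API: given the uncompressed start offset of every block, finding the block that contains a target offset is a binary search.

```java
import java.util.Arrays;

// Hypothetical sketch of the metadata table a CompressedIndexInput might load:
// startOffsets[i] is the uncompressed offset at which block i begins.
public class BlockOffsets {

    // Returns the index of the block containing `target`.
    public static int blockIndexForOffset(long[] startOffsets, long target) {
        int idx = Arrays.binarySearch(startOffsets, target);
        // On a miss, binarySearch returns (-insertionPoint - 1); the block
        // containing `target` is the one that starts just before it.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        long[] starts = {0, 16384, 32768, 49152}; // fixed-size 16 KB blocks
        System.out.println(blockIndexForOffset(starts, 20000)); // prints 1
    }
}
```

Seeking then means uncompressing only the one block found, rather than everything before the target offset.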
With this patch, the size of blocks is not completely independent from document
sizes: I make sure that documents don't spread across compressed blocks, so
that reading a document never requires more than one block to be uncompressed.
Moreover, the LZ4 uncompressor (used by FAST and FAST_UNCOMPRESSION) can stop
as soon as it has uncompressed enough data. So unless you need the last
document of a compressed block, the uncompressor will very likely return
without uncompressing the whole block.
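The early-exit idea can be illustrated outside LZ4 as well. The sketch below uses java.util.zip's Inflater as a stand-in for the patch's LZ4 uncompressor (it is not the patch's code): because the caller controls the output buffer, inflation can stop as soon as the requested document's bytes have been produced, leaving the rest of the block untouched.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch: uncompress only a prefix of a compressed block,
// enough to read a document stored near its beginning.
public class PartialUncompress {

    public static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64]; // enough for small inputs
        int len = 0;
        while (!deflater.finished()) {
            len += deflater.deflate(buf, len, buf.length - len);
        }
        deflater.end();
        return Arrays.copyOf(buf, len);
    }

    public static byte[] uncompressPrefix(byte[] compressed, int neededBytes) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[neededBytes];
        int off = 0;
        try {
            // Stop as soon as neededBytes have been produced; the rest of
            // the block is never uncompressed.
            while (off < neededBytes && !inflater.finished()) {
                off += inflater.inflate(out, off, neededBytes - off);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
        inflater.end();
        return out;
    }

    public static void main(String[] args) {
        byte[] chunk = "doc0-contents doc1-contents doc2-contents".getBytes();
        byte[] compressed = compress(chunk);
        // Reading the first document only inflates a prefix of the block.
        System.out.println(new String(uncompressPrefix(compressed, 13)));
    }
}
```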
Therefore I expect this StoredFieldsFormat to have a similar compression ratio
to ElasticSearch's block compression (provided that similar compression
algorithms are used) but to be a little faster at loading documents from disk.
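To see why grouping several documents into one compressed chunk improves the ratio (the idea behind this StoredFieldsFormat, as described in the quoted issue below), here is a small self-contained sketch. It uses java.util.zip's Deflater as a stand-in for the benchmarked codecs; the document text is made up for illustration.

```java
import java.util.zip.Deflater;

// Illustrative comparison: total compressed size of 6 small, similar docs
// compressed one by one vs. concatenated into a single chunk.
public class ChunkVsDoc {

    public static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED); // ~ level 1
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64]; // enough for small inputs
        int len = 0;
        while (!deflater.finished()) {
            len += deflater.deflate(buf, len, buf.length - len);
        }
        deflater.end();
        return len;
    }

    // Returns {total size with doc-level compression, size as one chunk}.
    public static int[] compareSizes(String doc, int numDocs) {
        StringBuilder chunk = new StringBuilder();
        int perDoc = 0;
        for (int i = 0; i < numDocs; i++) {
            perDoc += deflatedSize((doc + i).getBytes());
            chunk.append(doc).append(i);
        }
        int chunked = deflatedSize(chunk.toString().getBytes());
        return new int[] {perDoc, chunked};
    }

    public static void main(String[] args) {
        String doc = "title: Lucene | text: compression of small stored fields ";
        int[] sizes = compareSizes(doc, 6);
        System.out.println("doc-level: " + sizes[0]
                + " bytes, chunked: " + sizes[1] + " bytes");
    }
}
```

Small documents carry per-stream overhead and give the compressor too little context to find redundancy; a chunk lets matches span documents, so the chunked total comes out smaller.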
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
> Key: LUCENE-4226
> URL: https://issues.apache.org/jira/browse/LUCENE-4226
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Adrien Grand
> Priority: Trivial
> Attachments: CompressionBenchmark.java, CompressionBenchmark.java,
> LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch,
> SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common
> for an index with stored fields enabled to have most of its space used by the
> .fdt index file. To prevent this .fdt file from growing too much, one option
> is to compress stored fields. Although compression works rather well for
> large fields, this is not the case for small fields and the compression ratio
> can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of
> data. To see how it behaves in terms of document deserialization speed and
> compression ratio, I've run several tests with different index compression
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text
> were indexed and stored):
> - no compression,
> - docs compressed with deflate (compression level = 1),
> - docs compressed with deflate (compression level = 9),
> - docs compressed with Snappy,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google
> which does not aim for the highest compression ratios, but compresses and
> decompresses data very quickly.
> {noformat}
> Format            Compression ratio   IndexReader.document time
> ---------------------------------------------------------------
> uncompressed            100%                    100%
> doc/deflate 1            59%                    616%
> doc/deflate 9            58%                    595%
> doc/snappy               80%                    129%
> index/deflate 1          49%                    966%
> index/deflate 9          46%                    938%
> index/snappy             65%                    264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it makes it possible to trade speed for space
> (with deflate, the .fdt file shrinks by a factor of 2, much better than with
> doc-level compression). Another interesting point is that {{index/snappy}}
> is almost as compact as {{doc/deflate}} while being more than 2x faster at
> retrieving documents from disk.
> These tests were run with a hot OS cache, which is the worst case for
> compressed fields (one can expect comparatively better results for formats
> with a high compression ratio, since they probably require fewer read
> operations from disk).