[
https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443939#comment-13443939
]
Adrien Grand commented on LUCENE-4226:
--------------------------------------
Thanks Dawid and Eks for your feedback!
bq. allocate a static, write-only buffer of 4 or 8kb once and just reuse it
Right, sounds like a better default impl!
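Something like the following is what I understand by that suggestion, as a rough
sketch (the SkipUtil name, the 8kb size and the InputStream-based signature are
placeholders, not the actual API). Since the buffer is write-only, i.e. bytes
are copied into it but never read back, a single shared array can be reused
safely:
{noformat}
import java.io.IOException;
import java.io.InputStream;

// Sketch only: a default drain/skip-style implementation that reuses one
// shared scratch buffer instead of allocating a new array on every call.
// The buffer is "write-only": bytes are read into it and never inspected,
// so sharing the same array is harmless.
final class SkipUtil {
  private static final byte[] SCRATCH = new byte[8 * 1024]; // allocated once

  private SkipUtil() {}

  /** Skips exactly {@code n} bytes by draining them into the shared scratch buffer. */
  static void skipBytes(InputStream in, long n) throws IOException {
    while (n > 0) {
      final int read = in.read(SCRATCH, 0, (int) Math.min(n, SCRATCH.length));
      if (read < 0) {
        throw new IOException("unexpected end of stream while skipping bytes");
      }
      n -= read;
    }
  }
}
{noformat}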
bq. ByteArrayDataOutput.java – there seems to be a class for this in
org.apache.lucene.store.ByteArrayDataOutput?
I wanted to reuse this class, but I needed something that would grow when
necessary (oal.store.BADO just throws an exception when you try to write past
the end of the buffer). I could have managed growth externally, with checks on
the buffer length plus calls to ArrayUtil.grow and BADO.reset, but it was just
as simple to write a ByteArrayDataOutput variant that manages growth
internally...
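For illustration, a minimal sketch of what I mean (the class name and the
growth policy below are made up for the example and are not the actual patch):
{noformat}
import java.util.Arrays;

// Sketch of a byte-array output that grows on demand instead of throwing
// when a write goes past the end of the buffer.
final class GrowableByteArrayDataOutput {
  private byte[] bytes;
  private int length;

  GrowableByteArrayDataOutput(int initialCapacity) {
    bytes = new byte[Math.max(initialCapacity, 16)];
  }

  private void ensureCapacity(int extra) {
    if (length + extra > bytes.length) {
      // Over-allocate to amortize the cost of repeated growth, similar in
      // spirit to what ArrayUtil.grow does.
      bytes = Arrays.copyOf(bytes,
          Math.max(length + extra, bytes.length + (bytes.length >> 1)));
    }
  }

  void writeByte(byte b) {
    ensureCapacity(1);
    bytes[length++] = b;
  }

  void writeBytes(byte[] b, int off, int len) {
    ensureCapacity(len);
    System.arraycopy(b, off, bytes, length, len);
    length += len;
  }

  byte[] bytes() { return bytes; }
  int length() { return length; }
}
{noformat}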
bq. Maybe it is worth keeping it there for really short fields. General
compression algorithms are great for bigger amounts of data, but for really
short fields there is nothing like per-field compression. Think about database
usage, e.g. fields with low cardinality, or fields with a restricted symbol
set (only digits in a long UID field, for example). Say zip code or product
color... these are perfectly compressed using something with a static
dictionary approach (a static Huffman coder with escape symbols, at bit level,
or plain vanilla dictionary lookup), and both of them are insanely fast and
compress heavily.
Right, this is exactly why I implemented per-field compression first. Both
per-field and cross-field compression have pros and cons. Cross-field
compression allows less fine-grained tuning, but it would probably make a
better default since the compression ratio would be better out of the box.
Maybe we should implement both?
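To make the static-dictionary idea quoted above a bit more concrete, here is a
toy sketch for a low-cardinality field such as a product color (the class name,
the 1-byte codes and the escape convention are invented for the example; a real
per-field compressor would be more involved):
{noformat}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of "plain vanilla dictionary lookup": known values are
// replaced by a single-byte code, unknown values are escaped and stored as-is.
final class StaticDictionaryCodec {
  private static final int ESCAPE = 0xFF; // reserved for out-of-dictionary values
  private final Map<String, Integer> codes = new HashMap<>();
  private final String[] values;

  StaticDictionaryCodec(String... dictionary) {
    if (dictionary.length >= ESCAPE) {
      throw new IllegalArgumentException("dictionary too large for 1-byte codes");
    }
    values = dictionary;
    for (int i = 0; i < dictionary.length; i++) {
      codes.put(dictionary[i], i);
    }
  }

  /** A known value compresses to a single byte, otherwise ESCAPE + UTF-8 bytes. */
  byte[] compress(String value) {
    Integer code = codes.get(value);
    if (code != null) {
      return new byte[] { code.byteValue() };
    }
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[1 + utf8.length];
    out[0] = (byte) ESCAPE;
    System.arraycopy(utf8, 0, out, 1, utf8.length);
    return out;
  }

  String decompress(byte[] data) {
    int code = data[0] & 0xFF;
    return code != ESCAPE
        ? values[code]
        : new String(data, 1, data.length - 1, StandardCharsets.UTF_8);
  }
}
{noformat}
For example, new StaticDictionaryCodec("red", "green", "blue").compress("red")
is a single byte: insanely fast to decode and much smaller than what a
general-purpose algorithm can achieve on such a short value.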
I was also thinking that some codecs, such as this kind of per-field
compression, and maybe even the bloom, memory, direct and pulsing postings
formats, might deserve a separate "codecs" module where we could put these
non-default "expert" codecs.
> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
> Key: LUCENE-4226
> URL: https://issues.apache.org/jira/browse/LUCENE-4226
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Adrien Grand
> Priority: Trivial
> Attachments: CompressionBenchmark.java, CompressionBenchmark.java,
> LUCENE-4226.patch, LUCENE-4226.patch, SnappyCompressionAlgorithm.java
>
>
> I've been doing some experiments with stored fields lately. It is very common
> for an index with stored fields enabled to have most of its space used by the
> .fdt index file. To prevent this .fdt file from growing too much, one option
> is to compress stored fields. Although compression works rather well for
> large fields, this is not the case for small fields, whose compression ratio
> can be very close to 100% (almost no space saved), even with efficient
> compression algorithms.
> In order to improve the compression ratio for small fields, I've written a
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of
> data. To see how it behaves in terms of document deserialization speed and
> compression ratio, I've run several tests with different index compression
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text
> were indexed and stored):
> - no compression,
> - docs compressed with deflate (compression level = 1),
> - docs compressed with deflate (compression level = 9),
> - docs compressed with Snappy,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and
> chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6
> docs.
> For those who don't know Snappy, it is a compression algorithm from Google
> which does not compress as densely as deflate, but compresses and
> decompresses data very quickly.
> {noformat}
> Format            Compression ratio   IndexReader.document time
> ----------------------------------------------------------------
> uncompressed       100%                100%
> doc/deflate 1       59%                616%
> doc/deflate 9       58%                595%
> doc/snappy          80%                129%
> index/deflate 1     49%                966%
> index/deflate 9     46%                938%
> index/snappy        65%                264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it allows trading speed for space (with
> deflate, the .fdt file shrinks by a factor of 2, much better than with
> doc-level compression). Another interesting thing is that {{index/snappy}}
> is almost as compact as {{doc/deflate}} while being more than 2x faster at
> retrieving documents from disk.
> These tests were run with a hot OS cache, which is the worst case for
> compressed fields (one can expect better relative results for formats that
> compress well, since they probably require fewer read/write operations from
> disk).
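To make the chunking idea in the description above more concrete, here is a
hedged sketch using plain java.util.zip (the ChunkCompressor name and the
hard-coded chunk size of 6 docs are illustrative only; a real implementation
would also need to record per-document offsets so that a single document can
be located inside a chunk):
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Sketch: instead of deflating each document on its own, buffer the
// serialized bytes of several documents and compress them in one shot,
// so that small documents share the compression context.
final class ChunkCompressor {
  private static final int DOCS_PER_CHUNK = 6;
  private final List<byte[]> pending = new ArrayList<>();

  /** Buffers one serialized document; returns a compressed chunk once full, else null. */
  byte[] addDocument(byte[] serializedDoc) throws IOException {
    pending.add(serializedDoc);
    return pending.size() >= DOCS_PER_CHUNK ? flushChunk() : null;
  }

  byte[] flushChunk() throws IOException {
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    Deflater deflater = new Deflater(Deflater.BEST_SPEED); // "deflate, level = 1"
    try (DeflaterOutputStream out = new DeflaterOutputStream(compressed, deflater)) {
      for (byte[] doc : pending) {
        out.write(doc); // all docs of the chunk go through one compression stream
      }
    } finally {
      deflater.end();
    }
    pending.clear();
    return compressed.toByteArray();
  }
}
{noformat}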
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]