[ https://issues.apache.org/jira/browse/LUCENE-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-4226:
---------------------------------
    Attachment: LUCENE-4226.patch

Thanks for your kind words, David! Here is a new version of the patch. I've thought a lot about whether or not to let users configure per-field compression, but I think we should just try to provide something simple that improves the compression ratio by allowing cross-field and cross-document compression; people who have very specific needs can still implement their own {{StoredFieldsFormat}}. Moreover, I've had a discussion with Robert, who argued that we should limit the number of classes that are exposed via SPI because they add complexity (for example, Solr needs to reload SPI registries every time it adds a core lib directory to the classpath). So I tried to make it simpler: there is no {{CompressionCodec}} anymore, and people can choose between 3 different compression modes:
- FAST, which uses LZ4's fast compressor and decompressor (for indices that have a high update rate),
- HIGH_COMPRESSION, which uses deflate (for people who want the best, i.e. lowest, compression ratio, no matter the performance penalty),
- FAST_UNCOMPRESSION, which spends more time compressing, using LZ4's compress_HC method, but still decompresses very quickly (for indices that have a reasonable update rate and need good read performance).

I also added a test case and applied Dawid's advice to replace the default {{skipBytes}} implementation with a bulk write into a write-only buffer.

Here is a new benchmark that shows how this new codec can help compress stored fields. This time, I indexed some access.log files generated by the Apache HTTP server. A document consists of a single line from the log file and is typically between 100 and 300 bytes. Because every line contains the date of the request, its path and the user agent of the client, there is a lot of redundancy across documents.
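To see why cross-document compression pays off on this kind of input, here is a quick sketch that compresses 100 made-up access.log lines both one at a time and as a single chunk. It uses the JDK's {{Deflater}} as a stand-in for LZ4 (the JDK ships no LZ4 binding), and the class name and log lines are invented for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class ChunkCompressionSketch {

    // Deflate 'input' at the given level and return the compressed size in bytes.
    static int deflatedLength(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64]; // large enough for one pass
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    // Returns {sum of per-document compressed sizes, compressed size of the whole chunk}.
    static int[] compare() {
        StringBuilder chunk = new StringBuilder();
        int perDoc = 0;
        for (int i = 0; i < 100; i++) {
            // Made-up access.log lines: lots of redundancy across documents.
            String doc = String.format(
                "127.0.0.1 - - [10/Jul/2012:10:00:%02d +0000] \"GET /index.html HTTP/1.1\" 200 2326 \"Mozilla/5.0\"%n",
                i % 60);
            perDoc += deflatedLength(doc.getBytes(StandardCharsets.UTF_8), Deflater.BEST_SPEED);
            chunk.append(doc);
        }
        int wholeChunk = deflatedLength(chunk.toString().getBytes(StandardCharsets.UTF_8), Deflater.BEST_SPEED);
        return new int[] { perDoc, wholeChunk };
    }

    public static void main(String[] args) {
        int[] r = compare();
        System.out.println("per-document compression: " + r[0] + " bytes");
        System.out.println("whole-chunk compression:  " + r[1] + " bytes");
    }
}
```

On this input, compressing the whole chunk comes out far smaller than the sum of per-document compressions, because the redundancy lives *across* documents, where a per-document codec cannot see it.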
{noformat}
Format                     Chunk size   Compression ratio   IndexReader.document time
-------------------------------------------------------------------------------------
uncompressed                                  100%                   100%
doc/deflate 1                                  90%                  1557%
doc/deflate 9                                  90%                  1539%
index/FAST                     512             50%                   197%
index/HIGH_COMPRESSION         512             44%                  1545%
index/FAST_UNCOMPRESSION       512             50%                   198%
{noformat}

Because documents are very small, document-level compression doesn't work well: it only makes the .fdt file 10% smaller, while loading documents from disk is more than 15 times slower on a hot OS cache. However, with this kind of highly redundant input, {{CompressionMode.FAST}} looks very interesting, as it divides the size of the .fdt file by 2 and only makes {{IndexReader.document}} twice as slow.

> Efficient compression of small to medium stored fields
> ------------------------------------------------------
>
>                 Key: LUCENE-4226
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4226
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Adrien Grand
>            Priority: Trivial
>         Attachments: CompressionBenchmark.java, CompressionBenchmark.java,
>                      LUCENE-4226.patch, LUCENE-4226.patch, LUCENE-4226.patch,
>                      SnappyCompressionAlgorithm.java
>
> I've been doing some experiments with stored fields lately. It is very common
> for an index with stored fields enabled to have most of its space used by the
> .fdt index file. To prevent this .fdt file from growing too much, one option
> is to compress stored fields. Although compression works rather well for
> large fields, this is not the case for small fields, where the compression
> ratio can be very close to 100%, even with efficient compression algorithms.
> In order to improve the compression ratio for small fields, I've written a
> {{StoredFieldsFormat}} that compresses several documents in a single chunk of
> data.
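The chunking idea described above, buffering serialized documents and compressing several of them as one block, can be sketched roughly like this. This is a hypothetical simplification, not the patch's actual code: the class and method names are invented, and it uses the JDK's {{DeflaterOutputStream}} in place of LZ4 since the JDK has no LZ4 binding:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class ChunkedWriterSketch {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final int chunkSize; // flush threshold, in bytes of buffered documents
    private int numBufferedDocs = 0;
    private int bytesWritten = 0;

    ChunkedWriterSketch(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    // Buffer one serialized document; compress the whole chunk once the threshold is reached.
    void addDocument(byte[] serializedDoc) throws IOException {
        pending.write(serializedDoc);
        numBufferedDocs++;
        if (pending.size() >= chunkSize) {
            flush();
        }
    }

    // Compress all buffered documents as a single deflate stream.
    void flush() throws IOException {
        if (numBufferedDocs == 0) return;
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(compressed, new Deflater(Deflater.BEST_SPEED))) {
            pending.writeTo(out);
        }
        bytesWritten += compressed.size(); // the real format would write this to the .fdt file
        pending.reset();
        numBufferedDocs = 0;
    }

    int bytesWritten() {
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        ChunkedWriterSketch w = new ChunkedWriterSketch(512);
        byte[] doc = "GET /index.html HTTP/1.1 200".getBytes();
        int raw = 0;
        for (int i = 0; i < 100; i++) {
            w.addDocument(doc);
            raw += doc.length;
        }
        w.flush(); // compress whatever is left in the last, partial chunk
        System.out.println("raw: " + raw + " bytes, compressed: " + w.bytesWritten() + " bytes");
    }
}
```

The key point is that documents smaller than the chunk size still get compressed together with their neighbors, so the codec can exploit redundancy that spans document boundaries.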
> To see how it behaves in terms of document deserialization speed and
> compression ratio, I've run several tests with different index compression
> strategies on 100,000 docs from Mike's 1K Wikipedia articles (title and text
> were indexed and stored):
> - no compression,
> - docs compressed with deflate (compression level = 1),
> - docs compressed with deflate (compression level = 9),
> - docs compressed with Snappy,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 1) and
>   chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with deflate (level = 9) and
>   chunks of 6 docs,
> - using the compressing {{StoredFieldsFormat}} with Snappy and chunks of 6
>   docs.
> For those who don't know Snappy, it is a compression algorithm from Google
> that trades compression ratio for speed: it compresses and decompresses data
> very quickly.
> {noformat}
> Format            Compression ratio   IndexReader.document time
> ----------------------------------------------------------------
> uncompressed            100%                   100%
> doc/deflate 1            59%                   616%
> doc/deflate 9            58%                   595%
> doc/snappy               80%                   129%
> index/deflate 1          49%                   966%
> index/deflate 9          46%                   938%
> index/snappy             65%                   264%
> {noformat}
> (doc = doc-level compression, index = index-level compression)
> I find it interesting because it makes it possible to trade speed for space
> (with deflate, the .fdt file shrinks by a factor of 2, much better than with
> doc-level compression). Another interesting result is that {{index/snappy}}
> is almost as compact as {{doc/deflate}} while being more than 2x faster at
> retrieving documents from disk.
> These tests were done on a hot OS cache, which is the worst case for
> compressed fields (one can expect better results for formats that have a high
> compression ratio, since they probably require fewer read operations from
> disk).

--
This message is automatically generated by JIRA.