[ https://issues.apache.org/jira/browse/LUCENE-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238509#comment-14238509 ]
Ryan Ernst commented on LUCENE-6100:
------------------------------------

+1

> Further tuning of Lucene50Codec(BEST_COMPRESSION)
> -------------------------------------------------
>
>                 Key: LUCENE-6100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6100
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6100.patch
>
>
> Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But in
> the case of highly compressible data, the ratio for BEST_COMPRESSION is not
> much better than BEST_SPEED, because they share the same underlying format,
> which is not optimized for this case.
>
> The block size is currently 24576 (32KB sliding window size minus 8KB "grace"
> to avoid going over it). And we compress this in a stateless manner: each
> block is its own stream, and they don't share a preset dictionary or anything.
> So we have a lot of waste in many cases, since zlib has to reboot itself, then
> we generally throw away 1/4 of the window and start over.
>
> I ran some experiments with highly compressible logs data:
>
> ||method||time indexing(ms)||time merging(ms)||fdt||fdx||
> |BEST_SPEED|101,729|15,638|372,845,282|406,964|
> |BEST_COMPRESSION|114,364|23,474|269,387,347|275.909|
> |patch (60KB)|105,533|18,914|237,284,342|117,639|
>
> The other experiments I ran were:
>
> ||method||time indexing(ms)||time merging(ms)||fdt||fdx||
> |crappy preset|130,854|38,095|234,603,971|274,500|
> |64KB|107,256|21,570|236,004,297|111,135|
> |crappy preset+64KB|121,503|30,030|222,422,924|110,751|
>
> For 'crappy preset' I just use the arbitrary first 32KB of the original data
> as a preset dictionary for every block. This is effective, but slow because of
> some unnecessary overhead involved (like computing adler32 of the preset dict
> over and over for each block). However, this overhead is reduced with larger
> block sizes, and it still offers benefits, so maybe in the future we can do it
> (especially e.g. if it's per-chunk and we can bulk merge chunks without
> recompressing, etc).
>
> For 64KB, we measure removing the "grace" completely so it spills to another
> block each time. The proposed smaller "grace" amount still offers CPU
> savings, so I think we should keep it. But it's not terrible if you go over.
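For readers who want to try the "crappy preset" idea from the quoted description outside of Lucene, here is a rough sketch using plain java.util.zip. It is not the code from LUCENE-6100.patch. The class name PresetDictSketch, the helper names, and the synthetic log lines are all made up for illustration. It compresses each 24576-byte block as its own zlib stream, once without a dictionary and once primed with the first 32KB of the data via Deflater.setDictionary, and prints both totals so they can be compared.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; not part of Lucene or the attached patch.
class PresetDictSketch {
    // Figures quoted in the issue: 32KB zlib window minus 8KB "grace".
    static final int WINDOW = 32 * 1024;
    static final int BLOCK_SIZE = WINDOW - 8 * 1024; // 24576

    /** Compresses one block as an independent zlib stream; dict may be null. */
    static byte[] compress(byte[] data, int off, int len, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null) {
                deflater.setDictionary(dict); // must be set before any input is deflated
            }
            deflater.setInput(data, off, len);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Inverse of compress; supplies dict when the stream asks for one. */
    static byte[] decompress(byte[] compressed, int originalLength, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < out.length) {
                int n = inflater.inflate(out, off, out.length - off);
                if (n > 0) {
                    off += n;
                } else if (inflater.needsDictionary() && dict != null) {
                    inflater.setDictionary(dict);
                } else {
                    throw new IllegalStateException("truncated or unexpected stream");
                }
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    /** Sums the compressed size of all blocks, each compressed statelessly. */
    static long totalCompressed(byte[] data, byte[] dict) {
        long total = 0;
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            total += compress(data, off, len, dict).length;
        }
        return total;
    }

    /** Builds deterministic, repetitive log-like text (made-up format). */
    static byte[] sampleLogs(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append(String.format(
                "2014-12-08 12:%02d:%02d INFO request id=%d status=200 path=/index/segment_%d%n",
                (i / 60) % 60, i % 60, i, i % 17));
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLogs(20000);
        // "Crappy preset": the first 32KB of the data, reused for every block.
        byte[] dict = Arrays.copyOf(data, Math.min(WINDOW, data.length));
        System.out.println("raw bytes:          " + data.length);
        System.out.println("no dictionary:      " + totalCompressed(data, null));
        System.out.println("preset dictionary:  " + totalCompressed(data, dict));
    }
}
```

The numbers this prints say nothing about stored fields in a real index. The sketch only shows where the per-block dictionary setup happens, which is the repeated overhead the description mentions.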