[jira] [Commented] (LUCENE-6100) Further tuning of Lucene50Codec(BEST_COMPRESSION)

Ryan Ernst (JIRA) Mon, 08 Dec 2014 13:49:29 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238509#comment-14238509 ]


Ryan Ernst commented on LUCENE-6100:
------------------------------------

+1

> Further tuning of Lucene50Codec(BEST_COMPRESSION)
> -------------------------------------------------
>
>                 Key: LUCENE-6100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6100
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6100.patch
>
>
> Currently this codec has two options: BEST_SPEED and BEST_COMPRESSION. But in 
> the case of highly compressible data, the ratio for BEST_COMPRESSION is not 
> much better than BEST_SPEED, because they share the same underlying format, 
> which is not optimized for this case.
> The block size is currently 24576 (32KB sliding window size minus 8KB "grace" 
> to avoid going over it). We compress in a stateless manner: each block is its 
> own stream, and blocks don't share a preset dictionary or anything else. So 
> we have a lot of waste in many cases, since zlib has to start from scratch, 
> and then we generally throw away 1/4 of the window and start over.
> I ran some experiments with highly compressible logs data:
> ||method||time indexing(ms)||time merging(ms)||fdt||fdx||
> |BEST_SPEED|101,729|15,638|372,845,282|406,964|
> |BEST_COMPRESSION|114,364|23,474|269,387,347|275,909|
> |patch (60KB)|105,533|18,914|237,284,342|117,639|
> The other experiments I ran were:
> ||method||time indexing(ms)||time merging(ms)||fdt||fdx||
> |crappy preset|130,854|38,095|234,603,971|274,500|
> |64KB|107,256|21,570|236,004,297|111,135|
> |crappy preset+64KB|121,503|30,030|222,422,924|110,751|
> For 'crappy preset' I just use the arbitrary first 32KB of the original data 
> as a preset dictionary for every block. This is effective, but slow because 
> of some unnecessary overhead (like computing adler32 of the preset dict over 
> and over, once for each block). However, this overhead is reduced with 
> larger block sizes, and it still offers benefits, so maybe in the future we 
> can do it (especially, e.g., if it's per-chunk and we can bulk merge chunks 
> without recompressing, etc.).
> For 64KB, we measure removing the "grace" completely so it spills to another 
> block each time. The proposed smaller "grace" amount still offers CPU 
> savings, so I think we should keep it. But it's not terrible if you go over.
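
For readers who want to try the two ideas from the description (stateless per-block streams versus a shared preset dictionary), here is a minimal sketch using only java.util.zip. It is not the code from LUCENE-6100.patch or anything in Lucene: the class name, the helper methods, and the synthetic log lines are all made up for illustration. Only the 24576 block size and the "first 32KB as dictionary" idea are taken from the description above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; not part of Lucene.
final class BlockCompressionSketch {
    static final int BLOCK_SIZE = 24576;      // block size quoted in the issue
    static final int DICT_SIZE = 32 * 1024;   // "first 32KB" used as the preset dictionary

    /** Compresses one block as an independent zlib stream; dict may be null. */
    static byte[] compressBlock(byte[] data, int off, int len, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null) {
                deflater.setDictionary(dict); // set before any input is compressed
            }
            deflater.setInput(data, off, len);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Inflates one block, supplying dict when the stream asks for it. */
    static byte[] decompressBlock(byte[] compressed, int originalLength, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(out, off, originalLength - off);
                if (n == 0 && !inflater.finished()) {
                    if (inflater.needsDictionary() && dict != null) {
                        inflater.setDictionary(dict);
                    } else {
                        throw new IllegalStateException("stream is truncated or needs a dictionary");
                    }
                }
                off += n;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    /** Sums compressed sizes with every block as its own stream, checking each round trip. */
    static long totalCompressed(byte[] data, byte[] dict) {
        long total = 0;
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            byte[] compressed = compressBlock(data, off, len, dict);
            byte[] restored = decompressBlock(compressed, len, dict);
            if (!Arrays.equals(restored, Arrays.copyOfRange(data, off, off + len))) {
                throw new IllegalStateException("round trip failed at offset " + off);
            }
            total += compressed.length;
        }
        return total;
    }

    /** Builds synthetic, highly repetitive log lines (made-up data, not the logs from the issue). */
    static byte[] sampleLogs(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2014-12-08 13:49:").append(String.format("%02d", i % 60))
              .append(" INFO  request id=").append(i)
              .append(" status=200 path=/index/doc/").append(i * 31 % 1000).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLogs(20000);
        byte[] dict = Arrays.copyOf(data, Math.min(DICT_SIZE, data.length));
        System.out.println("raw bytes:        " + data.length);
        System.out.println("stateless blocks: " + totalCompressed(data, null));
        System.out.println("with preset dict: " + totalCompressed(data, dict));
    }
}
```

Running main prints the raw size and the two compressed totals. On repetitive input like this, the preset-dictionary total would be expected to come out smaller, but the numbers say nothing about real indexes, and the sketch makes no attempt to measure the per-block dictionary overhead mentioned in the description.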



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

