[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231066#comment-14231066 ]

Robert Muir commented on LUCENE-5914:
-------------------------------------

I also want to propose a new way to proceed here. In my opinion this issue 
tries to do a lot at once:

* make changes to the default codec
* support high and low compression options in the default codec with backwards 
compatibility
* provide some easy way to "choose" between supported options without having to 
use FilterCodec
* new lz4 implementation
* new deflate implementation
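
The third bullet, choosing between supported options without a FilterCodec,
amounts to mapping a user-facing mode name onto a concrete format
configuration. A minimal sketch of what such plumbing could look like
(Python for illustration only; the mode names, algorithms, and chunk sizes
below are hypothetical, not Lucene's actual classes):

```python
from dataclasses import dataclass

# Hypothetical mode table, loosely mirroring a "fast" vs "high
# compression" choice. Values are illustrative, not Lucene's.
MODES = {
    "BEST_SPEED":       {"algorithm": "lz4",     "chunk_kb": 16},
    "BEST_COMPRESSION": {"algorithm": "deflate", "chunk_kb": 60},
}

@dataclass
class StoredFieldsConfig:
    algorithm: str
    chunk_kb: int

def stored_fields_config(mode: str) -> StoredFieldsConfig:
    """Resolve a user-facing mode name to a concrete configuration --
    the kind of plumbing a codec-level option would hide behind."""
    try:
        return StoredFieldsConfig(**MODES[mode])
    except KeyError:
        raise ValueError("unknown mode: %r (expected one of %s)"
                         % (mode, sorted(MODES)))

cfg = stored_fields_config("BEST_COMPRESSION")
```

The point of such a lookup is that users pick a name once at index time and
never touch codec classes directly.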

I think it's too risky to do all of this at once. I would prefer we start by 
exposing the current CompressionMode.HIGH_COMPRESSION as the "high 
compression" option. At least for the one test dataset I used above (2 GB of 
highly compressible Apache server logs), it is reasonably competitive with 
the deflate option on this issue:
||impl||size||index time||force merge time||
|trunk_HC|275,262,504|143,264|49,030|

But more importantly, HighCompressingCodec has been baking in our test suite 
for years, with the scary bugs already knocked out of it.
I think we should first figure out the plumbing to expose that; it's 
something we could realistically do for Lucene 5.0 and have confidence in. 
There is still plenty of work to make that option usable: exposing the 
configuration option, addressing concerns about back-compat testing (we 
should generate back-compat indexes both ways), and so on. But at least 
there is a huge head start on testing and code correctness: it's baked.

For each newly proposed format (LZ4 with a shared dictionary, deflate, 
whatever), I think we should proceed individually: add it to the codecs/ 
package first, get it into tests, and let it bake in a similar way. That 
doesn't need to take years, but we should split these concerns.

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, 
> LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have heard 
> from about as many users complaining that compression is too aggressive as 
> complaining that it is too light.
> I think this is because our users do very different things with Lucene. For 
> example, if you have a small index that fits in the filesystem cache (or 
> close to it), then you might never pay for actual disk seeks, and in such a 
> case the fact that the current stored fields format needs to over-decompress 
> data can noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like 
> log analytics, where you have huge amounts of data and don't care much about 
> stored fields performance. There it is very frustrating to notice that your 
> raw data takes several times less space when gzipped than it does in your 
> index, even though Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some kind of option in 
> the default codec that lets users trade speed for compression.
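
Both problems described above -- over-decompression on small cached indexes,
and the speed-vs-ratio trade for log-like data -- can be illustrated with a
toy chunked stored-fields store. This is a sketch, not Lucene code: Python's
zlib stands in for Lucene's LZ4/deflate codecs, and the chunk size and sample
documents are made up for illustration:

```python
import zlib

# Fake "documents": repetitive, log-like records.
DOCS = [("doc%04d: GET /index.html 200 %d" % (i, i % 7)).encode()
        for i in range(1000)]
CHUNK_TARGET = 4096  # pack roughly 4 KB of raw docs per chunk

def build_chunks(docs, level):
    """Pack docs into chunks and compress each chunk as one unit.
    Returns (compressed_chunks, per-chunk (lo, hi) doc-id ranges)."""
    chunks, offsets = [], []
    buf, size, start = [], 0, 0
    for i, d in enumerate(docs):
        buf.append(d)
        size += len(d)
        if size >= CHUNK_TARGET:
            chunks.append(zlib.compress(b"\n".join(buf), level))
            offsets.append((start, i + 1))
            buf, size, start = [], 0, i + 1
    if buf:
        chunks.append(zlib.compress(b"\n".join(buf), level))
        offsets.append((start, len(docs)))
    return chunks, offsets

def fetch(chunks, offsets, doc_id):
    """Retrieve one doc: the whole containing chunk must be
    decompressed first -- this is the 'over-decompression' cost."""
    for chunk, (lo, hi) in zip(chunks, offsets):
        if lo <= doc_id < hi:
            return zlib.decompress(chunk).split(b"\n")[doc_id - lo]
    raise IndexError(doc_id)

fast_chunks, offs = build_chunks(DOCS, level=1)  # speed-oriented
high_chunks, _ = build_chunks(DOCS, level=9)     # ratio-oriented
fast_size = sum(len(c) for c in fast_chunks)
high_size = sum(len(c) for c in high_chunks)
print("raw=%d fast(level 1)=%d high(level 9)=%d"
      % (sum(len(d) for d in DOCS), fast_size, high_size))
```

The chunking is identical at both levels; only the per-chunk compression
effort differs, which is the shape of the option being proposed here.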



