[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119977#comment-14119977
 ] 

Eks Dev commented on LUCENE-5914:
---------------------------------

lovely, thanks for explaining, I expected something like this but was not 100% 
sure without looking into the code. 
Simply put, I see absolutely nothing one might wish for from general, OOTB 
compression support... 

In theory...
The only meaningful enhancements beyond the standard approach come from 
modelling the semantics of the data (the user must know quite a bit about its 
distribution) to improve compression/speed => but this cannot be provided by 
the core (Lucene is rightly content-agnostic); at most the core APIs might 
make it more or less comfortable, but IMO nothing more. 

For example (contrived, as LZ4 would handle it quite well; just to illustrate): 
if I know that my field contains at most 5 distinct string values, I might add 
simple dictionary coding to use at most one byte per value without even going 
down to the codec level. The only place where I see a theoretical need to go 
down and dirty is if I wanted to reach sub-byte representations (3 bits per 
value in this example), but that is rarely needed, hard to beat the default 
LZ4/deflate with, and even harder not to make slow. At the end of the day, 
someone who needs this type of specialisation should be able to write his own 
per-field codec.
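That application-level dictionary-coding idea could be sketched roughly like 
this (a hypothetical helper class, not a Lucene API; it assumes the small 
value set is known up front):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical application-level dictionary coder: when a field has at most
// a handful of distinct string values, each value can be mapped to a single
// byte before the stored field ever reaches the codec.
public class FieldDictionary {
    private final Map<String, Byte> toCode = new HashMap<>();
    private final String[] fromCode;

    public FieldDictionary(String... values) {
        if (values.length > 256) {
            throw new IllegalArgumentException("more than 256 distinct values");
        }
        fromCode = values.clone();
        for (int i = 0; i < values.length; i++) {
            toCode.put(values[i], (byte) i);
        }
    }

    // Encode a field value as its one-byte dictionary code.
    public byte encode(String value) {
        Byte code = toCode.get(value);
        if (code == null) {
            throw new IllegalArgumentException("unknown value: " + value);
        }
        return code;
    }

    // Decode a one-byte code back to the original string.
    public String decode(byte code) {
        return fromCode[code & 0xFF];
    }
}
```

With 5 values this spends one byte per value instead of the full string; 
packing further down to 3 bits per value is exactly the sub-byte step that 
would force you into a custom per-field codec.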

Great work, and thanks again!

     

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 4.11
>
>         Attachments: LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have had 
> about as many users complain that compression was too aggressive as complain 
> that it was too light.
> I think this is because users do very different things with Lucene. For 
> example, if you have a small index that fits in the filesystem cache (or 
> close to it), then you might never pay for actual disk seeks, and in that 
> case the fact that the current stored fields format needs to over-decompress 
> data can noticeably slow down search on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like 
> log analytics, where you have huge amounts of data and don't care much about 
> stored fields performance. However, it is very frustrating to notice that 
> your data takes several times less space when you gzip it than it does in 
> your index, even though Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some kind of options that 
> would allow trading speed for compression in the default codec.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
