[ 
https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970876#action_12970876
 ] 

Grant Ingersoll commented on LUCENE-2810:
-----------------------------------------

bq. If people see compression in the core APIs

Like I said, it can go in contrib.

Again, you seem to be hung up on the word compression, so let's stop using it.  
I'm not necessarily talking about compression here, OK?  Compression is one 
example of an alternate storage technique, but it isn't the only way to solve 
this problem and, as you point out, it may not always be the best thing to do.  
Having said that, I've seen enough applications from a very wide set of users 
over the years to see many use cases where going beyond our simple storage 
mechanisms would be useful.  Giving users alternate tools for storage is a good 
thing, especially since retrieving stored fields is almost always one of the 
biggest performance killers in real-world applications.

> Explore Alternate Stored Field approaches for highly redundant data
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for 
> documents contain a lot of redundant information and end up wasting a lot of 
> space across a large collection of documents.  For instance, simply 
> compressing a typical log file often yields compression rates of more than 
> 75%.  We should explore mechanisms for applying compression across all the 
> documents for a field (or fields) while still maintaining relatively fast 
> lookup (that said, in most logging applications, fast retrieval of a given 
> event is not always critical).  For instance, perhaps it is possible to have 
> a part of storage that contains the set of unique values for all the fields, 
> with the document field value simply containing a reference (could be as 
> small as a few bits, depending on the number of unique items) to that value 
> instead of a full copy.  Extending this, perhaps we can leverage some 
> existing compression capabilities in Java to provide this as well.
> It may make sense to implement this as a Directory, but it might also make 
> sense as a Codec, if and when we have support for changing storage Codecs.
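
The unique-value idea in the description above can be illustrated with a 
small in-memory sketch.  This is only an assumption about how such a scheme 
might look, not Lucene code: the class DedupFieldStore and its methods are 
hypothetical names, everything lives on the heap rather than in a Directory or 
Codec, and references are kept as plain ints instead of being bit-packed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a deduplicating stored-field table; not a Lucene API. */
class DedupFieldStore {
    // Maps each distinct field value to its position in the unique-value table.
    private final Map<String, Integer> idByValue = new HashMap<>();
    // The table of unique values, stored once each.
    private final List<String> values = new ArrayList<>();
    // One reference per document, pointing into the unique-value table.
    private final List<Integer> refByDoc = new ArrayList<>();

    /** Stores the field value for the next document and returns its doc number. */
    int add(String value) {
        Integer id = idByValue.get(value);
        if (id == null) {
            id = values.size();
            values.add(value);
            idByValue.put(value, id);
        }
        refByDoc.add(id);
        return refByDoc.size() - 1;
    }

    /** Resolves a document's reference back to the full field value. */
    String get(int doc) {
        return values.get(refByDoc.get(doc));
    }

    /** Number of distinct values actually stored. */
    int uniqueCount() {
        return values.size();
    }

    /**
     * Bits each reference would need if it were bit-packed.
     * This sketch does not pack; it only reports the width.
     */
    int bitsPerRef() {
        int n = values.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

Adding the same log line for several documents stores the string once and 
only grows the per-document reference list; with two distinct values, 
bitsPerRef() reports 1.  A real implementation would also have to handle 
persistence, deletes and segment merges, none of which this sketch attempts.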
