Stored Fields Compression
-------------------------

                 Key: LUCENE-2810
                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Store
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll


In some cases (logs, HTML pages with boilerplate, etc.), the stored fields for 
documents contain a lot of redundant information and waste a lot of space 
across a large collection of documents.  For instance, simply compressing a 
typical log file often shrinks it by more than 75%.  We should explore 
mechanisms for applying compression across all the documents for a field (or 
fields) while still maintaining relatively fast lookup.  (That said, in most 
logging applications, fast retrieval of a given event is not always critical.)  
For instance, perhaps one part of storage could hold the set of unique values 
for all the fields, and each document's field value would simply hold a 
reference to that value (possibly as small as a few bits, depending on the 
number of unique items) instead of a full copy.  Extending this, perhaps we can 
also leverage some existing compression capabilities in Java.
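
To make the two ideas above concrete, here is a minimal sketch in plain Java. 
It is not Lucene code and does not touch any Lucene API: StoredFieldSketch, 
ValueDictionary and their methods are hypothetical names invented for 
illustration.  It assumes field values are strings, keeps the dictionary in 
memory, and uses java.util.zip as one example of the "existing compression 
capabilities in Java".

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class StoredFieldSketch {

    /** Stores each distinct field value once and hands out small integer references. */
    static final class ValueDictionary {
        private final Map<String, Integer> ordinals = new HashMap<>();
        private final List<String> values = new ArrayList<>();

        /** Returns the reference for value, assigning a new one the first time it is seen. */
        int add(String value) {
            Integer ord = ordinals.get(value);
            if (ord == null) {
                ord = values.size();
                ordinals.put(value, ord);
                values.add(value);
            }
            return ord;
        }

        /** Resolves a reference back to the full value. */
        String lookup(int ordinal) {
            return values.get(ordinal);
        }

        /** Bits needed per reference given the current number of unique values. */
        int bitsPerReference() {
            int n = values.size();
            return n <= 2 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
        }
    }

    /** Compresses a block of stored-field bytes with DEFLATE. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses deflate(); throws IllegalStateException on corrupt or truncated input. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With a dictionary, a field that takes only a handful of distinct values (a log 
level, say) costs a few bits per document plus one copy of each value.  The 
DEFLATE path trades that simplicity for generality: it handles values that are 
similar but not identical, at the price of decompressing a whole block to read 
one document.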

It may make sense to implement this as a Directory, but it might also make 
sense as a Codec, if and when we have support for changing storage Codecs.


