[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970852#action_12970852 ]

Grant Ingersoll commented on LUCENE-2810:
-----------------------------------------

bq. I think Grant was looking for something that could compress across fields 
of different documents (i.e. where every document represents a log record). 

Yes, this is what I meant.  To the others, please go back and read what I 
wrote before jumping to conclusions.  I'm not talking about compressing a 
particular field or even a particular document; I'm talking about alternate 
storage techniques for large quantities of repeated (or potentially 
near-repeated) documents.  It doesn't even have to be GZIP.  There are plenty 
of use cases for this, and I believe it can be done effectively in Lucene 
without disrupting the APIs.  And it does belong in Lucene, because I don't 
want to have to introduce another storage technique.  It could be something 
as simple as a Directory implementation that handles stored fields 
differently under the hood while keeping all the APIs the same.
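
To make the Directory idea concrete, here is a rough, hypothetical sketch of 
the delegation (written against the abstract methods of the current Directory 
API; the compression hook itself is just a placeholder, not a proposal for 
the actual format):

{code:java}
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

/**
 * Hypothetical delegating Directory: every call forwards to a wrapped
 * Directory, except that stored-fields data files (.fdt) could be routed
 * through an alternate reader/writer.  Only the interception point is shown.
 */
public class CompressingStoredFieldsDirectory extends Directory {
  private final Directory delegate;

  public CompressingStoredFieldsDirectory(Directory delegate) {
    this.delegate = delegate;
  }

  private static boolean isStoredFieldsFile(String name) {
    return name.endsWith(".fdt"); // stored fields data file
  }

  @Override
  public IndexOutput createOutput(String name) throws IOException {
    IndexOutput out = delegate.createOutput(name);
    if (isStoredFieldsFile(name)) {
      // a real implementation would wrap `out` in a compressing IndexOutput
    }
    return out;
  }

  @Override
  public IndexInput openInput(String name) throws IOException {
    IndexInput in = delegate.openInput(name);
    if (isStoredFieldsFile(name)) {
      // ...and wrap `in` in the matching decompressing IndexInput
    }
    return in;
  }

  // Everything else just forwards to the delegate.
  @Override public String[] listAll() throws IOException { return delegate.listAll(); }
  @Override public boolean fileExists(String name) throws IOException { return delegate.fileExists(name); }
  @Override public long fileModified(String name) throws IOException { return delegate.fileModified(name); }
  @Override public void touchFile(String name) throws IOException { delegate.touchFile(name); }
  @Override public void deleteFile(String name) throws IOException { delegate.deleteFile(name); }
  @Override public long fileLength(String name) throws IOException { return delegate.fileLength(name); }
  @Override public void close() throws IOException { delegate.close(); }
}
{code}

The point is that IndexWriter and IndexReader never see anything but the 
plain Directory API.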

bq. For instance how do you deal with partial field loading?

That will depend; I haven't thought about the implementation yet.  The 
simplest approach may be a shared area where all the unique documents live, 
with the per-document storage containing just a file offset pointer to the 
original document.  Sure, it's not the highest compression one could get, but 
it could be pretty good without much effort, and in that case partial field 
loading would work just fine.
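
Roughly like this toy, in-memory sketch of the layout (all names are 
hypothetical, and the StringBuilder stands in for a real shared file):

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the "shared area + offset pointer" layout: each unique document
 * body is appended once to a shared store, and a document's stored-fields
 * entry records only the offset (and length) of its body.
 */
public class SharedDocStoreSketch {
  private final StringBuilder sharedStore = new StringBuilder();
  private final Map<String, Long> offsetByBody = new HashMap<String, Long>();

  /** Returns the offset the per-document entry would record. */
  public long write(String docBody) {
    Long existing = offsetByBody.get(docBody);
    if (existing != null) {
      return existing.longValue(); // duplicate: reuse the copy we already have
    }
    long offset = sharedStore.length();
    sharedStore.append(docBody);
    offsetByBody.put(docBody, Long.valueOf(offset));
    return offset;
  }

  /** Retrieval is a single seek into the shared area. */
  public String read(long offset, int length) {
    return sharedStore.substring((int) offset, (int) offset + length);
  }
}
{code}

Loading a single field would then just mean seeking to the shared copy and 
reading the field out of it, the same as today.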

> Stored Fields Compression
> -------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages with boilerplate, etc.), the stored fields 
> for documents contain a lot of redundant information and end up wasting a 
> lot of space across a large collection of documents.  For instance, simply 
> compressing a typical log file often yields compression rates above 75%.  
> We should explore mechanisms for applying compression across all the 
> documents for a field (or fields) while still maintaining relatively fast 
> lookup (that said, in most logging applications, fast retrieval of a given 
> event is not critical).  For instance, perhaps part of the storage could 
> hold the set of unique values for all the fields, with each document field 
> value containing only a reference (possibly as small as a few bits, 
> depending on the number of unique items) to that value instead of a full 
> copy.  Extending this, perhaps we can also leverage existing compression 
> capabilities in Java.  
> It may make sense to implement this as a Directory, but it might also make 
> sense as a Codec, if and when we have support for changing storage Codecs.
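
To illustrate the unique-values idea from the description above, a 
hypothetical dictionary-encoding sketch (ordinals stored in place of full 
values; with n unique values a reference needs only ceil(log2(n)) bits):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Unique field values live once in a table; documents store an ordinal. */
public class FieldValueDictionary {
  private final List<String> values = new ArrayList<String>();
  private final Map<String, Integer> ords = new HashMap<String, Integer>();

  /** Returns the small ordinal stored in place of the full value. */
  public int add(String value) {
    Integer ord = ords.get(value);
    if (ord == null) {
      ord = Integer.valueOf(values.size());
      values.add(value);
      ords.put(value, ord);
    }
    return ord.intValue();
  }

  public String lookup(int ord) {
    return values.get(ord);
  }

  /** Bits needed per reference for the current number of unique values. */
  public int bitsPerReference() {
    int n = values.size();
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(Math.max(1, n - 1)));
  }
}
{code}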

-- 
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

