[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970870#action_12970870 ]
Robert Muir commented on LUCENE-2810:
-------------------------------------

bq. Where in my email did I say that users had to use it?

We didn't force users to use the old compression either. But there are even emails on the user lists of someone asking "where did compressed fields go?"; we explained the reasons why, and sure enough they reported back that it had only made their data larger and slower.

So I'm not sure we should add something so app-dependent to Lucene's core, since it depends very heavily on the content you are indexing. If people see compression in the core APIs, they are going to assume it works well in the general-purpose case, but I'm trying to say that's very tricky to do.

A trivial example:
case 1: perhaps your documents have many fields that are all redundant with each other.
case 2: very different, your documents have only one field that is heavily redundant while the rest are not, e.g. nearly unique metadata.

For these two use cases you need to implement the 'compression'/layout completely differently, or you only introduce waste: with many fields and the wrong block size you just make things bigger, and it acts like Compression 1.0 all over again.

> Explore Alternate Stored Field approaches for highly redundant data
> -------------------------------------------------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for
> documents contain a lot of redundant information and end up wasting a lot of
> space across a large collection of documents. For instance, simply
> compressing a typical log file often results in > 75% compression rates.
> We should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that
> being said, in most logging applications, fast retrieval of a given event is
> not always critical). For instance, perhaps it is possible to have a part of
> storage that contains the set of unique values for all the fields, where the
> document field value simply contains a reference (could be as small as a few
> bits, depending on the number of unique items) to that value instead of having
> a full copy. Extending this, perhaps we can leverage some existing
> compression capabilities in Java to provide this as well.
>
> It may make sense to implement this as a Directory, but it might also make
> sense as a Codec, if and when we have support for changing storage Codecs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
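The dictionary-of-unique-values idea from the issue description can be sketched roughly as follows. This is plain Java, not actual Lucene code; the class and method names (`DictEncodedField`, `addDocument`, `getValue`) are hypothetical, and a real implementation would pack the references into a few bits on disk rather than boxing them in a `List<Integer>`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dictionary-encoding a stored field: each document
// stores only a small integer reference into a shared table of unique values,
// so highly redundant values (e.g. a hostname field in log events) are
// stored once instead of once per document.
public class DictEncodedField {
    private final Map<String, Integer> valueToId = new HashMap<>();
    private final List<String> idToValue = new ArrayList<>();
    private final List<Integer> docRefs = new ArrayList<>();

    // Store a field value for the next document; returns the doc's ordinal.
    public int addDocument(String value) {
        Integer id = valueToId.get(value);
        if (id == null) {
            // First time we see this value: assign it the next dictionary id.
            id = idToValue.size();
            valueToId.put(value, id);
            idToValue.add(value);
        }
        docRefs.add(id);
        return docRefs.size() - 1;
    }

    // Retrieve the stored value for a document via its dictionary reference.
    public String getValue(int doc) {
        return idToValue.get(docRefs.get(doc));
    }

    // Number of distinct values actually stored.
    public int uniqueValues() {
        return idToValue.size();
    }

    public static void main(String[] args) {
        DictEncodedField host = new DictEncodedField();
        host.addDocument("web-01.example.com");
        host.addDocument("web-02.example.com");
        host.addDocument("web-01.example.com"); // duplicate: reuses id 0
        System.out.println(host.getValue(2));    // prints web-01.example.com
        System.out.println(host.uniqueValues()); // prints 2
    }
}
```

This also illustrates Robert's point about content dependence: the table pays off only when the number of unique values is small relative to the document count; for a nearly unique metadata field, the dictionary grows as large as the data plus the per-document references, i.e. pure overhead.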