[ https://issues.apache.org/jira/browse/LUCENE-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970861#action_12970861 ]
Simon Willnauer commented on LUCENE-2810:
-----------------------------------------

Grant, I understand what you are trying to do. I just question whether what you are proposing really belongs in the core or whether it should be done in a pluggable codec. If you can do it in a Directory implementation, then just go ahead and put it under misc; I don't know what should keep you from doing it. Still, this seems much more like something for a codec, and I think that support is desperately needed. If that were in place, we would not be discussing this for long.

simon

> Stored Fields Compression
> -------------------------
>
>                 Key: LUCENE-2810
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2810
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>
> In some cases (logs, HTML pages w/ boilerplate, etc.), the stored fields for
> documents contain a lot of redundant information and end up wasting a lot of
> space across a large collection of documents. For instance, simply
> compressing a typical log file often results in > 75% compression rates. We
> should explore mechanisms for applying compression across all the documents
> for a field (or fields) while still maintaining relatively fast lookup (that
> being said, in most logging applications, fast retrieval of a given event is
> not always critical.) For instance, perhaps it is possible to have a part of
> storage that contains the set of unique values for all the fields and the
> document field value simply contains a reference (could be as small as a few
> bits depending on the number of uniq. items) to that value instead of having
> a full copy. Extending this, perhaps we can leverage some existing
> compression capabilities in Java to provide this as well.
> It may make sense to implement this as a Directory, but it might also make
> sense as a Codec, if and when we have support for changing storage Codecs.
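
To make the idea in the issue description concrete, here is a minimal in-memory sketch of the "set of unique values plus a small reference per document" scheme. It is only an illustration under stated assumptions: the class DictionaryEncodedField and all of its methods are hypothetical names, not Lucene APIs, and it says nothing about whether this would live in a Directory or a Codec.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not a Lucene class): each distinct stored value is
 * kept once, and every document holds only an ordinal pointing at it.
 */
public class DictionaryEncodedField {
    // value -> ordinal, used only while adding documents
    private final Map<String, Integer> ordinals = new HashMap<>();
    // ordinal -> value, the shared "set of unique values"
    private final List<String> uniqueValues = new ArrayList<>();
    // doc id -> ordinal, the per-document reference
    private final List<Integer> docToOrdinal = new ArrayList<>();

    /** Stores the value for the next document and returns that doc id. */
    public int add(String value) {
        Integer ord = ordinals.get(value);
        if (ord == null) {
            ord = uniqueValues.size();
            ordinals.put(value, ord);
            uniqueValues.add(value);
        }
        docToOrdinal.add(ord);
        return docToOrdinal.size() - 1;
    }

    /** Resolves a document's reference back to the full value. */
    public String get(int docId) {
        return uniqueValues.get(docToOrdinal.get(docId));
    }

    /** Number of distinct values stored. */
    public int uniqueCount() {
        return uniqueValues.size();
    }

    /** Minimum number of bits a reference needs to address every unique value. */
    public int bitsPerReference() {
        int n = uniqueValues.size();
        return n <= 1 ? 1 : 32 - Integer.numberOfLeadingZeros(n - 1);
    }
}
```

With three distinct values (say the log levels INFO, WARN and ERROR) spread over any number of documents, bitsPerReference() reports 2, which is the "few bits" case the description mentions. A real implementation would pack the ordinals into that many bits on disk instead of keeping boxed integers in a list.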