[
https://issues.apache.org/jira/browse/BLUR-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480007#comment-13480007
]
Aaron McCurry commented on BLUR-30:
-----------------------------------
So a little background here: Blur used to have a CompressedFieldStoreDirectory
that would compress the data being written to the FDT file (which is used to
store fields). It was a bit of a hack to implement, and in the latest version
of Lucene this hack is no longer necessary. Flexible indexing in Lucene 4
allows us to implement our own Codec for storing all information. This task is
to re-implement the CompressedFieldDirectory as an extension of
AppendingCodec.

In my previous comment I spoke of using a built-in data structure for storing
this information, like a SequenceFile. If we were to use a SequenceFile, we
would need to create an index file for it. Let me explain. Documents are
accessed by document id (an integer counting up from 0) per segment. If we
store the document id as the key and the document as the value of each
key/value pair in the SequenceFile, then we would get the RECORD or BLOCK
compression for free. However, finding a document by id would require a scan
of the file, which would be very expensive. So along with the SequenceFile we
will need a second file for finding the location of each key/value pair in the
SequenceFile, hence the "index" for the SequenceFile.
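
To make the "index" idea concrete, here is a rough sketch of the lookup
pattern. It is only an illustration under assumptions: it uses plain java.io
files rather than a real SequenceFile or a Lucene Codec, and the class and
method names (OffsetIndexSketch, write, read) are made up for this example.
The data file holds length-prefixed records in document id order, and the
index file holds one 8-byte offset per document, so a lookup is a seek to
docId * 8 in the index followed by a seek into the data file.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a data file plus a fixed-width offset index.
public class OffsetIndexSketch {

    /** Writes docs[i] as record i and stores its start offset in the index file. */
    public static void write(Path data, Path index, String[] docs) throws IOException {
        try (DataOutputStream dataOut = new DataOutputStream(Files.newOutputStream(data));
             DataOutputStream indexOut = new DataOutputStream(Files.newOutputStream(index))) {
            long offset = 0; // start position of the next record in the data file
            for (String doc : docs) {
                byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
                indexOut.writeLong(offset);     // entry for this document id
                dataOut.writeInt(bytes.length); // 4-byte length prefix
                dataOut.write(bytes);
                offset += Integer.BYTES + bytes.length;
            }
        }
    }

    /** Reads one document by id without scanning the data file. */
    public static String read(Path data, Path index, int docId) throws IOException {
        long offset;
        try (RandomAccessFile indexIn = new RandomAccessFile(index.toFile(), "r")) {
            // Index entries are fixed width, so entry docId sits at docId * 8.
            indexIn.seek((long) docId * Long.BYTES);
            offset = indexIn.readLong();
        }
        try (RandomAccessFile dataIn = new RandomAccessFile(data.toFile(), "r")) {
            dataIn.seek(offset);
            byte[] bytes = new byte[dataIn.readInt()];
            dataIn.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("fields", ".dat");
        Path index = Files.createTempFile("fields", ".idx");
        write(data, index, new String[] {"doc zero", "doc one", "doc two"});
        System.out.println(read(data, index, 2)); // prints "doc two"
        Files.delete(data);
        Files.delete(index);
    }
}
```

With a SequenceFile the stored offsets would presumably come from the writer's
position rather than a hand-maintained counter, and with BLOCK compression an
offset could only point at the start of a block, so a lookup would still have
to step through the records inside that block.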
> Extend lucene 4 AppendingCodec and add a compression option for the field
> storage.
> ----------------------------------------------------------------------------------
>
> Key: BLUR-30
> URL: https://issues.apache.org/jira/browse/BLUR-30
> Project: Apache Blur
> Issue Type: Improvement
> Reporter: Aaron McCurry
>