[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684148#comment-13684148 ]
Michael McCandless commented on LUCENE-5029:
--------------------------------------------
Thanks Han!
I think the changes to TermState, and the separate TermsMetaData with its
FST-like algebra, can still be simplified further.
In fact, I think we need to "isolate" the long[]/byte[] "contract"
to the reading and writing, i.e. I think *TermState shouldn't have to
change at all.
Specifically, I think we should first do the amoeba step of moving
flushTermsBlock and readTermsBlock out of the PostingsBaseFormat and
into the terms dict:
* First fork off the TempPostingsFormat (TempBlockTree,
TempPostingsBaseFormat, TempLucene41PF, etc.) and commit that to
a new branch so we can iterate.
* Change TempPostingsWriterBase.finishTerm to receive a long[] (not
a LongsRef I think?) and a DataOutput.
* Change TempPostingsReaderBase.nextTerm to receive a long[] (not a
LongsRef I think?) and a DataInput.
* Using those new APIs, move flushTermsBlock (along with the
buffering of PendingTerms) into TempBlockTreeWriter and out of the
postings base format. And similarly with readTermsBlock.
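
To make the proposed split concrete, here is a rough sketch of what the
finishTerm/nextTerm pair might look like. It is only an illustration under
stated assumptions: SketchTermState and TermMetaSketch are hypothetical names,
java.io.DataOutput/DataInput stand in for Lucene's own store classes so the
example compiles without Lucene, and the real methods would likely take more
parameters. The point is that the term state stays a plain object and the
long[]/byte[] shapes only appear at the serialization boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class TermMetaSketch {
    /** Hypothetical stand-in for a postings-specific term state. */
    static final class SketchTermState {
        long docStartFP;          // monotonic file pointer (e.g. into .doc)
        long posStartFP;          // monotonic file pointer (e.g. into .pos)
        int singletonDocID = -1;  // non-monotonic extra; -1 means "not set"
    }

    /** Writer side: monotonic values go into longs, everything else into out. */
    static void finishTerm(SketchTermState state, long[] longs, DataOutput out)
            throws IOException {
        longs[0] = state.docStartFP;
        longs[1] = state.posStartFP;
        out.writeInt(state.singletonDocID);
    }

    /** Reader side: rebuilds the state from the same long[] and byte stream. */
    static void nextTerm(SketchTermState state, long[] longs, DataInput in)
            throws IOException {
        state.docStartFP = longs[0];
        state.posStartFP = longs[1];
        state.singletonDocID = in.readInt();
    }

    /** Simulates the terms dict holding the long[] and bytes between write and read. */
    static SketchTermState roundTrip(SketchTermState original) throws IOException {
        long[] longs = new long[2];
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        finishTerm(original, longs, new DataOutputStream(bytes));

        SketchTermState restored = new SketchTermState();
        DataInput in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        nextTerm(restored, longs, in);
        return restored;
    }

    public static void main(String[] args) throws IOException {
        SketchTermState state = new SketchTermState();
        state.docStartFP = 128;
        state.posStartFP = 4096;
        state.singletonDocID = 7;
        SketchTermState restored = roundTrip(state);
        System.out.println(restored.docStartFP + " " + restored.posStartFP
                + " " + restored.singletonDocID);
    }
}
```

In this sketch the terms dict is free to buffer, delta-code, or block-encode
the long[] and the bytes however it likes, since the postings side never sees
how they are stored.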
This way, the *TermState can (I think?) remain unchanged, and the
long[]/byte[] is limited entirely to serialization.
For "don't care" values in the long[], I think the contract can simply
be that the writer should not change whatever value was already in the
incoming array?
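
One possible reading of that contract, sketched below with hypothetical
names (DontCareSketch, encodeNext): if the incoming array already holds the
previous term's values and the terms dict delta-codes each slot against the
previous term (the delta coding is an assumption here, not something settled
above), then a slot the writer leaves untouched simply encodes as a zero delta.

```java
final class DontCareSketch {
    /** Hypothetical writer that only tracks a doc file pointer; slot 1 is "don't care". */
    static void finishTerm(long docStartFP, long[] longs) {
        longs[0] = docStartFP;
        // longs[1] is deliberately left as it was in the incoming array.
    }

    /** Assumed terms-dict step: per-slot delta against the previous term's values. */
    static long[] deltas(long[] previous, long[] current) {
        long[] result = new long[current.length];
        for (int i = 0; i < current.length; i++) {
            result[i] = current[i] - previous[i];
        }
        return result;
    }

    /** Passes the writer an array pre-filled with the previous term's values. */
    static long[] encodeNext(long[] previous, long docStartFP) {
        long[] current = previous.clone();
        finishTerm(docStartFP, current);
        return deltas(previous, current);
    }

    public static void main(String[] args) {
        long[] encoded = encodeNext(new long[] {10, 7}, 25);
        System.out.println(encoded[0] + " " + encoded[1]);
    }
}
```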
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
> Key: LUCENE-5029
> URL: https://issues.apache.org/jira/browse/LUCENE-5029
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Han Jiang
> Assignee: Han Jiang
> Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch,
> LUCENE-5029.branch-init.patch, LUCENE-5029.patch, LUCENE-5029.patch,
> LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use
> FST<BytesRef> as the base data structure. This might not share much data in
> parent arcs, since the encoded BytesRef doesn't guarantee that
> 'Outputs.common()' always creates a long prefix.
> For the current postings format, however, it is guaranteed that each FP
> (pointing to .doc, .pos, etc.) will increase monotonically with 'larger'
> terms. That means that, between two Outputs, the Outputs from the smaller
> term can be safely pushed towards the root. However, we always have some
> tricky TermState to deal with (like the singletonDocID for the pulsing
> trick), so as Mike suggested, we can simply cut the whole TermState into two
> parts: one part for comparison and intersection, another for restoring
> generic data. Then the data structure will be clear: this generic
> 'TermState' will consist of a fixed-length LongsRef and a variable-length
> BytesRef.