[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682352#comment-13682352 ]
Michael McCandless commented on LUCENE-5029:
--------------------------------------------
Good lord, I see you had to uncomment BlockTree's DEBUG prints: I'm
sorry! It's a good thing your hair situation is OK.
So the overall idea here is to break out a separate class
(TermMetaData) that the PostingsBaseFormat uses to hold its "private
stuff" about a term, and to "require" that only a monotonic long[] and
an "arbitrary" byte[] are available to the PBF (this is so that FST
can more cleanly encode per-term PBF metadata).
The "generic" stuff "belonging" to the terms dict (docFreq,
totalTermFreq) remains in the BlockTermState.
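For anyone following along, here is a rough sketch of that split; the class shapes and field names are illustrative only, not the actual patch:

```java
// Hedged sketch of the proposed split. TermMetaData holds the
// postings-base-format-private data (a monotonic long[] plus an opaque
// byte[]); the generic per-term stats stay on the terms dict's state.
class TermMetaData {
    final long[] longs;  // monotonic per-term values, e.g. file pointers
    final byte[] bytes;  // arbitrary format-private payload

    TermMetaData(long[] longs, byte[] bytes) {
        this.longs = longs;
        this.bytes = bytes;
    }
}

class BlockTermState {
    int docFreq;            // generic: belongs to the terms dict
    long totalTermFreq;     // generic: belongs to the terms dict
    TermMetaData metaData;  // private: belongs to the PBF
}
```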
Seeing how much overhead this added, with a second object per term, and
with a double-encode process at write time in some cases (first
writing into the byte[], then reading from the byte[] and writing into
the "final" format), I'm a little nervous about this approach.
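To make the double-encode concern concrete, something like this toy round trip (purely illustrative names, not the patch's classes):

```java
import java.io.*;

// Hedged sketch of the double-encode path: the PBF first serializes a
// term's metadata into an intermediate byte[], and the terms dict later
// re-reads that byte[] just to produce the final on-disk encoding.
class DoubleEncode {
    // pass 1: write the metadata into a throwaway byte[]
    static byte[] encodeIntermediate(long docStartFP) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeLong(docStartFP);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // pass 2: decode the byte[] again so it can be re-encoded "for real"
    static long reencodeFinal(byte[] intermediate) {
        try {
            return new DataInputStream(
                new ByteArrayInputStream(intermediate)).readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The value survives both passes, but every term pays for the extra allocation and the redundant decode.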
It's also not great that the PostingsBaseFormat is still "responsible"
for the block encoding (holding pendingTerms, implementing
read/flushTermsBlock and nextTerm): this is really an impl detail of
the terms dict, and I had thought we could move it into the terms dict
so we could simplify PBFs to be agnostic to whether the terms dict is
block-based or not.
Also, at write time, it'd be nice if the PBF got a DataOutput +
LongsRef from the terms dict which it could write the term data to,
and a "matching" DataInput + LongsRef at read time. I'm not sure
these even need to reside in BlockTermState?
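Roughly, I'm picturing an API shaped like this (interface and class names here are hypothetical, and I'm using a plain long[] where the real code would use LongsRef):

```java
import java.io.*;

// Hedged sketch of the suggested contract: the terms dict hands the PBF
// a sink at write time and a matching source at read time, instead of
// the PBF managing pendingTerms / flushTermsBlock itself.
interface TermMetaWriter {
    // Fill the monotonic longs slots and append opaque bytes to out.
    void writeTermMeta(long[] longs, DataOutput out) throws IOException;
}

interface TermMetaReader {
    // Decode exactly what writeTermMeta produced.
    void readTermMeta(long[] longs, DataInput in) throws IOException;
}

// Toy format: one monotonic file pointer in longs[0], one flag byte.
class ToyFormat implements TermMetaWriter, TermMetaReader {
    long docStartFP;
    boolean hasPayloads;

    public void writeTermMeta(long[] longs, DataOutput out) throws IOException {
        longs[0] = docStartFP;                        // monotonic slot
        out.writeByte((byte) (hasPayloads ? 1 : 0));  // opaque bytes
    }

    public void readTermMeta(long[] longs, DataInput in) throws IOException {
        docStartFP = longs[0];
        hasPayloads = in.readByte() == 1;
    }
}
```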
A few specific questions:
* Does something fail with a clear exception if a PBF tries to have a
long[n] go backwards?
* How does a "don't care" long[n] value work?
* A few sources are missing the copyright header...
* Instead of "base" and "extend" maybe we should just name them "longs"
and "bytes"?
* The "bytes" can be of arbitrary length per term, right? So why does
  the TermMetaData ctor take a length for it?
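On the first question, the kind of check I'd hope for is something like this (entirely illustrative, not code from the patch): each long[n] must be >= its value for the previous term, and a violation should fail fast with a clear exception rather than silently corrupting the FST outputs.

```java
// Hedged sketch of a monotonicity check across consecutive terms.
class MonotonicCheck {
    private long[] last;

    void check(long[] current) {
        if (last != null) {
            for (int i = 0; i < current.length; i++) {
                if (current[i] < last[i]) {
                    throw new IllegalStateException(
                        "long[" + i + "] went backwards: "
                        + current[i] + " < " + last[i]);
                }
            }
        }
        last = current.clone();  // remember the previous term's values
    }
}
```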
Finally, I think the patch is a little too ambitious: for starters,
while we are iterating on how to improve the API between terms dict
and PBF, it's too costly to try to get all postings formats cutover.
Instead, I think we should fork BlockTree to a new terms dict impl,
and fork e.g. Lucene41PB/F, and iterate on that single PostingsFormat?
This should be much less code to change as we iterate on the approach?
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
> Key: LUCENE-5029
> URL: https://issues.apache.org/jira/browse/LUCENE-5029
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Han Jiang
> Assignee: Han Jiang
> Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use
> FST<BytesRef> as the base data structure. This may not share much data in
> parent arcs, since the encoded BytesRef doesn't guarantee that
> 'Outputs.common()' always creates a long prefix.
> For the current postings formats, however, it is guaranteed that each FP
> (pointing to .doc, .pos, etc.) increases monotonically with 'larger' terms.
> That means, between two Outputs, the Output from the smaller term can be
> safely pushed toward the root. However, we always have some tricky TermState
> to deal with (like the singletonDocID for the pulsing trick), so as Mike
> suggested, we can simply cut the whole TermState into two parts: one part for
> comparison and intersection, another for restoring generic data. Then the
> data structure will be clear: this generic 'TermState' will consist of a
> fixed-length LongsRef and a variable-length BytesRef.
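The sharing argument in the description above can be sketched like this: with monotonic per-term values, the output two arcs share is just the element-wise minimum, which is exactly the smaller term's entire value and can therefore be pushed toward the root (a byte[]-encoded output only shares whatever literal byte prefix the two encodings happen to have). A minimal, assumption-laden sketch of such an 'Outputs.common()':

```java
// Hedged sketch: common() for monotonic long[] outputs is the
// element-wise minimum, so the smaller term's whole output becomes
// the shared prefix. Illustrative only; not Lucene's Outputs class.
class MonotonicOutputs {
    static long[] common(long[] a, long[] b) {
        long[] shared = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            // with monotonic FPs this picks the smaller term's value
            shared[i] = Math.min(a[i], b[i]);
        }
        return shared;
    }
}
```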
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira