[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682352#comment-13682352 ]
Michael McCandless commented on LUCENE-5029:
--------------------------------------------
Good lord, I see you had to uncomment BlockTree's DEBUG prints: I'm
sorry! It's a good thing your hair situation is OK.
So the overall idea here is to break out a separate class
(TermMetaData) that the PostingsBaseFormat uses to hold its "private
stuff" about a term, and to "require" that only a monotonic long[] and
an "arbitrary" byte[] are available to the PBF (this is so that FST
can more cleanly encode per-term PBF metadata).
The "generic" stuff "belonging" to the terms dict (docFreq,
totalTermFreq) remains in the BlockTermState.
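For anyone following along, here is a rough sketch of that split; the class shapes and field names are illustrative only, not the actual patch:

```java
// Hedged sketch of the proposed split. TermMetaData holds the
// postings-base-format-private data (a monotonic long[] plus an opaque
// byte[]); the generic per-term stats stay on the terms dict's state.
class TermMetaData {
    final long[] longs;  // monotonic per-term values, e.g. file pointers
    final byte[] bytes;  // arbitrary format-private payload

    TermMetaData(long[] longs, byte[] bytes) {
        this.longs = longs;
        this.bytes = bytes;
    }
}

class BlockTermState {
    int docFreq;            // generic: belongs to the terms dict
    long totalTermFreq;     // generic: belongs to the terms dict
    TermMetaData metaData;  // private: belongs to the PBF
}
```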
Seeing how much overhead this added, with a second object per term, and
with a double-encode process at write time in some cases (first
writing into the byte[], then reading from the byte[] and writing into
the "final" format), I'm a little nervous about this approach.
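To make the double-encode concern concrete, something like this toy round trip (purely illustrative names, not the patch's classes):

```java
import java.io.*;

// Hedged sketch of the double-encode path: the PBF first serializes a
// term's metadata into an intermediate byte[], and the terms dict later
// re-reads that byte[] just to produce the final on-disk encoding.
class DoubleEncode {
    // pass 1: write the metadata into a throwaway byte[]
    static byte[] encodeIntermediate(long docStartFP) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeLong(docStartFP);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // pass 2: decode the byte[] again so it can be re-encoded "for real"
    static long reencodeFinal(byte[] intermediate) {
        try {
            return new DataInputStream(
                new ByteArrayInputStream(intermediate)).readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The value survives both passes, but every term pays for the extra allocation and the redundant decode.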
It's also not great that the PostingsBaseFormat is still "responsible"
for the block encoding (holding pendingTerms, implementing
read/flushTermsBlock and nextTerm): this is really an impl detail of
the terms dict, and I had thought we could move it into the terms dict
so we could simplify PBFs to be agnostic to whether the terms dict is
block-based or not.
Also, at write time, it'd be nice if the PBF got a DataOutput +
LongsRef from the terms dict which it could write the term data to,
and a "matching" DataInput + LongsRef at read time. I'm not sure
these even need to reside in BlockTermState?
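Roughly, I'm picturing an API shaped like this (interface and class names here are hypothetical, and I'm using a plain long[] where the real code would use LongsRef):

```java
import java.io.*;

// Hedged sketch of the suggested contract: the terms dict hands the PBF
// a sink at write time and a matching source at read time, instead of
// the PBF managing pendingTerms / flushTermsBlock itself.
interface TermMetaWriter {
    // Fill the monotonic longs slots and append opaque bytes to out.
    void writeTermMeta(long[] longs, DataOutput out) throws IOException;
}

interface TermMetaReader {
    // Decode exactly what writeTermMeta produced.
    void readTermMeta(long[] longs, DataInput in) throws IOException;
}

// Toy format: one monotonic file pointer in longs[0], one flag byte.
class ToyFormat implements TermMetaWriter, TermMetaReader {
    long docStartFP;
    boolean hasPayloads;

    public void writeTermMeta(long[] longs, DataOutput out) throws IOException {
        longs[0] = docStartFP;                        // monotonic slot
        out.writeByte((byte) (hasPayloads ? 1 : 0));  // opaque bytes
    }

    public void readTermMeta(long[] longs, DataInput in) throws IOException {
        docStartFP = longs[0];
        hasPayloads = in.readByte() == 1;
    }
}
```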
A few specific questions:
* Does something fail with a clear exception if a PBF tries to have a
long[n] go backwards?
* How does a "don't care" long[n] value work?
* A few sources are missing the copyright header...
* Instead of "base" and "extend" maybe we should just name them "longs"
and "bytes"?
* The "bytes" can be of arbitrary length per term, right? So why does
  the TermMetaData ctor take a length for it?
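On the first question, the kind of check I'd hope for is something like this (entirely illustrative, not code from the patch): each long[n] must be >= its value for the previous term, and a violation should fail fast with a clear exception rather than silently corrupting the FST outputs.

```java
// Hedged sketch of a monotonicity check across consecutive terms.
class MonotonicCheck {
    private long[] last;

    void check(long[] current) {
        if (last != null) {
            for (int i = 0; i < current.length; i++) {
                if (current[i] < last[i]) {
                    throw new IllegalStateException(
                        "long[" + i + "] went backwards: "
                        + current[i] + " < " + last[i]);
                }
            }
        }
        last = current.clone();  // remember the previous term's values
    }
}
```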
Finally, I think the patch is a little too ambitious: for starters,
while we are iterating on how to improve the API between terms dict
and PBF, it's too costly to try to get all postings formats cutover.
Instead, I think we should fork BlockTree to a new terms dict impl,
and fork e.g. Lucene41PB/F, and iterate on that single PostingsFormat?
This should be much less code to change as we iterate on the approach?
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
> Key: LUCENE-5029
> URL: https://issues.apache.org/jira/browse/LUCENE-5029
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Han Jiang
> Assignee: Han Jiang
> Priority: Minor
> Fix For: 4.4
>
> Attachments: LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use
> FST<BytesRef> as the base data structure. This may not share much data in
> parent arcs, since the encoded BytesRef doesn't guarantee that
> 'Outputs.common()' always creates a long prefix.
> For the current postings formats, however, it is guaranteed that each FP
> (pointing to .doc, .pos, etc.) increases monotonically with 'larger' terms.
> That means, between two Outputs, the Output from the smaller term can be
> safely pushed toward the root. However, we always have some tricky TermState
> to deal with (like the singletonDocID for the pulsing trick), so as Mike
> suggested, we can simply cut the whole TermState into two parts: one part for
> comparison and intersection, another for restoring generic data. Then the
> data structure will be clear: this generic 'TermState' will consist of a
> fixed-length LongsRef and a variable-length BytesRef.
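The sharing argument in the description above can be sketched like this: with monotonic per-term values, the output two arcs share is just the element-wise minimum, which is exactly the smaller term's entire value and can therefore be pushed toward the root (a byte[]-encoded output only shares whatever literal byte prefix the two encodings happen to have). A minimal, assumption-laden sketch of such an 'Outputs.common()':

```java
// Hedged sketch: common() for monotonic long[] outputs is the
// element-wise minimum, so the smaller term's whole output becomes
// the shared prefix. Illustrative only; not Lucene's Outputs class.
class MonotonicOutputs {
    static long[] common(long[] a, long[] b) {
        long[] shared = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            // with monotonic FPs this picks the smaller term's value
            shared[i] = Math.min(a[i], b[i]);
        }
        return shared;
    }
}
```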
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira