[ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684612#comment-13684612
 ] 

Michael McCandless commented on LUCENE-5029:
--------------------------------------------

Patch looks great, thanks Han!  It's so awesome to see all that hairy
terms block code disappearing from PostingsReader/Writer.

I think you should commit it to the branch, and then we can iterate on
the following:

I think only PostingsBaseWriter should have .longsSize(), and then the
terms dict should store this int itself and later load it at read
time.  This keeps the index "self documenting", so an errant PBF that
reports the wrong longsSize at read time is not possible.  Also, I
think it should not take a FieldInfo.  Per-field-ness is handled
higher up (PerFieldPostingsFormat).
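
A minimal sketch of the "self documenting" idea, not the actual patch:
the terms dict asks the writer-side postings impl for longsSize once,
records it in the index, and the read side takes the value from the
index rather than from the reader.  SketchPostingsWriter,
LongsSizeSketch and the 4-byte header layout are hypothetical names and
choices made up for illustration; only longsSize() comes from the
discussion above.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the writer-side postings base; not the Lucene class.
interface SketchPostingsWriter {
  int longsSize();
}

class LongsSizeSketch {
  // Write side: the terms dict queries the writer once and records the answer.
  static byte[] writeFieldHeader(SketchPostingsWriter writer) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(writer.longsSize()).array();
  }

  // Read side: the value comes from the index itself, never from the reader,
  // so a reader that would report a different size cannot corrupt decoding.
  static int readLongsSize(byte[] header) {
    return ByteBuffer.wrap(header).getInt();
  }
}
```

Because the reader never supplies the size, a mismatch between writer
and reader impls cannot silently shift the per-term long[] layout.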

I think TempBlockTermsWriter.PendingMetaData should hold the byte[]
not the RAMOutputStream?  I think RAMOutputStream holds its buffer as
1KB sized chunks... we only need the RAMOutputStream while the PBF is
finishing that term; after that we can extract & convert to byte[] I
think.
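
A rough sketch of that lifecycle, using java.io.ByteArrayOutputStream
as a stand-in for RAMOutputStream (an assumption made so the example
runs without Lucene on the classpath).  PendingMetaDataSketch and
MetaDataCollector are hypothetical names: one scratch stream is reused
across terms, and each pending entry keeps only an exact-size byte[].

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-term holder: only the compact byte[] is retained.
class PendingMetaDataSketch {
  final byte[] bytes;

  PendingMetaDataSketch(byte[] bytes) {
    this.bytes = bytes;
  }
}

class MetaDataCollector {
  // One scratch stream shared by all terms (stand-in for RAMOutputStream).
  private final ByteArrayOutputStream scratch = new ByteArrayOutputStream();
  final List<PendingMetaDataSketch> pending = new ArrayList<>();

  // Called while a term is being finished; encodedMetaData is whatever the
  // postings impl wants stored for that term.
  void finishTerm(byte[] encodedMetaData) {
    scratch.write(encodedMetaData, 0, encodedMetaData.length);
    // Extract an exact-size copy, then reset so the buffer can be reused.
    pending.add(new PendingMetaDataSketch(scratch.toByteArray()));
    scratch.reset();
  }
}
```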

Instead of -1 for "don't care", I think TempPostingsWriterBase impls
should simply not change the value?  This is part of the contract.
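
One way to read that contract, as a sketch only: the impl writes the
slots it uses and leaves the others exactly as the caller passed them.
The slot layout (0 = doc file pointer, 1 = pos file pointer), the
hasPositions flag and the class name are assumptions for illustration.

```java
class DontCareSketch {
  // Fills only the slots this impl cares about.  When the field has no
  // positions, longs[1] is left untouched instead of being set to -1.
  static void finishTerm(long[] longs, long docStartFP, long posStartFP, boolean hasPositions) {
    longs[0] = docStartFP;
    if (hasPositions) {
      longs[1] = posStartFP;
    }
  }
}
```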

Instead of making a separate PendingMetaData in the
TempBlockTermsWriter, can we put the byte[] + long[] onto the existing
PendingTerm?  Then we can just pass the slice of PendingTerm down to
flushTermsBlock, fixing it to skip the block entries.

Can we rename nextTerm to decodeTerm?  ("next" used to be appropriate
when it was decoding the next term in the block... but that's an impl
detail of the terms dict now).

Separately from this effort, now that this issue will make the
per-term long[] visible to the terms dict, we can now easily
investigate better ways of storing that long[] data than simple
delta-coded vLongs, e.g. maybe Simple64 "column stride" would work
well.  But this is separate :)
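
For reference, a self-contained sketch of the baseline mentioned above,
"simple delta-coded vLongs": each slot of a term's long[] is stored as
the difference from the same slot of the previous term, written as a
variable-length long (7 bits per byte, high bit as continuation flag).
The class and method names are hypothetical, and the sketch assumes
every slot is non-negative and non-decreasing across terms.

```java
import java.io.ByteArrayOutputStream;

class DeltaVLongSketch {
  // Variable-length encoding: 7 payload bits per byte, low-order bits first,
  // high bit set on every byte except the last.
  static void writeVLong(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.write((int) value);
  }

  // terms[t][i] is slot i of term t; each slot is delta-coded against the
  // same slot of the previous term (the first term is coded against 0).
  static byte[] encode(long[][] terms) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    if (terms.length == 0) {
      return out.toByteArray();
    }
    long[] last = new long[terms[0].length];
    for (long[] longs : terms) {
      for (int i = 0; i < longs.length; i++) {
        writeVLong(out, longs[i] - last[i]);
        last[i] = longs[i];
      }
    }
    return out.toByteArray();
  }

  static long[][] decode(byte[] bytes, int termCount, int longsSize) {
    long[][] terms = new long[termCount][longsSize];
    long[] last = new long[longsSize];
    int pos = 0;
    for (int t = 0; t < termCount; t++) {
      for (int i = 0; i < longsSize; i++) {
        long delta = 0;
        int shift = 0;
        byte b;
        do {
          b = bytes[pos++];
          delta |= (long) (b & 0x7F) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
        last[i] += delta;
        terms[t][i] = last[i];
      }
    }
    return terms;
  }
}
```

Any alternative layout, such as a column-stride packing, would need to
round-trip the same long[][] input, which makes this a convenient
baseline to compare sizes against.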

                
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, 
> LUCENE-5029.branch-init.patch, LUCENE-5029.patch, LUCENE-5029.patch, 
> LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use 
> FST<BytesRef> as the base data structure. This might not share much data in 
> parent arcs, since the encoded BytesRef doesn't guarantee that 
> 'Outputs.common()' always creates a long prefix. 
> For the current postings format, however, it is guaranteed that each FP 
> (pointing to .doc, .pos, etc.) will increase monotonically with 'larger' 
> terms. That means, between two Outputs, the Outputs from the smaller term can 
> be safely pushed towards the root. However, we always have some tricky 
> TermState to deal with (like the singletonDocID for the pulsing trick), so as 
> Mike suggested, we can simply cut the whole TermState into two parts: one part 
> for comparison and intersection, another for restoring generic data. Then the 
> data structure will be clear: this generic 'TermState' will consist of a 
> fixed-length LongsRef and a variable-length BytesRef. 
