[jira] [Commented] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

Han Jiang (JIRA) Thu, 13 Jun 2013 09:11:23 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682377#comment-13682377
 ]


Han Jiang commented on LUCENE-5029:
-----------------------------------

bq. Good lord I see you had to uncomment BlockTree's DEBUG prints: 

Oops, sorry, I didn't remove those changes from the patch :)

bq. Seeing how much overhead this added, with a 2nd object per-term, and with a 
double-encode process at write time in some cases (first writing into the 
byte[],

Yes, it seems to hurt a lot. I think we can at least downgrade this to a 
long[] + extendable part, so that it can be customized as a new MetaData on 
the PBF side?
Actually, the reason we need the long[] is that we need a 'cleaner' way to 
write the Outputs algebra.

bq. It's also not great that the PostingsBaseFormat is still "responsible" for 
the block-encoding (holding pendingTerms, implementing read/flushTermsBlock and 
nextTerm): 

We can do this later, but the pulsing codec is the tricky one, and I didn't 
dare to dig too deep: it has to force termBlockOrd=0 every time 
readTermsBlock is called. 
This extra action won't be defined by the term dict side.

bq. Does something fail with a clear exception if a PBF tries to have a long[n] 
go backwards?

Hmm, Mike, I'm not sure... What would be the goal of making it go backwards?

bq. How does a "don't care" long[n] value work?

I planned to have those don't-care values defined in the algebra operations. 
Say we have an instance A that defines A.subtract(B): the don't-care values 
in A are always treated as equal to the corresponding ones in B. The Outputs 
will then use those algebra operations to make common(), subtract(), and 
add() work as well.
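
To make the idea concrete, here is a minimal sketch of such an algebra over a 
plain long[]. It is only an illustration under my own assumptions: the class 
name LongsAlgebraSketch, the boolean[] don't-care mask, and the exception 
type are hypothetical and are not taken from the patch or from Lucene's 
Outputs API.

```java
import java.util.Arrays;

/** Hypothetical sketch of a long[] output algebra; not code from the patch. */
final class LongsAlgebraSketch {
    private LongsAlgebraSketch() {}

    /** Shared part of two outputs: the element-wise minimum. */
    static long[] common(long[] a, long[] b) {
        checkSameLength(a.length, b.length);
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    /**
     * Returns a - b. Slots flagged in dontCare are treated as equal to b's
     * value, so their delta is 0. Any other slot must not go backwards.
     */
    static long[] subtract(long[] a, long[] b, boolean[] dontCare) {
        checkSameLength(a.length, b.length);
        checkSameLength(a.length, dontCare.length);
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            if (dontCare[i]) {
                out[i] = 0; // assumed equal to b[i], so nothing to record
                continue;
            }
            if (a[i] < b[i]) {
                throw new IllegalArgumentException(
                    "long[" + i + "] goes backwards: " + a[i] + " < " + b[i]);
            }
            out[i] = a[i] - b[i];
        }
        return out;
    }

    /** Inverse of subtract: prefix + delta, element-wise. */
    static long[] add(long[] prefix, long[] delta) {
        checkSameLength(prefix.length, delta.length);
        long[] out = new long[prefix.length];
        for (int i = 0; i < prefix.length; i++) {
            out[i] = prefix[i] + delta[i];
        }
        return out;
    }

    private static void checkSameLength(int x, int y) {
        if (x != y) {
            throw new IllegalArgumentException("length mismatch: " + x + " vs " + y);
        }
    }

    public static void main(String[] args) {
        long[] prefix = {10, 5, 3};
        long[] term = {12, 5, 99};                  // slot 2 is a don't-care here
        boolean[] dontCare = {false, false, true};
        long[] delta = subtract(term, prefix, dontCare);
        System.out.println(Arrays.toString(delta));
        System.out.println(Arrays.toString(add(prefix, delta)));
    }
}
```

With this shape, a don't-care slot costs nothing in the delta, and adding the 
delta back to the prefix restores the prefix's value for that slot. The check 
in subtract() is also one possible place to fail with a clear exception if a 
long[n] goes backwards.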

bq. The "bytes" can be arbitrary length per term right? So how come 
TermMetaData ctor takes a length for it?

Since the current approach is to 'encode then decode' values into those 
bytes, the length is usually pre-defined... 
Yes, we have exceptions: in the pulsing codec, the postings data can also 
fit into this generic part. I'll change it.
                
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use 
> FST<BytesRef> as the base data structure. This might not share much data in 
> parent arcs, since the encoded BytesRef doesn't guarantee that 
> 'Outputs.common()' always creates a long prefix. 
> For the current postings format, however, it is guaranteed that each FP 
> (pointing to .doc, .pos, etc.) increases monotonically with 'larger' terms. 
> That means that, between two Outputs, the Outputs from the smaller term can 
> be safely pushed towards the root. However, we always have some tricky 
> TermState to deal with (like the singletonDocID for the pulsing trick), so, 
> as Mike suggested, we can simply cut the whole TermState into two parts: one 
> part for comparison and intersection, another for restoring generic data. 
> Then the data structure will be clear: this generic 'TermState' will consist 
> of a fixed-length LongsRef and a variable-length BytesRef. 
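
The two-part layout described above could be sketched roughly as follows. 
This is an illustration only: GenericTermStateSketch and isMonotonicAfter() 
are hypothetical names, and plain long[] / byte[] stand in for LongsRef and 
BytesRef so the example does not depend on Lucene.

```java
import java.util.Arrays;

/** Hypothetical two-part term state; names are not from the patch. */
final class GenericTermStateSketch {
    /** Fixed-length, monotonic part, e.g. file pointers into .doc and .pos. */
    final long[] longs;
    /** Variable-length opaque part, e.g. data for the pulsing trick. */
    final byte[] bytes;

    GenericTermStateSketch(long[] longs, byte[] bytes) {
        this.longs = longs.clone();
        this.bytes = bytes.clone();
    }

    /**
     * True if every long in this state is >= the matching long in prev,
     * which is the property that lets shared values move towards the root.
     */
    boolean isMonotonicAfter(GenericTermStateSketch prev) {
        if (prev.longs.length != longs.length) {
            throw new IllegalArgumentException("fixed-length parts differ in size");
        }
        for (int i = 0; i < longs.length; i++) {
            if (longs[i] < prev.longs[i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public String toString() {
        return "longs=" + Arrays.toString(longs) + " bytes=" + Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        GenericTermStateSketch smaller =
            new GenericTermStateSketch(new long[] {100, 40}, new byte[] {7});
        GenericTermStateSketch larger =
            new GenericTermStateSketch(new long[] {130, 40}, new byte[] {1, 2, 3});
        System.out.println(smaller);
        System.out.println(larger);
        System.out.println(larger.isMonotonicAfter(smaller));
    }
}
```

Only the long[] part takes part in comparison, so the byte[] part can differ 
in length from term to term without affecting how outputs are shared.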

