[jira] [Comment Edited] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

Han Jiang (JIRA) Sat, 15 Jun 2013 03:21:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684121#comment-13684121
 ]


Han Jiang edited comment on LUCENE-5029 at 6/15/13 10:21 AM:
-------------------------------------------------------------

Updated the reader part: we can now safely remove termBlockOrd from BlockTermState,
which means the API
is OK for non-block-based term dicts. As for FST-based term dict, 

I also removed 'nextTerm' from PostingsReaderBase (since it's already 
defined in TermMetaData.read()).

The remaining job is then to bring back TermStateOutputs. Mike, I still have 
doubts about that long[]+byte[] design: 
as you can see, although the algebra operations in the code have to be full of 
if clauses, it's still quite clear. 

As for the FST part, I think it should also be convenient to distinguish 
delta-decode from normal-decode, even if the term dict
part sees nothing of TermMetaData: the current FST already provides readOutput 
& readFinalOutput, and for each incoming term
the term dict can apply the algebra methods along the arcs based on which kind of 
Output it is.
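
To make the "algebra methods" concrete, here is a minimal sketch of what common/subtract/add could look like over a long[]+byte[] pair. The class MetaOutput and its fields are hypothetical names for illustration only; this is not the code in the patch and not Lucene's TermMetaData. The method names just mirror the common/subtract/add operations of the FST Outputs abstraction, and the sketch assumes the longs are non-decreasing across sorted terms.

```java
import java.util.Arrays;

/** Hypothetical output type: monotonic longs plus an opaque byte[] payload. */
final class MetaOutput {
  final long[] longs; // e.g. file pointers; assumed non-decreasing across sorted terms
  final byte[] bytes; // generic payload; null means "no payload on this arc"

  MetaOutput(long[] longs, byte[] bytes) {
    this.longs = longs;
    this.bytes = bytes;
  }

  /** Shared part of two outputs: element-wise minimum of the longs; bytes only if identical. */
  static MetaOutput common(MetaOutput a, MetaOutput b) {
    long[] min = new long[a.longs.length];
    for (int i = 0; i < min.length; i++) {
      min[i] = Math.min(a.longs[i], b.longs[i]);
    }
    byte[] shared = (a.bytes != null && Arrays.equals(a.bytes, b.bytes)) ? a.bytes : null;
    return new MetaOutput(min, shared);
  }

  /** What remains of 'full' after 'prefix' has been pushed toward the root. */
  static MetaOutput subtract(MetaOutput full, MetaOutput prefix) {
    long[] delta = new long[full.longs.length];
    for (int i = 0; i < delta.length; i++) {
      delta[i] = full.longs[i] - prefix.longs[i];
    }
    // If the prefix already carries the payload, nothing is left for this arc.
    byte[] rest = (prefix.bytes != null) ? null : full.bytes;
    return new MetaOutput(delta, rest);
  }

  /** Inverse of subtract: accumulate an arc's output onto what was read so far. */
  static MetaOutput add(MetaOutput prefix, MetaOutput delta) {
    long[] sum = new long[prefix.longs.length];
    for (int i = 0; i < sum.length; i++) {
      sum[i] = prefix.longs[i] + delta.longs[i];
    }
    byte[] payload = (prefix.bytes != null) ? prefix.bytes : delta.bytes;
    return new MetaOutput(sum, payload);
  }
}
```

With this shape, add(common(a, b), subtract(a, common(a, b))) gives back a, which is the property the term dict relies on when it sums outputs along the arcs. The null checks on bytes are the kind of if clauses mentioned above.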

The patch is still against trunk, but strangely it fails on this single test:

{code}
ant test -Dtestcase=TestDrillSideways -Dtests.method=testRandom \
  -Dtests.seed=7FEAE9B6DF414156 -Dtests.slow=true \
  -Dtests.postingsformat=TempBlock -Dtests.locale=ar_KW \
  -Dtests.timezone=America/Indiana/Winamac -Dtests.file.encoding=US-ASCII
{code}

But I suppose it is unrelated?
                
      was (Author: billy):
    Updated the reader part: we can now safely remove termBlockOrd from 
BlockTermState, which means the API 
is OK for non-block-based term dicts. As for FST-based term dict, 

I also removed 'nextTerm' from PostingsReaderBase (since it's already 
defined in TermMetaData.read()).

The remaining job is then to bring back TermStateOutputs. Mike, I still have 
doubts about that long[]+byte[] design: 
as you can see, the algebra operations have to be full of if clauses, but are still 
quite clean. The hairy part might be
common(), since we might have to judge whether two instances follow the 
monotonic principle. As for the FST part,
I think it should also be convenient to distinguish delta-decode from 
normal-decode, since FST already provides 
readOutput & readFinalOutput.

The patch is still against trunk, but strangely it fails on this single test:

{code}
ant test -Dtestcase=TestDrillSideways -Dtests.method=testRandom \
  -Dtests.seed=7FEAE9B6DF414156 -Dtests.slow=true \
  -Dtests.postingsformat=TempBlock -Dtests.locale=ar_KW \
  -Dtests.timezone=America/Indiana/Winamac -Dtests.file.encoding=US-ASCII
{code}

But I suppose it is unrelated?
                  
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, 
> LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use 
> FST<BytesRef> as their base data structure. This might not share much data in 
> parent arcs, since the encoded BytesRef doesn't guarantee that 
> 'Outputs.common()' always creates a long prefix. 
> For the current postings format, however, it is guaranteed that each FP (pointing to 
> .doc, .pos, etc.) increases monotonically with 'larger' terms. That 
> means that, between two Outputs, the Outputs from the smaller term can safely be 
> pushed towards the root. However, we always have some tricky TermState to deal 
> with (like the singletonDocID for the pulsing trick), so as Mike suggested, we 
> can simply cut the whole TermState into two parts: one part for comparison 
> and intersection, another for restoring generic data. Then the data structure 
> will be clear: this generic 'TermState' will consist of a fixed-length 
> LongsRef and a variable-length BytesRef. 
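
A small illustration of the sharing argument in the description, under stated assumptions: the little-endian encoding below is a hypothetical stand-in chosen only to show the effect, not the encoding either term dict actually uses, and PrefixSharingDemo is an invented name. Two nearby file pointers encoded as bytes may have no common byte prefix at all, while as longs the smaller value can be shared in full and only a small remainder stays on each arc.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PrefixSharingDemo {
  /** Length of the common byte prefix: the part a byte-based output could share. */
  static int commonPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /** Hypothetical fixed-width, low-byte-first encoding of a file pointer (illustration only). */
  static byte[] encodeLE(long fp) {
    return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(fp).array();
  }

  /** For monotonic longs, the shareable part is simply the smaller value. */
  static long sharedLong(long a, long b) {
    return Math.min(a, b);
  }
}
```

For file pointers 1000 and 1001, the two byte arrays differ in their first byte, so nothing can be pushed toward the root, whereas the long view shares 1000 and leaves remainders 0 and 1.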

