[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

Robert Muir (JIRA) Thu, 03 Mar 2011 11:19:00 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-2948:
--------------------------------

    Attachment: LUCENE-2948_automaton.patch

Nice work Mike, i think I found a bug with nextPossiblePrefix though?

I attached my modifications to try to use this with Automaton. (just the 
automaton parts):

I'm also somehow triggering BlockReader's assert about crossing over index 
terms with other tests...

I "think" i could see the problem here... is it that 
nextPossiblePrefix(BytesRef prefix) means it wants me to truly pass in a 
prefix? obviously a consumer doesn't know which portion of his term is/isnt a 
prefix!

So we would have to expose that :(, or alternatively change the semantics to 
nextPossiblePrefix(BytesRef term)? In other words, in this situation of 
1[\u0000]234567891 it would simply return true, because it knows 1* exists 
rather than forwarding me to s? Maybe this is what was intended all along and 
its just an off by one?

{noformat}
    [junit] NOTE: reproduce with: ant test -Dtestcase=TestFuzzyQuery 
-Dtestmethod=testTokenLengthOpt 
-Dtests.seed=4471452442745287654:-2341611255635429887 -Dtests.codec=Standard

// NOTE: this index has two terms: "12345678911" and "segment"

    [junit] ------------- Standard Output ---------------
    [junit] candidate: [\u0000]1234567891
    [junit] not found, goto: 1
    [junit] candidate: 1[\u0000]234567891
    [junit] not found, goto: s <--- this is the problem, because "12345678911" 
exists
    [junit] candidate: s1234567891
    [junit] found!
    [junit] candidate: t1234567891
    [junit] found!
{noformat}

> Make var gap terms index a partial prefix trie
> ----------------------------------------------
>
>                 Key: LUCENE-2948
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2948
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2948.patch, LUCENE-2948.patch, 
> LUCENE-2948_automaton.patch
>
>
> Var gap stores (in an FST) the indexed terms (every 32nd term, by
> default), minus their non-distinguishing suffixes.
> However, often times the resulting FST is "close" to a prefix trie in
> some portion of the terms space.
> By allowing some nodes of the FST to store all outgoing edges,
> including ones that do not lead to an indexed term, and by recording
> that this node is then "authoritative" as to what terms exist in the
> terms dict from that prefix, we can get some important benefits:
>   * It becomes possible to know that a certain term prefix cannot
>     exist in the terms index, which means we can save a disk seek in
>     some cases (like PK lookup, docFreq, etc.)
>   * We can query for the next possible prefix in the index, allowing
>     some MTQs (eg FuzzyQuery) to save disk seeks.
> Basically, the terms index is able to answer questions that previously
> required seeking/scanning in the terms dict file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Updated: (LUCENE-2948) Make var gap terms index a partial prefix trie

Reply via email to