[ 
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003858#comment-13003858
 ] 

Dawid Weiss commented on LUCENE-2948:
-------------------------------------

Nice visualization! 

> Phew! So this means for any given suffix you'd know how many terms
> end with it? But how would we then use this info to better pick index
> nodes?

Such an inverted array would allow you to precalculate which FSA states have a 
dense subtree and which don't and freeze them even before you'd start 
constructing the full FSA. But I agree that even without trying this seems like 
a costly solution.

With respect to the speed differences table above -- I believe there is going 
to be no perfect method for all kinds of queries, so it's a tradeoff. What one 
could do is try to determine the distribution (bushiness factor of 
trienfication? :) of input terms and then build separate automata for specific 
kinds of queries... But then they'd consume more RAM, so another tradeoff kicks 
in.

> Make var gap terms index a partial prefix trie
> ----------------------------------------------
>
>                 Key: LUCENE-2948
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2948
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch, 
> LUCENE-2948_automaton.patch, Results.png
>
>
> Var gap stores (in an FST) the indexed terms (every 32nd term, by
> default), minus their non-distinguishing suffixes.
> However, often times the resulting FST is "close" to a prefix trie in
> some portion of the terms space.
> By allowing some nodes of the FST to store all outgoing edges,
> including ones that do not lead to an indexed term, and by recording
> that this node is then "authoritative" as to what terms exist in the
> terms dict from that prefix, we can get some important benefits:
>   * It becomes possible to know that a certain term prefix cannot
>     exist in the terms index, which means we can save a disk seek in
>     some cases (like PK lookup, docFreq, etc.)
>   * We can query for the next possible prefix in the index, allowing
>     some MTQs (eg FuzzyQuery) to save disk seeks.
> Basically, the terms index is able to answer questions that previously
> required seeking/scanning in the terms dict file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to