[
https://issues.apache.org/jira/browse/LUCENE-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003685#comment-13003685
]
Michael McCandless commented on LUCENE-2948:
--------------------------------------------
Thanks Dawid.
It really is hairy -- I wish we could find a way to simplify it.
Pruning as-you-go, backwards on compiling a now "finished" tail is
just really hard to think about!
bq. I've been thinking about the questions stated in the comments – would it be
possible to create a "better" pruning/ path keeping method. I honestly don't
think it is possible if you add terms incrementally to the FST because some of
the information required to keep or prune states is not available until the
very end.
Yeah I do too. We'd need a more global view on how best to "spend"
the allowed budget.
For some cases (dense PKs) it takes very little budget to have a full
prefix trie, and you get good speedups for this case. But for other
cases (eg, GUIDs as your PKs) it's the polar opposite.
bq. One method I was thinking of was to determine "deep" subtrees of states
with an approximately equal size (and prune them entirely, or at least part of
them). These "deep" subtrees (or precalculated frozen states if you want to
think of them this way) can be computed by sorting reversed input sequences and
calculating their LCP (longest common prefix) table – then the shared prefixes
are actually suffixes of the input sequences... You can then linearly scan such
a reversed table and you'd know immediately how large a given subtree of
suffixes is. One problem is that this is exact only in state representation of
the FST (in the edge/LAST representation an input suffix can be integrated with
a longer suffix, as you recall).
Phew! So this means for any given suffix you'd know how many terms
end with it? But how would we then use this info to better pick index
nodes?
bq. Since we're using edge-based representation I don't think the above idea
helps much. Also, it would require sorting the reversed terms and calculating
an LCP table... this may be even more costly than what is mentioned in the code
comment – use two passes, build an FSA first, determine nodes to prune and then
build the final FST.
Yeah I fear it would be costly...
> Make var gap terms index a partial prefix trie
> ----------------------------------------------
>
> Key: LUCENE-2948
> URL: https://issues.apache.org/jira/browse/LUCENE-2948
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2948.patch, LUCENE-2948.patch, LUCENE-2948.patch,
> LUCENE-2948_automaton.patch
>
>
> Var gap stores (in an FST) the indexed terms (every 32nd term, by
> default), minus their non-distinguishing suffixes.
> However, often times the resulting FST is "close" to a prefix trie in
> some portion of the terms space.
> By allowing some nodes of the FST to store all outgoing edges,
> including ones that do not lead to an indexed term, and by recording
> that this node is then "authoritative" as to what terms exist in the
> terms dict from that prefix, we can get some important benefits:
> * It becomes possible to know that a certain term prefix cannot
> exist in the terms index, which means we can save a disk seek in
> some cases (like PK lookup, docFreq, etc.)
> * We can query for the next possible prefix in the index, allowing
> some MTQs (eg FuzzyQuery) to save disk seeks.
> Basically, the terms index is able to answer questions that previously
> required seeking/scanning in the terms dict file.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]