[ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015627#comment-13015627 ]
Dawid Weiss commented on SOLR-2378:
-----------------------------------

I'm done for the day. Lots of benchmarking and tweaking. Conclusions:

- Use the raw UTF-16 values of the input strings; they are ready to use straight from the input Strings and don't need to be converted. The automaton is slightly bigger, but not by much (read: most likely because English ASCII characters fall into the single-byte vint range anyway). A sketch of what this looks like is at the end of this comment.
- I created a non-recursive and a recursive version of the main tail-state traversal loop, but the gain is minimal.
- I have a relatively fast machine (i7, quad core), and parallel GC combined with TLABs makes local allocations nearly zero-cost. Locally reused structures give about a 5% speedup, but increase code complexity greatly.
- The benchmark currently in the codebase is very skewed. I tried on real-life data (which I cannot share, unfortunately) and the results are:
{noformat}
TSTLookup  size[B]=12,598,794
FSTLookup  size[B]=472,679

* Running 10 iterations for TSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
TSTLookup  buildTime[ms]=19  lookupTime[ms]=4,813

* Running 10 iterations for FSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
FSTLookup  buildTime[ms]=68  lookupTime[ms]=263
{noformat}
It looks as if I've made a mistake, but not really. On the actual data the fan-out rate is much higher than on long numbers. Also, real terms are shorter than longs, and in roughly 75% of the cases the queried prefix is a 2-3 letter combination, which results in lots of matches that need to be passed through the priority queue (see the top-N sketch below). With the FST, the weights (approximated weights, to be exact; see the bucketing sketch below) are part of the search structure, so the cost of building it is higher, but the gain at runtime is substantial. The lookup time should be virtually constant (with very small deviation).
- JaspellLookup does not pass the tests if the input prefix has a space inside (?). TSTLookup and FSTLookup work just fine.
- I compared against the automata API in Morfologik; the speedup there is about 30% (only ints are passed, no Arc object decompression, etc.). But then the Lucene API is more flexible.

I'll clean up the code and create a final patch for testing/evaluation tomorrow.
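To make the UTF-16 point concrete, here is a rough sketch of building a suggestion FST directly from UTF-16 code units with Lucene's FST Builder. It is illustrative only: the weightedTerms map and the direct weight output are simplified stand-ins, and the exact Builder/PositiveIntOutputs signatures differ between Lucene versions.

{code}
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;

import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;

final class Utf16FstSketch {
  // Each UTF-16 code unit of a term becomes one input label of the automaton;
  // no conversion of the String to UTF-8 bytes is required.
  static IntsRef toUtf16Labels(String term) {
    IntsRef labels = new IntsRef(term.length());
    for (int i = 0; i < term.length(); i++) {
      labels.ints[i] = term.charAt(i); // char value used directly as an int label
    }
    labels.length = term.length();
    return labels;
  }

  // Build an FST mapping each term to a weight. A SortedMap gives the
  // UTF-16-sorted input order that Builder.add() requires.
  static FST<Long> buildSuggestFst(SortedMap<String, Long> weightedTerms) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(true);
    // 16-bit labels are enough to hold any UTF-16 code unit.
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE2, outputs);
    for (Map.Entry<String, Long> e : weightedTerms.entrySet()) {
      builder.add(toUtf16Labels(e.getKey()), e.getValue());
    }
    return builder.finish();
  }
}
{code}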
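The "lots of matches passed through the priority queue" cost looks roughly like this: every completion found under the (short) prefix is offered to a small bounded min-heap that retains only the top N by weight. This is a plain-java sketch of the pattern, not the actual Lookup code; Completion and topN are made-up names.

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {
  static final class Completion {
    final String term;
    final long weight;
    Completion(String term, long weight) { this.term = term; this.weight = weight; }
  }

  // Keep only the 'num' highest-weighted completions. The queue is a min-heap,
  // so the weakest retained candidate sits at the head and is evicted first;
  // with thousands of completions under a 2-3 letter prefix this loop dominates.
  static List<Completion> topN(Iterable<Completion> completions, int num) {
    PriorityQueue<Completion> pq =
        new PriorityQueue<Completion>(num, Comparator.<Completion>comparingLong(c -> c.weight));
    for (Completion c : completions) {
      if (pq.size() < num) {
        pq.add(c);
      } else if (pq.peek().weight < c.weight) {
        pq.poll(); // drop the current weakest candidate
        pq.add(c);
      }
    }
    List<Completion> best = new ArrayList<Completion>(pq);
    best.sort(Comparator.<Completion>comparingLong(c -> c.weight).reversed());
    return best;
  }
}
{code}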
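On "approximated weights": the exact weight is not stored; it is quantized into a small number of buckets and the bucket is what ends up encoded in the automaton, so suggestions come back bucket by bucket, best bucket first, rather than in exact weight order. A rough sketch of such a quantization (the linear scaling and the bucket handling here are just for illustration, not necessarily what the patch does):

{code}
final class WeightBucketSketch {
  // Quantize an exact weight into one of 'buckets' coarse classes, scaling
  // linearly between the observed minimum and maximum weight. Suggestions can
  // then be returned highest bucket first instead of by exact weight.
  static int weightToBucket(long weight, long minWeight, long maxWeight, int buckets) {
    if (maxWeight == minWeight) {
      return buckets - 1; // degenerate case: all weights identical
    }
    double normalized = (double) (weight - minWeight) / (double) (maxWeight - minWeight); // 0..1
    int bucket = (int) (normalized * buckets);
    return Math.min(bucket, buckets - 1); // weight == maxWeight would otherwise overflow
  }
}
{code}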
> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
>                 Key: SOLR-2378
>                 URL: https://issues.apache.org/jira/browse/SOLR-2378
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>              Labels: lookup, prefix
>             Fix For: 4.0
>
>
> Implement a subclass of Lookup based on finite state automata/transducers
> (Lucene FST package). This issue is for implementing a relatively basic
> prefix matcher; we will handle infixes and other types of input matches
> gradually. Impl. phases:
> - write a DFA-based suggester effectively identical to the ternary tree based
>   solution right now,
> - baseline benchmark against the tern. tree (memory consumption, rebuilding
>   speed, indexing speed; reuse Andrzej's benchmark code),
> - modify the DFA to encode term weights directly in the automaton (optimize for
>   the onlyMostPopular case),
> - benchmark again,
> - add infix suggestion support with prefix matches boosted higher (?),
> - benchmark again,
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]