[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015627#comment-13015627
 ] 

Dawid Weiss commented on SOLR-2378:
-----------------------------------

I'm done for the day. Lots of benchmarking and tweaking. Conclusions:
- Use raw utf16 values of input strings as they are ready to use in the input 
Strings and don't need to be converted. The automaton is slightly bigger, but 
not that much (read: most likely because English ascii chars fall into single 
byte vint range anyway).
- I created a non-recursive and recursive version of the main tail-state 
traversal loop, but the gain is minimal.
- I have a relatively fast machine (i7, quad core) and parallel GC combined 
with tlabs makes local allocations nearly zero-cost. Locally reused structures 
speed up by 5%, but increase code complexity greatly.
- The benchmark currently in the codebase is very skewed. I tried on real life 
data (that I cannot share, unfortunately) and the results are:
{noformat}
TSTLookup            size[B]=12,598,794
FSTLookup            size[B]=472,679
* Running 10 iterations for TSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
TSTLookup            buildTime[ms]=19 lookupTime[ms]=4,813
* Running 10 iterations for FSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
FSTLookup            buildTime[ms]=68 lookupTime[ms]=263
{noformat}
It looks as if I've made a mistake, but not really. With the actual data the 
fan-out rate is much higher than on long numbers. Also, real terms are shorter 
than longs and 75% of their prefix is usually a 2-3 letter combination 
resulting in lots of matches that need to be passed through the priority queue, 
etc. With the FST weights are part of the search structure (approximated 
weights, to be exact) and so the cost of building it is  higher, but the gain 
at runtime is substantial. The lookup time should be virtually constant (with 
very small deviation).
- JaspellLookup does not pass the tests if the input prefix has a space inside 
(?). TSTLookup and FSTLookup work just fine.
- I compared against automata API in Morfologik; the speedup is about 30% (only 
ints are passed, no Arc object decompression, etc.). But then -- the API of 
Lucene is more flexible.

I'll clean up the code and create a final patch for testing/ evaluation 
tomorrow.

> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
>                 Key: SOLR-2378
>                 URL: https://issues.apache.org/jira/browse/SOLR-2378
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>              Labels: lookup, prefix
>             Fix For: 4.0
>
>
> Implement a subclass of Lookup based on finite state automata/ transducers 
> (Lucene FST package). This issue is for implementing a relatively basic 
> prefix matcher, we will handle infixes and other types of input matches 
> gradually. Impl. phases:
> - write a DFA based suggester effectively identical to ternary tree based 
> solution right now,
> - baseline benchmark against tern. tree (memory consumption, rebuilding 
> speed, indexing speed; reuse Andrzej's benchmark code)
> - modify DFA to encode term weights directly in the automaton (optimize for 
> onlyMostPopular case)
> - benchmark again
> - add infix suggestion support with prefix matches boosted higher (?)
> - benchmark again
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to