Analyzing Suggester
-------------------

                 Key: LUCENE-3842
                 URL: https://issues.apache.org/jira/browse/LUCENE-3842
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/spellchecker
    Affects Versions: 3.6, 4.0
            Reporter: Robert Muir


Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
comparator in LUCENE-3801,
I think we should look at implementing suggesters that have more capabilities 
than just basic prefix matching.

In particular I think the most flexible approach is to integrate with Analyzer 
at both build and query time,
such that we build a wFST with:
* input: analyzed text such as ghost0christmas0past (byte 0 here is an 
  optional token separator)
* output: surface form such as "the ghost of christmas past"
* weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but run the shortest-path 
operation only on the weight side (like the test in LUCENE-3801), accumulating 
the output (surface form) along the way. The accumulated output is the actual 
suggestion.
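
As a rough illustration of the data layout described above, here is a minimal 
sketch in plain Java. It does not use Lucene's FST or PairOutputs classes: a 
sorted TreeMap stands in for the wFST, and sorting the prefix matches by weight 
stands in for the shortest-path search. The class name AnalyzedSuggestSketch, 
the Entry type, and the convention that a larger weight means a better 
suggestion are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy stand-in for the proposed wFST: analyzed key -> (weight, surface form). */
class AnalyzedSuggestSketch {
    /** Optional token separator (byte 0 in the description above). */
    static final char SEP = '\0';

    /** The pair stored per key: suggestion weight plus surface form. */
    static final class Entry {
        final long weight;
        final String surface;
        Entry(long weight, String surface) {
            this.weight = weight;
            this.surface = surface;
        }
    }

    // Sorted map keyed on the analyzed form; several surface forms may share a key.
    private final TreeMap<String, List<Entry>> index = new TreeMap<>();

    void add(String analyzedKey, String surface, long weight) {
        index.computeIfAbsent(analyzedKey, k -> new ArrayList<>())
             .add(new Entry(weight, surface));
    }

    /** Returns up to topN surface forms whose analyzed key starts with the prefix. */
    List<String> lookup(String analyzedPrefix, int topN) {
        List<Entry> hits = new ArrayList<>();
        // Keys sharing the prefix sort into the range [prefix, prefix + '\uffff').
        for (List<Entry> entries : index
                .subMap(analyzedPrefix, true, analyzedPrefix + '\uffff', false)
                .values()) {
            hits.addAll(entries);
        }
        // Assumption: larger weight = better suggestion, so sort descending.
        hits.sort((a, b) -> Long.compare(b.weight, a.weight));
        List<String> out = new ArrayList<>();
        for (Entry e : hits.subList(0, Math.min(topN, hits.size()))) {
            out.add(e.surface);
        }
        return out;
    }
}
```

The point of the sketch is only that matching happens on the analyzed key 
while the value returned to the user is the stored surface form. A real wFST 
would share prefixes and suffixes between keys and would find the top 
suggestions without collecting every match first.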

This allows a lot of flexibility:
* Even with StandardAnalyzer you can offer suggestions that ignore 
stopwords, e.g. if you type in "ghost of chr...",
  it will suggest "the ghost of christmas past"
* we can add support for synonyms/wdf/etc at both index and query time (there 
are tradeoffs here, and this is not implemented!)
* this is a basis for more complicated suggesters such as Japanese suggesters, 
where the analyzed form is in fact the reading,
  so we would add a TokenFilter that copies ReadingAttribute into term text to 
support that...
* other general things like offering suggestions that are more "fuzzy" like 
using a plural stemmer or ignoring accents or whatever.
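
To make the stopword and accent cases above concrete, here is a small sketch 
of how the same analysis could be applied at build time and at query time. It 
is plain Java and does not use Lucene's Analyzer API: the ToyAnalysis class, 
its analyze method, and the tiny stopword list are hypothetical stand-ins 
chosen for illustration only.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: fold accents, lowercase, drop stopwords, join with byte 0. */
class ToyAnalysis {
    /** Optional token separator, as in the ghost0christmas0past example. */
    static final char SEP = '\0';

    // Hypothetical stopword list, far smaller than any real analyzer's.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    static String analyze(String text) {
        // Decompose accented characters, then strip the combining marks.
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");
        StringBuilder key = new StringBuilder();
        for (String token : folded.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue; // skip stopwords and empty tokens from leading whitespace
            }
            if (key.length() > 0) {
                key.append(SEP);
            }
            key.append(token);
        }
        return key.toString();
    }
}
```

With this sketch, analyze("the ghost of christmas past") gives the build-time 
key ghost0christmas0past and analyze("ghost of chr") gives ghost0chr, which is 
a prefix of that key, so the partial query still reaches the suggestion even 
though the user typed a stopword and omitted the leading "the".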

According to my benchmarks, suggestions are still very fast with the prototype 
(e.g. ~ 100,000 QPS), and the FST size does not
explode (it is just short of twice that of a regular wFST, but this is still far 
smaller than TST or JaSpell, etc).

