[jira] [Commented] (LUCENE-3842) Analyzing Suggester

Robert Muir (JIRA) Mon, 17 Sep 2012 19:26:09 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457552#comment-13457552
 ]


Robert Muir commented on LUCENE-3842:
-------------------------------------

{quote}
I think it would be better/cleaner to append unique (disambiguating) bytes to 
the end of the analyzed bytes (this was Robert's original idea): then each path 
is a single result. The only downside I can think of is we will have to reserve 
a byte (0xFF?), ie we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, 
... but since these input BytesRefs are "typically" UTF-8 ... this seems not so 
bad? They can of course in general be arbitrary bytes since they are produced 
by the analysis process...
{quote}

I don't understand why we have to reserve any bytes. We can append arbitrary 
bytes of any sort to the end of the input side; this will have no effect on the 
actual surface form that we suggest.
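
To make the point concrete, here is a minimal sketch in Python (not Lucene code), using a plain dict in place of the FST. The names analyze, build_entries and lookup are hypothetical, and the toy analyzer (lowercase, drop a few stopwords, join tokens with byte 0) is only an assumption for illustration. Each duplicate analyzed form gets a one-byte counter appended on the input side, while the surface form stored on the output side is untouched. The sketch assumes fewer than 256 duplicates per analyzed form.

```python
from collections import defaultdict

def analyze(text):
    # Hypothetical stand-in for an Analyzer: lowercase, drop a few
    # stopwords, and join the remaining tokens with byte 0.
    stop = {"the", "of", "a"}
    tokens = [t for t in text.lower().split() if t not in stop]
    return b"\x00".join(t.encode("utf-8") for t in tokens)

def build_entries(surface_forms):
    """Map (analyzed bytes + one disambiguating byte) to a surface form."""
    seen = defaultdict(int)   # how many times each analyzed form occurred
    entries = {}
    for surface in surface_forms:
        key = analyze(surface)
        n = seen[key]
        seen[key] += 1
        # The suffix lives only on the input side; the output (the
        # suggestion itself) is stored unchanged. Assumes n < 256.
        entries[key + bytes([n])] = surface
    return entries

def lookup(entries, typed):
    """Return surface forms whose input key starts with the analyzed query."""
    prefix = analyze(typed)
    return [s for k, s in sorted(entries.items()) if k.startswith(prefix)]
```

With the inputs "the ghost of christmas past" and "Ghost of Christmas Past", both analyze to the same bytes, yet each ends up on its own key, and lookup(entries, "ghost of chr") returns both surface forms exactly as they were given.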

                
> Analyzing Suggester
> -------------------
>
>                 Key: LUCENE-3842
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3842
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/spellchecker
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Robert Muir
>         Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
> LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
> LUCENE-3842.patch, LUCENE-3842.patch, 
> LUCENE-3842-TokenStream_to_Automaton.patch
>
>
> Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
> comparator in LUCENE-3801,
> I think we should look at implementing suggesters that have more capabilities 
> than just basic prefix matching.
> In particular I think the most flexible approach is to integrate with 
> Analyzer at both build and query time,
> such that we build a wFST with:
> input: analyzed text such as ghost0christmas0past <-- byte 0 here is an 
> optional token separator
> output: surface form such as "the ghost of christmas past"
> weight: the weight of the suggestion
> we make an FST with PairOutputs<weight,output>, but only do the shortest path 
> operation on the weight side (like
> the test in LUCENE-3801), at the same time accumulating the output (surface 
> form), which will be the actual suggestion.
> This allows a lot of flexibility:
> * Using even StandardAnalyzer means you can offer suggestions that ignore 
> stopwords, e.g. if you type in "ghost of chr...",
>   it will suggest "the ghost of christmas past"
> * we can add support for synonyms/wdf/etc at both index and query time (there 
> are tradeoffs here, and this is not implemented!)
> * this is a basis for more complicated suggesters such as Japanese 
> suggesters, where the analyzed form is in fact the reading,
>   so we would add a TokenFilter that copies ReadingAttribute into term text 
> to support that...
> * other general things like offering suggestions that are more "fuzzy" like 
> using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the 
> prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (it's short of twice that of a regular wFST, but this is still far 
> smaller than TST or JaSpell, etc).
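
The input/output/weight layout from the issue description above can be modeled roughly as follows. This is a sketch in Python under stated assumptions, not the patch's implementation: a flat list stands in for the wFST, a linear scan plus heapq.nlargest stands in for the shortest-path search, and ToySuggester, analyze and the stopword set are hypothetical names chosen for illustration. Matching is done on the analyzed form, ranking on the weight, and the result is the surface form.

```python
import heapq

STOPWORDS = {"the", "of", "a"}  # assumed stopword set, for illustration only

def analyze(text):
    # Hypothetical analyzer: lowercase, drop stopwords, and join the
    # remaining tokens with byte 0 as the token separator.
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return b"\x00".join(t.encode("utf-8") for t in tokens)

class ToySuggester:
    """Toy model: input = analyzed bytes, output = surface form, plus a weight."""

    def __init__(self):
        self._entries = []  # list of (analyzed bytes, weight, surface form)

    def add(self, surface, weight):
        # Build time: run the analyzer once and keep the surface form as output.
        self._entries.append((analyze(surface), weight, surface))

    def suggest(self, typed, k=5):
        # Query time: analyze what the user typed, match it as a prefix of
        # the analyzed input, and return the k heaviest surface forms.
        prefix = analyze(typed)
        matches = [(w, s) for key, w, s in self._entries
                   if key.startswith(prefix)]
        return [s for w, s in heapq.nlargest(k, matches)]
```

After adding "the ghost of christmas past" with weight 50, typing "ghost of chr" yields that suggestion even though the leading "the" was never typed, because the stopwords are dropped on both the build side and the query side.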

