[
https://issues.apache.org/jira/browse/SOLR-9429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434509#comment-15434509
]
Alessandro Benedetti commented on SOLR-9429:
--------------------------------------------
Hi David, I updated the description. Unfortunately it is not an entire blog
post but only a blog comment; it relates to a possible application at
autocompletion time.
I checked the Solr and Lucene token filters to see if any of them does this
out of the box, but I didn't find one.
Blog post (related to autocompletion):
http://alexbenedetti.blogspot.co.uk/2015/07/solr-you-complete-me.html
The original conversation:
"About getting matches for "Video gamign" using FuzzyLookupFactory, what if we
apply analysis on spelling correction of "gamign", i.e., "gaming" to get
stemmed tokens. This way we get results.
Alessandro Benedetti, 23 August 2016 at 10:52
Hi Shyamsunder, you mean using an analyzer that performs spell correction
(dictionary based?) and then stemming?
It could be possible.
First we define a TokenFilter that does the spell correction based on a
dictionary (it is actually a good idea, but I think it doesn't exist out of
the box).
Then we can specify a stemming token filter, and the game is done.
This is actually a good idea, and can potentially be useful in a number of use
cases:
https://issues.apache.org/jira/browse/SOLR-9429
Shyamsunder, 23 August 2016 at 23:14
You got it. Thanks for considering my idea. "
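To make the correction behavior concrete, here is a rough sketch in plain Java of the matching logic such a filter would apply per token. This is not Lucene/Solr API code (a real implementation would be a TokenFilter over the token stream, ideally FST-based); the class and method names are made up for illustration, and the dictionary, input text, and edit-distance threshold are taken from this issue's example:

```java
import java.util.*;

// Hypothetical sketch (not the Lucene API): replace each token with its
// closest dictionary entry within a maximum Levenshtein distance.
public class SpellcheckSketch {

    // Classic dynamic-programming Levenshtein distance, two rows at a time.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Return the closest dictionary entry if within maxDistance,
    // otherwise keep the token unchanged.
    static String correct(String token, Set<String> dictionary, int maxDistance) {
        String best = token;
        int bestDist = maxDistance + 1;
        for (String entry : dictionary) {
            int d = levenshtein(token, entry);
            if (d < bestDist) { bestDist = d; best = entry; }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("gaming", "gamer"));
        StringBuilder out = new StringBuilder();
        for (String token : "gamign is a strong industry".split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(correct(token, dictionary, 2));
        }
        System.out.println(out);  // prints: gaming is a strong industry
    }
}
```

With a threshold of 2 edits, "gamign" maps to "gaming" (two substitutions) while words like "industry" that are far from every dictionary entry pass through unchanged. The brute-force scan over the dictionary is only for illustration; the FST intersection discussed in the issue avoids it.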
> Spellcheck Token Filter
> -----------------------
>
> Key: SOLR-9429
> URL: https://issues.apache.org/jira/browse/SOLR-9429
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Components: Schema and Analysis
> Reporter: Alessandro Benedetti
> Priority: Minor
>
> This issue is about the design and implementation of a new token filter
> called SpellcheckTokenFilter.
> This new token filter takes the token stream as input and returns corrected
> tokens, based on a dictionary.
> The aim of the token filter is to fix misspelled words and index the correct
> tokens.
> e.g.
> Given dictionary d1 :
> gaming
> gamer
> Given text t1 for the field f1 :
> gamign is a strong industry
> The token filter will return in output :
> gaming is a strong industry
> A first possible design is to mimic the approach used in the spellchecker:
> build an FST for the dictionary, then build a Levenshtein automaton for each
> token and intersect the two.
> Possible applications include OCR-generated text and other use cases where
> misspelled words are common and we want to clean them up at indexing time.
> This could also be used in a more complex analyzer, adding a stemmer afterwards.
> This is a draft idea coming from a blog comment by Shyamsunder.
> Feedback and additional ideas are welcome!
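As a sketch of how the proposed filter could slot into an analysis chain with a stemmer afterwards, a Solr field type might look like the following. Note that SpellcheckTokenFilterFactory and its dictionary/maxEdits parameters are assumptions (the factory does not exist yet, it is what this issue proposes); the tokenizer and stemmer factories are existing Solr ones:

```xml
<fieldType name="text_spellcorrected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- hypothetical: the filter proposed in SOLR-9429 -->
    <filter class="solr.SpellcheckTokenFilterFactory" dictionary="spellings.txt" maxEdits="2"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The point of the ordering is that correction runs before stemming, so "gamign" is first fixed to "gaming" and only then stemmed.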
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]