[jira] [Commented] (SOLR-4277) Spellchecker sometimes falsely reports a spelling error and correction

James Dyer (JIRA) Mon, 07 Jan 2013 07:52:14 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545976#comment-13545976
 ]


James Dyer commented on SOLR-4277:
----------------------------------

Jack, 

Can you take a look at 
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount 
and also SOLR-2585?  I think at least some of what you describe is intended 
behavior.  For instance, if you specify "spellcheck.alternativeTermCount", then 
it will try to give you suggestions for every word in the query.  The purpose 
of "alternativeTermCount" is to let you generate "did-you-mean" style 
suggestions.  It also helps in cases where a word is misspelled but is still a 
valid word in the dictionary.  (For instance, see the "flew form Heathrow" 
example in section 3.3.5 of Manning's IR book.)

I agree your humorous example with "clock check" is indeed quite absurd.  
Perhaps we can add a feature to prevent it from reduplicating terms when 
generating a collation?  But keep in mind if in fact the user queried ("6 
o'clock" check) ... or ... (+clock -check) ... or ... (foo:clock AND bar:check) 
... then these corrections might indeed change the semantics of the query.  I 
would think that such a feature needs to be smart about these sorts of 
scenarios as well (much as is done when correcting word breaks; see 
SOLR-2993).

Short of a new feature to prevent reduplication, the current way to keep this 
from happening (most of the time) is to specify 
"spellcheck.maxResultsForSuggest=n".  With that setting you won't get 
suggestions unless your query returns n or fewer results.  Obviously this 
feature is hard to exercise on a very small index.

Also, if you had an index with a lot of documents and you specified 
"alternativeTermCount" together with "maxResultsForSuggest" set above 0, then 
you wouldn't get suggestions at all unless your hit count was at or below the 
threshold.
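
As a rough illustration of the "maxResultsForSuggest" workaround described 
above, the following shell sketch builds a request URL for the /spell handler 
used in the issue report and adds the parameter at request time.  The helper 
name spell_url and the threshold value 5 are made up for illustration; the 
host and handler path come from the report's own curl commands.  Passing the 
parameter on the URL instead of setting it in solrconfig.xml is an assumption 
about how one might try this out.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr): print a /spell request URL that
# asks for suggestions only when the query returns at most $2 results.
# Assumes the Solr 4.0 example server on localhost:8983, as in the report.
#   $1 = query term (must already be URL-encoded)
#   $2 = hit-count threshold n for spellcheck.maxResultsForSuggest
spell_url() {
  printf 'http://localhost:8983/solr/spell/?q=%s&spellcheck.maxResultsForSuggest=%s&indent=true\n' "$1" "$2"
}

# Usage (needs a running Solr instance, so it is left commented out):
#   curl "$(spell_url check 5)"
```

Whether suggestions come back for a given query then depends on how its hit 
count compares with the chosen threshold, which is why the effect is hard to 
see on an index with only a few documents.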

                
> Spellchecker sometimes falsely reports a spelling error and correction
> ----------------------------------------------------------------------
>
>                 Key: SOLR-4277
>                 URL: https://issues.apache.org/jira/browse/SOLR-4277
>             Project: Solr
>          Issue Type: Bug
>          Components: spellchecker
>    Affects Versions: 4.0
>            Reporter: Jack Krupansky
>
> In some cases, the Solr spell checker improperly reports query terms as being 
> misspelled.
> Using the Solr example for 4.0, I added these mini documents:
> {code}
> curl "http://localhost:8983/solr/update?commit=true" \
>   -H 'Content-type:application/csv' -d '
> id,name
> spel-1,aardvark abacus ball bill cat cello
> spel-2,abate accord band bell cattle check
> spel-3,adorn border clean clock'
> {code}
> I then issued this request:
> {code}
> curl "http://localhost:8983/solr/spell/?q=check&indent=true"
> {code}
> The spell checker falsely concluded that "check" was misspelled and 
> improperly corrected it to "clock":
> {code}
> <lst name="spellcheck">
>   <lst name="suggestions">
>     <lst name="check">
>       <int name="numFound">1</int>
>       <int name="startOffset">0</int>
>       <int name="endOffset">5</int>
>       <int name="origFreq">1</int>
>       <arr name="suggestion">
>         <lst>
>           <str name="word">clock</str>
>           <int name="freq">1</int>
>         </lst>
>       </arr>
>     </lst>
>     <bool name="correctlySpelled">false</bool>
>     <lst name="collation">
>       <str name="collationQuery">clock</str>
>       <int name="hits">1</int>
>       <lst name="misspellingsAndCorrections">
>         <str name="check">clock</str>
>       </lst>
>     </lst>
>   </lst>
> </lst>
> {code}
> And if I query for "clock", it gets corrected to "check"!
> {code}
> curl "http://localhost:8983/solr/spell/?q=clock&indent=true"
> {code}
> {code}
>   <lst name="suggestions">
>     <lst name="clock">
>       <int name="numFound">1</int>
>       <int name="startOffset">0</int>
>       <int name="endOffset">5</int>
>       <int name="origFreq">1</int>
>       <arr name="suggestion">
>         <lst>
>           <str name="word">check</str>
>           <int name="freq">1</int>
>         </lst>
>       </arr>
>     </lst>
>     <bool name="correctlySpelled">false</bool>
>     <lst name="collation">
>       <str name="collationQuery">check</str>
>       <int name="hits">1</int>
>       <lst name="misspellingsAndCorrections">
>         <str name="clock">check</str>
>       </lst>
>     </lst>
>   </lst>
> {code}
> Note: This appears to be only because "clock" is so close to "check". With 
> other terms I don't see the problem:
> {code}
> curl "http://localhost:8983/solr/spell/?q=cattle+abate+check&indent=true"
> {code}
> {code}
>   <lst name="suggestions">
>     <lst name="check">
>       <int name="numFound">1</int>
>       <int name="startOffset">13</int>
>       <int name="endOffset">18</int>
>       <int name="origFreq">1</int>
>       <arr name="suggestion">
>         <lst>
>           <str name="word">clock</str>
>           <int name="freq">1</int>
>         </lst>
>       </arr>
>     </lst>
>     <bool name="correctlySpelled">false</bool>
>     <lst name="collation">
>       <str name="collationQuery">cattle abate clock</str>
>       <int name="hits">2</int>
>       <lst name="misspellingsAndCorrections">
>         <str name="cattle">cattle</str>
>         <str name="abate">abate</str>
>         <str name="check">clock</str>
>       </lst>
>     </lst>
>   </lst>
> {code}
> It also inappropriately lists "cattle" and "abate" in the "misspellings" 
> section even though no suggestions were offered for them.
> Finally, I can work around this issue by removing the following line from 
> solrconfig.xml:
> {code}
>       <str name="spellcheck.alternativeTermCount">5</str>
> {code}
> Which responds to the previous request with:
> {code}
>   <lst name="suggestions">
>     <bool name="correctlySpelled">false</bool>
>   </lst>
> {code}
> This makes the original problem go away. It does, however, raise the question 
> of why my 100% correct query is still tagged as "correctlySpelled" = 
> "false", but that's a separate Jira.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
