Problems with Spellchecker in 3.1

2011-04-26 Thread Bob Sandiford
Hi, all.

Sorry for any duplication - seems like what I sent yesterday never made it 
through...


We're having some troubles with the Solr Spellcheck Response.  We're running 
version 3.1.



Overview:  If we search for something really ugly like:



  "kljhklsdjahfkljsdhf book rck"



then when we get back the response, there's a suggestions list for 'rck', but 
no suggestions list for the other two words.  For 'book', that's fine, because 
it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be 
any suggestions.  For the ugly thing, though, there aren't any hits.



The problem is that when we're handling the result, we can't tell the 
difference between no suggestions for a 'correctly spelled' term, and no 
suggestions for something that's odd like this.



(Now - this is also happening with searches that aren't as obviously garbage - 
i.e. words that are real words, but that just don't show up in the index and 
have no suggestions - this was just to illustrate the point).



Our setup:

We're running multiple shards, which may be part of the issue.  For example, 
'book' might be found in one of the shards, but not another.
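
Because a word such as 'book' may be in one shard's index and not in another's, whatever merges the per-shard results has to decide what "found" means across shards. Below is a minimal sketch of one possible rule: a word counts as found if any shard reports a non-zero document frequency for it. The class and method names are hypothetical and not part of Solr, and the per-shard maps are only a stand-in for whatever each shard actually returns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardFrequencyMerge {

    /** Sums the document frequency reported for each word across all shards. */
    public static Map<String, Integer> mergeFrequencies(List<Map<String, Integer>> perShard) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : perShard) {
            for (Map.Entry<String, Integer> entry : shard.entrySet()) {
                Integer soFar = merged.get(entry.getKey());
                merged.put(entry.getKey(), (soFar == null ? 0 : soFar) + entry.getValue());
            }
        }
        return merged;
    }

    /** A word is treated as found if at least one shard had hits for it. */
    public static boolean foundAnywhere(String word, Map<String, Integer> merged) {
        Integer freq = merged.get(word);
        return freq != null && freq > 0;
    }

    public static void main(String[] args) {
        // Assumed example data: 'book' has hits in the first shard only.
        Map<String, Integer> shardOne = new HashMap<>();
        shardOne.put("book", 12);
        shardOne.put("rck", 0);
        Map<String, Integer> shardTwo = new HashMap<>();
        shardTwo.put("book", 0);
        shardTwo.put("rck", 0);

        List<Map<String, Integer>> perShard = new ArrayList<>();
        perShard.add(shardOne);
        perShard.add(shardTwo);

        Map<String, Integer> merged = mergeFrequencies(perShard);
        System.out.println("book found: " + foundAnywhere("book", merged));
        System.out.println("rck found: " + foundAnywhere("rck", merged));
    }
}
```

With a rule like this, a word missing from one shard is not reported as misspelled so long as another shard has hits for it.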



I don't *think* this has anything to do with our schema, since it's really how 
the Search Suggestions are being returned to us.  But, here are some bits and 
pieces:

From schema.xml:



   

   





From solrconfig.xml:



   

  



textSpell





  default

  textSpell

  ./spellchecker





  



What we'd really like to see is the response coming back with an indication 
that a word wasn't found / had no suggestions.  We've hacked around in the code 
a little bit to do this, but were wondering if anyone has come across this, and 
what approaches you've taken.



We created new classes which extend IndexBasedSpellChecker and 
SpellCheckComponent, as follows (package and imports excluded for (sort of) 
brevity).  The methods are as taken from the overridden classes, with changes 
noted by "SD" type comments...





/**
 * This has a slight modification of Solr's
 * AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the
 * suggestion.  This modification working in tandem with the
 * SirsiDynixSpellCheckComponent allows words with no suggestions to be
 * returned from the spell check component even in a sharded search.
 * Changes are marked with SD in the comments.
 */
public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker {
  @Override
  public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
    boolean shardRequest = false;
    SolrParams params = options.customParams;
    if (params != null) {
      shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
    }
    SpellingResult result = new SpellingResult(options.tokens);
    IndexReader reader = determineReader(options.reader);
    Term term = field != null ? new Term(field, "") : null;
    float theAccuracy = (options.accuracy == Float.MIN_VALUE) ? spellChecker.getAccuracy() : options.accuracy;

    int count = Math.max(options.count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);
    for (Token token : options.tokens) {
      String tokenText = new String(token.buffer(), 0, token.length());
      String[] suggestions = spellChecker.suggestSimilar(tokenText,
          count,
          field != null ? reader : null, //workaround LUCENE-1295
          field,
          options.onlyMorePopular, theAccuracy);
      if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
        //These are spelled the same, continue on
        List<String> suggList = Arrays.asList(suggestions); //SD added
        result.add(token, suggList); //SD added
        continue;
      }

      if (options.extendedResults == true && reader != null && field != null) {
        term = term.createTerm(tokenText);
        result.add(token, reader.docFreq(term));
        int countLimit = Math.min(options.count, suggestions.length);
        if (countLimit > 0) {
          for (int i = 0; i < countLimit; i++) {
            term = term.createTerm(suggestions[i]);
            result.add(token, suggestions[i], reader.docFreq(term));
          }
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      } else {
        if (suggestions.length > 0) {
          List<String> suggList = Arrays.asList(suggestions);
          if (suggestions.length > options.count) {
            suggList = suggList.subList(0, options.count);
          }
          result.add(token, suggList);
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      }
    }
    return result;
  }
}
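
With the checker above, a token can come back in three ways: with itself as its only suggestion (spelled the same), with an empty list (no suggestions, on a shard request), or with real suggestions. Here is a minimal sketch of how code consuming those results might tell the cases apart. The class, the enum and the plain Map input are all hypothetical; they are not Solr types and only illustrate the convention.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SuggestionClassifier {

    public enum Status { CORRECTLY_SPELLED, HAS_SUGGESTIONS, NO_SUGGESTIONS, NOT_REPORTED }

    /** Classifies one token from a (hypothetical) token-to-suggestions map. */
    public static Status classify(String token, Map<String, List<String>> suggestionsByToken) {
        List<String> suggestions = suggestionsByToken.get(token);
        if (suggestions == null) {
            // The token has no entry at all, so nothing can be said about it.
            return Status.NOT_REPORTED;
        }
        if (suggestions.isEmpty()) {
            // An explicit empty list: the word was checked and nothing was close.
            return Status.NO_SUGGESTIONS;
        }
        if (suggestions.size() == 1 && suggestions.get(0).equals(token)) {
            // The only suggestion is the word itself: spelled the same.
            return Status.CORRECTLY_SPELLED;
        }
        return Status.HAS_SUGGESTIONS;
    }

    public static void main(String[] args) {
        // Assumed example data, modelled on the query in this message.
        Map<String, List<String>> byToken = new HashMap<>();
        byToken.put("book", Arrays.asList("book"));
        byToken.put("rck", Arrays.asList("rock", "rick", "rack"));
        byToken.put("kljhklsdjahfkljsdhf", Collections.<String>emptyList());

        for (String token : Arrays.asList("kljhklsdjahfkljsdhf", "book", "rck")) {
            System.out.println(token + " -> " + classify(token, byToken));
        }
    }
}
```

The empty list is what makes the 'ugly' term distinguishable from 'book'; without it, both would simply be absent from the response.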







/**

* This is a

Problems with Spellchecker in 3.1

2011-04-25 Thread Bob Sandiford
Oops.  Sorry.  I'm hijacking my own thread to put a real Subject in place...

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com 


> -Original Message-
> From: Bob Sandiford
> Sent: Monday, April 25, 2011 5:34 PM
> To: solr-user@lucene.apache.org
> Subject:
> 
> Hi, all.
> 
> We're having some troubles with the Solr Spellcheck Response.  We're
> running version 3.1.
> 
> Overview:  If we search for something really ugly like:
> "kljhklsdjahfkljsdhf book rck"
> 
> then when we get back the response, there's a suggestions list for
> 'rck', but no suggestions list for the other two words.  For 'book',
> that's fine, because it is 'spelled correctly' (i.e. we got hits on the
> word) and there shouldn't be any suggestions.  For the ugly thing,
> though, there aren't any hits.
> 
> The problem is that when we're handling the result, we can't tell the
> difference between no suggestions for a 'correctly spelled' term, and
> no suggestions for something that's odd like this.
> 
> (Now - this is happening with searches that aren't as obviously garbage
> - this was just to illustrate the point).
> 
> Our setup:
> We're running multiple shards, which may be part of the issue.  For
> example, 'book' might be found in one of the shards, but not another.
> 
> I don't *think* this has anything to do with our schema, since it's
> really how the Search Suggestions are being returned to us.
> 
> What we'd really like to see is the response coming back with an
> indication that a word wasn't found / had no suggestions.  We've hacked
> around in the code a little bit to do this, but were wondering if
> anyone has come across this, and what approaches you've taken.
> 
> Here's the xml we're getting back from the search:
> 
> 
> 
> 
> 
> 
>   0
>   56
>   
> true
> true
> score desc, RELEVANCE_SORT_nsort desc
> spellcheckedStandard
> true
> 1000
> true
>  ELECTRONIC_ACCESS_display ISBN_display TITLE_boost
> FORMAT_display score MEDIA_TYPE_display AUTHOR_boost LOCALURL_display
> UPC_display id DOC_ID_display CHILD_SITE_display DS_EC
> PRIMARY_AUTHOR_boost PRIMARY_TITLE_boost DS_ID TOPIC_display
> ASSET_NAME_display OCLC_display
> <str name="shards">localhost:8983/solr/SD_ILS/,localhost:8983/solr/SD_ASSET/</str>
> 
> 
>   AUTHOR_facet
>   FORMAT_facet
>   LANGUAGE_facet
>   PUBDATE_nfacet
>   SUBJECT_facet
>   ABCDEF_cfacet
> 
> spellcheckedStandard
> 
>   ACCESS_LEVEL_nfacet:"0"
>   CLEARANCE_nfacet:"0"
>   NEED_TO_KNOWS_facet:"@@EMPTY@@"
>   CITIZENSHIPS_facet:"@@EMPTY@@"
>   RESTRICTIONS_facet:"@@EMPTY@@"
> 
> 1
> true
> *
> 12
> 5
> 0
> TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^200.0
> OR PRIMARY_AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^100.0 OR
> DOC_TEXT:"kljhklsdjahfkljsdhf book rck"~100^2 OR
> PRIMARY_TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^1000.0 OR
> AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^20.0 OR
> textFuzzy:kljhklsdjahfkljsdhf~0.7 AND textFuzzy:book~0.7 AND
> textFuzzy:rck~0.7
>   
> 
> 
> 
>   
>   
> 
> 
> 
> 
> 
> 
>   
>   
>   
> 
> 
> 
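>   <lst name="suggestions">
>     <lst name="rck">
>       <int name="numFound">5</int>
>       <int name="startOffset">362</int>
>       <int name="endOffset">365</int>
>       <int name="origFreq">0</int>
>       <arr name="suggestion">
>         <lst>
>           <str name="word">rock</str>
>           <int name="freq">24000</int>
>         </lst>
>         <lst>
>           <str name="word">rick</str>
>           <int name="freq">6048</int>
>         </lst>
>         <lst>
>           <str name="word">rack</str>
>           <int name="freq">84</int>
>         </lst>
>         <lst>
>           <str name="word">reck</str>
>           <int name="freq">78</int>
>         </lst>
>         <lst>
>           <str name="word">ruck</str>
>           <int name="freq">30</int>
>         </lst>
>       </arr>
>     </lst>
>     <bool name="correctlySpelled">false</bool>
>   </lst>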
> 
> 
> 
> 
> 
> Thanks!
> 
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com