Cassandra Targett created SOLR-10314:
----------------------------------------
Summary: Spellcheck with SnowballPorterFilterFactory and Synonyms
doesn't work well
Key: SOLR-10314
URL: https://issues.apache.org/jira/browse/SOLR-10314
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: spellchecker
Reporter: Cassandra Targett
Fix For: 6.4, 5.5
As noted in SOLR-10252, the default spellcheck configuration in the
data_driven_schema_configs (and basic_configs) uses the {{\_text_}} field as
the default spellcheck field. This field is of the {{text_general}} field type.
If I use this default configuration for spellcheck, but modify the
{{text_general}} field type to use the SnowballPorterFilterFactory (with
language=German in this case), and have synonyms in my analysis chain, queries
to the {{/spell}} request handler fail when they contain 2 or more terms
that are each preceded by a {{+}} operator.
Note that the default spellcheck configuration also enables collation
({{spellcheck.collate}}). If I disable that, I do not get any error. I also do
not get an error if I use only 1 term, even if it is spelled correctly. If at
least one of the terms is spelled incorrectly, there is no error either.
So, in summary, there's a pretty specific list of variables at work here:
# {{/spell}} request handler
# 2 or more terms, all spelled correctly (or, all terms exist in the index)
# all terms required with {{+}}
# synonyms (there is a big list in this case, which I cannot share...see
SOLR-10252 for an example of the parsed query to see how big the list can get)
# SnowballPorterFilter
# spellcheck.collate=true
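The conditions above can be sketched as a request against the {{/spell}} handler. This is only an illustration, not the exact query from the report: the query terms ({{haus}}, {{garten}}) are hypothetical placeholders, the base URL is the core URL taken from the error message, and collation is assumed to be controlled by the {{spellcheck.collate}} request parameter. The helper name {{build_spell_url}} is made up for this example.

```python
from urllib.parse import urlencode

# Core URL as it appears in the error message of this report.
SOLR_CORE_URL = "http://localhost:7574/solr/spelltest3_shard1_replica2"


def build_spell_url(terms, collate=True):
    """Build a /spell request URL in which every term is required with '+'."""
    # Prefix each term with '+' so that all terms are mandatory.
    query = " ".join("+" + term for term in terms)
    params = {
        "q": query,
        "spellcheck": "true",
        # Assumption: collation is toggled with spellcheck.collate.
        "spellcheck.collate": "true" if collate else "false",
        "wt": "json",
    }
    return SOLR_CORE_URL + "/spell?" + urlencode(params)


# Hypothetical terms; the report requires that both exist in the index.
failing_url = build_spell_url(["haus", "garten"])
# Same query with collation disabled, which the report says does not fail.
no_collation_url = build_spell_url(["haus", "garten"], collate=False)

print(failing_url)
print(no_collation_url)
```

Requesting the first URL against a core configured as described should exercise the failing path, while the second corresponds to the variant with collation disabled.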
The error returned is:
{code}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String
index out of range: -1
{code}
I ran several experiments and found that if synonyms are removed from the
field type (and thus from the query analysis chain), the query succeeds with
collations enabled. So the cause is not SnowballPorterFilter by itself, but its
combination with {{+}}, synonyms, and collation.
The field type definition is:
{code}
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true"
            synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
{code}
This problem was found with 5.5.2, but I verified it still exists in 6.4 and
6.5.