[jira] [Updated] (SOLR-10314) Spellcheck with SnowballPorterFilterFactory and Synonyms doesn't work well

Cassandra Targett (JIRA) Fri, 17 Mar 2017 09:35:59 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Cassandra Targett updated SOLR-10314:
-------------------------------------
    Description: 
As noted in SOLR-10252, the default spellcheck configuration in the 
data_driven_schema_configs (and basic_configs) uses the {{\_text_}} field as 
the default field for spellcheck. This field has the {{text_general}} field type.

If I use this default configuration for spellcheck, but modify the 
{{text_general}} field type to use the SnowballPorterFilterFactory (with 
language=German in this case), and have synonyms in my analysis chain, queries 
to the {{/spell}} request handler will fail when there are 2 or more terms 
that are each preceded by a {{+}} operator. 

Note that the default spellcheck configuration also enables 
{{spellcheck.collate}}. If I disable that, I do not get any error. I also do 
not get an error if I use only 1 term, even if it is spelled "correctly". If at 
least one of the terms is spelled incorrectly, there is no error either.

So, in summary, there's a pretty specific list of variables at work here:

# {{/spell}} request handler
# 2 or more terms, all spelled correctly (or rather, all terms exist in the index)
# all terms required with {{+}}
# synonyms (there is a big list in this case, which I cannot share...see 
SOLR-10252 for an example of the parsed query to see how big the list can get)
# SnowballPorterFilter
# spellcheck.collate=true
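
The request side of the list above can be sketched as a small script. This is 
only an illustration: the base URL (including the collection name 
{{spelltest3}}, guessed from the replica name in the error message) and the two 
query terms are assumptions, not values from the actual test setup. It builds 
one URL with all conditions in place and one with collation disabled, which is 
the variant that did not fail for me.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: base URL and collection name are hypothetical; adjust as needed.
BASE_URL = "http://localhost:7574/solr/spelltest3"


def spell_url(terms, collate=True, base_url=BASE_URL):
    """Build a /spell request URL in which every term is required with '+'."""
    query = " ".join("+" + term for term in terms)
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.collate": "true" if collate else "false",
        "wt": "json",
    }
    # urlencode escapes the literal '+' operator as %2B and spaces as '+'.
    return base_url + "/spell?" + urlencode(params)


# Hypothetical terms; both would need to exist in the index to hit the error.
failing = spell_url(["haus", "garten"])
workaround = spell_url(["haus", "garten"], collate=False)

print(failing)
print(workaround)
```

Note that the {{+}} operator has to be URL-encoded; sent unescaped in a query 
string it is decoded as a space, and the terms are then no longer required.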

The error returned is: 
{code}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String 
index out of range: -1
{code}

I ran several experiments and found that if synonyms are removed from the 
field type (and thus the query analysis chain), the query is successful with 
collations enabled. So the cause is not SnowballPorterFilter by itself, but 
its combination with {{+}}, synonyms, and collation.

The field type definition is:

{code}
  <fieldType name="text_general" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" 
ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" 
ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" 
synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
{code}

This problem was found with 5.5.2, but I verified it still exists in 6.4 and 
6.5.

  was:
As noted in SOLR-10252, the default spellcheck configuration in the 
data_driven_schema_configs (and basic_configs) uses the {{\_text_}} field as 
the default field for spellcheck. This field is {{text_general}} field type.

If I use this default configuration for spellcheck, but modify the 
{{text_general}} field to use the SnowballPorterFilterFactory (with 
language=German in this case), and have synonyms in my analysis chain, queries 
to the {{/spell}} request handler will fail when there are 2 or more terms 
which are both preceded with a {{+}} operator. 

Note that the default spellcheck configuration also enables 
spellcheck.collation - if I disable that, I do not get any error. I also do not 
get an error if I use only 1 term, even if it is spelled "correctly". If at 
least one of the terms is spelled incorrectly, that also does not give an error.

So, in summary, there's a pretty specific list of variables at work here:

# {{/spell}} request handler
# 2 or more terms, both spelled correctly (or, both terms exist in the index)
# all terms required with {{+}}
# synonyms (there is a big list in this case, which I cannot share...see 
SOLR-10252 for an example of the parsed query to see how big the list can get)
# SnowballPorterFilter
# spellcheck.collation=true

The error returned is: 
{code}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String 
index out of range: -1
{code}

I made several experiments and found that if synonyms are removed from the 
field type (and thus the query analysis chain), the query is successful with 
collations enabled. So it's not SnowballPorterFilter by itself, but with {{+}} 
and synonyms and collation.

The field type definition is:

{code}
  <fieldType name="text_general" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" 
ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" 
ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" 
synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
{code}

This problem was found with 5.5.2, but I verified it still exists in 6.4 and 
6.5.


> Spellcheck with SnowballPorterFilterFactory and Synonyms doesn't work well
> --------------------------------------------------------------------------
>
>                 Key: SOLR-10314
>                 URL: https://issues.apache.org/jira/browse/SOLR-10314
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: spellchecker
>            Reporter: Cassandra Targett
>             Fix For: 5.5, 6.4
>


