[
https://issues.apache.org/jira/browse/SOLR-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960972#comment-15960972
]
James Dyer commented on SOLR-10252:
-----------------------------------
Even before SOLR-8381, the example used "text_general" / "text" for the default
spellcheck analyzer and field. (cf. c52030b) I see this field type uses
synonyms and stop-words, but not stemming. Stemming would be a problem for a
lot of users for a spellcheck field, but synonyms and stopwords are maybe not
so bad. I would imagine the case with the large overlapping synonym list does
not occur typically, and if it does occur, then does all of this additional
spell-checking cause a significant performance problem?
That's the problem with one-size-fits-all configurations, but new users have to
start somewhere. We could make this better, perhaps:
- we could put a comment to explain why you might not want "text_general" for
spellcheck in production.
- we could add a second catch-all copyField that uses "text_ws" so users would
have an optimal spellcheck field. But this seems like a lot of overhead to
solve what might be an edge-case problem.
- we could make the spellcheck field be something like "spell_ws" and put a
comment there that users need to actually index content with this field name
for spellcheck to work. This avoids both problems but also makes spellcheck
harder to enable for new users.
thoughts?
> Example spellcheck config uses _text_ as default field
> ------------------------------------------------------
>
> Key: SOLR-10252
> URL: https://issues.apache.org/jira/browse/SOLR-10252
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: examples, spellchecker
> Affects Versions: 6.4.2
> Reporter: Cassandra Targett
>
> SOLR-8381 made the {{\_text_}} field the default field for spellchecking for
> the basic_configs and data_driven_schema_configs example configsets. This is
> a copyField that gets all it's data from every other field in the index.
> This field is also of text_general type, which has a default analysis chain
> that includes stopwords and synonyms. If someone has a large synonym list,
> perhaps with a lot of overlapping matches, this would cause spell checking to
> occur on every one of those terms. I recently saw a parsed query that looked
> like this:
> {code}"+(((_text_:partn _text_:gesellschaft _text_:teilhab _text_:konkubinat
> _text_:eheahn _text_:eheahn _text_:konkubinatspaar _text_:konkubinatspartn
> _text_:konkubinatsvertrag _text_:lebenspartn _text_:nichteheahn
> _text_:nichteheahn _text_:nichtehe _text_:wild _text_:registriert
> _text_:eingetrag _text_:eingetrag _text_:registriert _text_:vertragspartei
> _text_:kontrahent _text_:partei _text_:vertragspartn)/no_coord)
> ((_text_:gemeinschaft _text_:lebensgemeinschaft _text_:gemeinschaft
> _text_:lebensgemeinschaft _text_:lebensgemeinschaft _text_:ehe
> _text_:partnerschaft _text_:partnerschaft _text_:partn
> _text_:partnerschaft)/no_coord) _text_:gleichgeschlecht _text_:paar)
> +_text_:gestorb"
> {code}
> Since we recommend that users use a lightly analyzed field for spell
> checking, using {{\_text_}} and text_general seems a problematic example for
> us to start people out with. The example above is a lot of extra work for
> little reason.
> I'm not sure what a better field is - those two examples are minimal by
> design, and we can't be sure what field they might have in the index to make
> it work out of the box. However, perhaps we can consider a better field type?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]