Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Jack Krupansky Wed, 08 Aug 2012 17:14:11 -0700

Digging through the Jira and revision history, I discovered that back at the end of May 2011, a change was made to Solr that fairly significantly degrades the OOTB behavior for Solr queries, namely for word-splitting of terms with embedded punctuation, so that they end up, by default, doing the OR of the sub-terms, rather than doing the obvious phrase query of the sub-terms.


Just a couple of examples:


CD-ROM => CD OR ROM rather than "CD ROM"
1,000 => 1 OR 000 rather than "1 000" (when using the WordDelimiterFilter)
out-of-the-box => out OR of OR the OR box rather than "out of the box"
3.6 => 3 OR 6 rather than "3 6" (when using the WordDelimiterFilter)
docid-001 => docid OR 001 rather than "docid 001"

All of those queries will give surprising and unexpected results.

Back to the history of the change, there was a lot of lively discussion on SOLR-2015 - add a config hook for autoGeneratePhraseQueries:

https://issues.apache.org/jira/browse/SOLR-2015

And the actual change to default to the behavior described above was SOLR-2519 - improve defaults for text_* field types:

https://issues.apache.org/jira/browse/SOLR-2519

I gather that the original motivation was for non-European languages, and that even some European languages might search better without auto-phrase generation, but the decision to default English terms to NOT automatically generate phrase queries and to generate OR queries instead is rather surprising and unexpected and outright undesirable, as my examples above show.

I had been aware of the behavior for quite some time, but I had thought it was simply a lingering bug so I paid little attention to it, until I stumbled across this autoGeneratePhraseQueries "feature" while looking at the query parser code. I can understand the need to disable automatic phrase queries for SOME languages, but to disable it by default for English seems rather bizarre, as my simple use cases above show.

I'll file this as a Jira, but I wanted to call wider attention to it in case others were as unaware as I was that what had seemed like buggy behavior was done intentionally.

Unless there has been a change of heart since SOLR-2015/2519, I guess we are stuck with the default TextField behavior, but at least we could improve the example schema in several ways:


1. The English text field types should have autoGeneratePhraseQueries=true.

2. Add commentary about the impact of autoGeneratePhraseQueries=true/false, in terms of use case examples, as above. Specifically note the ones that will break if the feature is disabled.
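
To make items 1 and 2 concrete, here is a rough sketch of how such a field type might look in the example schema.xml. The field type name "text_en_phrase" and the two-step analyzer chain are hypothetical, chosen only for illustration; the real text_en definition has more filters. The only part that matters here is the autoGeneratePhraseQueries attribute and the accompanying comment.

```xml
<!-- Hypothetical field type, for illustration only.
     autoGeneratePhraseQueries="true": when a single query term is split
     into several sub-terms by analysis, the sub-terms are searched as a
     phrase, e.g. CD-ROM => "CD ROM".
     autoGeneratePhraseQueries="false": the sub-terms are ORed instead,
     e.g. CD-ROM => CD OR ROM.
     Set to false, or use a different field type, for languages that do
     not separate words with whitespace. -->
<fieldType name="text_en_phrase" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Assumed minimal chain; substitute the real English analysis here. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The attribute is set per field type, so it can be turned on for the English types without touching the TextField default.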


Another, more controversial change will be:

3. Change text_general to autoGeneratePhraseQueries=true so that English will be treated reasonably by default. I suspect that most European languages will be at least "okay". A comment will note that this field attribute should be removed or set to false for non-whitespace languages, or that an alternative field type should be used. I suspect that the first thing any non-whitespace language application will want to do is pick the text field type that has analysis that makes the most sense for them, so I see no need to mess up English for no good reason.

Make no mistake, #3 is the primary and only real goal of this OOTB improvement. Maybe "text_general" could be kept as is for reference as the purported "general" text field type (except that it doesn't work well for English, as shown above), and maybe there should be a "text_default" that I would propose should be text_en with commentary to direct users to the other choices for language.

I would note that text_ja already has autoGeneratePhraseQueries=false, so I'm not sure why the default in the TextField code had to be changed to false. Any languages for which automatic phrase query generation is problematic should be attributed similarly. But, now that it is wired into the schema defaults, we may be stuck with it.

I was rather surprised that SOLR-2519 actually changed the default in TextField rather than simply set the attribute as appropriate for the various text field types.

There are probably also a couple of places in the wikis where the surprising behavior should be noted.

And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the kinds of use cases that unsuspecting users may not realize were BROKEN by the commit of SOLR-2519 that is masked under the innocent phrasing of "improve defaults for text_* field types". How many users seriously understood that a query with embedded dashes and commas behaves differently as a result of that change?

I am contemplating whether to suggest that the WordDelimiterFilter should also be part of the default text field type. Right now, it is hidden off in text_en_splitting.

I'll file the Jira tomorrow. Feel free to hold off comments until the Jira appears.

-- Jack Krupansky

