Regardless of how you change or don't change the examples, I just want to put in a plug for better documentation. A number of Solr users were caught by surprise when the default was changed in Solr/Lucene 3.5. I tried to find out how to modify the release notes to call attention to this, but gave up too soon.

See: http://lucene.472066.n3.nabble.com/autoGeneratePhraseQueries-sort-of-silently-set-to-false-tc3770638.html

Tom Burton-West

On Thu, Aug 9, 2012 at 1:25 PM, Yonik Seeley (JIRA) <[email protected]> wrote:
>
> [https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432003#comment-13432003]
>
> Yonik Seeley commented on SOLR-3723:
> ------------------------------------
>
> bq. I think apps that want this behaviour should simply use
> text_en_splitting. That's why we have that field type.
>
> We could also create a text_en_pureOr (or whatever name fits better) field
> type that always interpreted a-b as (a OR B) and then apps that want that
> behavior could use that.
>
> But we're also talking about what the best default for english (i.e.
> text_en) in general is.
> The defaults for "text" in general are a different question. Looking at
> all of the arguments so far, my judgement is still that for text_en,
> interpreting a-team as "a team" is far preferable to (a OR team)
>
> > Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
> > ----------------------------------------------------------------------------------------------
> >
> > Key: SOLR-3723
> > URL: https://issues.apache.org/jira/browse/SOLR-3723
> > Project: Solr
> > Issue Type: Improvement
> > Components: Schema and Analysis
> > Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
> > Reporter: Jack Krupansky
> >
> > Digging through the Jira and revision history, I discovered that back at
> > the end of May 2011, a change was made to Solr that fairly significantly
> > degrades the OOTB behavior for English Solr queries, namely for
> > word-splitting of terms with embedded punctuation, so that they end up, by
> > default, doing the OR of the sub-terms, rather than doing the obvious
> > phrase query of the sub-terms.
> >
> > Just a couple of examples:
> > 1. CD-ROM => CD OR ROM rather than “CD ROM”
> > 2. 1,000 => 1 OR 000 rather than “1 000” (when using the
> >    WordDelimiterFilter innocently added to text_general or text_en)
> > 3. out-of-the-box => out OR of OR the OR box rather than “out of the box”
> > 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter
> >    innocently added to text_general or text_en)
> > 5. docid-001 => docid OR 001 rather than "DOCID 001"
> >
> > All of those queries will give surprising and unexpected results.
> >
> > Note: The hyphen issue is present in StandardTokenizer, even if WDF is
> > not used. Side note: The full behavior of StandardTokenizer should be more
> > fully documented on the Analyzers wiki.
> >
> > Back to the history of the change, there was a lot of lively discussion
> > on SOLR-2015 - add a config hook for autoGeneratePhraseQueries.
> >
> > And the actual change to default to the behavior described above was
> > SOLR-2519 - improve defaults for text_* field types.
> >
> > (Consider the entire discussion in those two issues incorporated here
> > for reference. Anyone wishing to participate in discussion on this issue
> > would be well-advised to study those two issues first.)
> >
> > I gather that the original motivation was for non-European languages,
> > and that even some European languages might search better without
> > auto-phrase generation, but the decision to default English terms to NOT
> > automatically generate phrase queries and to generate OR queries instead is
> > rather surprising and unexpected and outright undesirable, as my examples
> > above show.
> >
> > I had been aware of the behavior for quite some time, but I had thought
> > it was simply a lingering bug so I paid little attention to it, until I
> > stumbled across this autoGeneratePhraseQueries "feature" while looking at
> > the query parser code. I can understand the need to disable automatic
> > phrase queries for SOME languages, but to disable it by default for English
> > seems rather bizarre, as my simple use cases above show.
> >
> > Even if no action is taken on this Jira, I feel that it is important
> > that there be a wider awareness of the significant and unexpected impact
> > from SOLR-2519, and that what had seemed like buggy behavior was done
> > intentionally.
> >
> > Unless there has been a change of heart since SOLR-2015/2519, I guess we
> > are stuck with the default TextField behavior, but at least we could
> > improve the example schema in several ways:
> > 1. The English text field types should have
> >    autoGeneratePhraseQueries=true. If a user innocently adds a word
> >    delimiter to text_en, for example, they need to know that
> >    autoGeneratePhraseQueries=true is needed. Better to preempt that
> >    confusion and put the attribute in now. In fact, hyphenated terms fail
> >    as I have noted above, so the addition is needed even if a WDF is not
> >    added.
> > 2. Add commentary about the impact of
> >    autoGeneratePhraseQueries=true/false - in terms of use case examples,
> >    as above. Specifically note the ones that will break with if the
> >    feature is disabled.
> >
> > Another, more controversial change will be:
> > 3. Change text_general to autoGeneratePhraseQueries=true so that English
> >    will be treated reasonably by default. I suspect that most European
> >    languages will be at least "okay". A comment will note that this field
> >    attribute should be removed or set to false for non-whitespace
> >    languages, or that an alternative field type should be used. I suspect
> >    that the first thing any non-whitespace language application will want
> >    to do is pick the text field type that has analysis that makes the most
> >    sense for them, so I see no need to mess up English for no good reason.
> >
> > Make no mistake, #3 is the primary and only real goal of this OOTB
> > improvement. Maybe "text_general" could be kept as is for reference as
> > the purported "general" text field type (except that it doesn't work well
> > for English, as shown above), and maybe there should be a "text_default"
> > that I would propose should be a literal copy of text_en with commentary to
> > direct users to the other choices for language.
> >
> > I would note that text_ja already has autoGeneratePhraseQueries=false,
> > so I'm not sure why the default in the TextField code had to be changed to
> > false. Any languages for which automatic phrase query generation is
> > problematic should be attributed similarly. But, now that it is wired into
> > the schema defaults, we may be stuck with it.
> >
> > I was rather surprised that SOLR-2519 actually changed the default in
> > TextField rather than simply set the attribute as appropriate for the
> > various text field types.
> >
> > There are probably also a couple of places in the wikis where the
> > surprising behavior should be noted. There is literally no wiki
> > documentation for this important feature. There are only two references to
> > autoGeneratePhraseQueries, with no discussion of exactly what this feature
> > does or what the downside is if it is disabled.
> >
> > In the past, there was no need to document the treatment of embedded
> > word delimiters (well, okay, the poor handling for non-whitespace languages
> > SHOULD have been documented), but now there is no documentation of the
> > degradation of what was a default and implicit feature that a lot of people
> > assume should be automatic.
> >
> > And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
> > kinds of use cases that unsuspecting users may not realize were BROKEN by
> > the commit of SOLR-2519 that is masked under the innocent phrasing of
> > "improve defaults for text_* field types". How many users seriously
> > understood that a query with embedded dashes and commas behave differently
> > as a result of that change?
> >
> > I am contemplating whether to suggest that the WordDelimiterFilter
> > should also be part of the default text field type. Right now, it is hidden
> > off in text_en_splitting.
> >
> > I think stemming should also be part of the default English field type.
> > The whole point of the "example" schema is to show-off the best of
> > Lucene/Solr.
> >
> > I'm not quite ready to propose that English be the default language
> > supported by the example schema, but I am 99.999% certain that we should
> > focus it on European, Roman, Latin languages. Non-European languages are
> > indeed important, and should probably have their own schema. text_general
> > was a good idea, but in hindsight it appears to have not been such a great
> > idea in light of the word-splitting problems I have highlighted above.
> >
> > Maybe I would propose that text_general be left as is, but that we add
> > text_default which is a copy of text_en (which would have WDF and stemming
> > added) and fields use text_default as their type. That way, it would be
> > clear what is going on and users could sensibly see what needs to happen if
> > they wish to switch default languages.
> >
> > After discussion settles, a revised final proposal will be composed. And
> > some specific and non-controversial issues may be split into separate Jira
> > issues.
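
For anyone trying to picture the difference described in the quoted issue, here is a toy Python sketch of the two interpretations. It is not Solr's query parser: split_subterms and build_query are hypothetical helpers made up for illustration, and the regex split only roughly approximates what StandardTokenizer or WordDelimiterFilter would produce (no lowercasing, no stemming). The boolean argument stands in for the autoGeneratePhraseQueries field type attribute.

```python
import re


def split_subterms(term):
    """Split a query term on runs of non-alphanumeric characters.

    Toy stand-in for analysis-time word splitting; real analyzers
    apply many more rules than this.
    """
    return [part for part in re.split(r"[^0-9A-Za-z]+", term) if part]


def build_query(term, auto_generate_phrase_queries):
    """Render the sub-terms either as a phrase query or as an OR query."""
    parts = split_subterms(term)
    if not parts:
        return ""
    if len(parts) == 1:
        # Nothing was split, so the flag makes no difference.
        return parts[0]
    if auto_generate_phrase_queries:
        return '"' + " ".join(parts) + '"'
    return "(" + " OR ".join(parts) + ")"


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False), "vs", build_query(term, True))
```

With the flag set to False, every punctuated term in the loop turns into a disjunction of its pieces, which is the behavior the issue complains about for English; with True, the same pieces are kept together as a phrase.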
