Re: [jira] [Commented] (SOLR-3723) Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Tom Burton-West Thu, 09 Aug 2012 13:12:44 -0700

Regardless of how you change or don't change the examples, I just want to
put in a plug for better documentation.  A number of Solr users were taken
by surprise when the default was changed in Solr/Lucene 3.5.  I tried to
find out how to modify the release notes to call attention to this, but
gave up too soon.  See:
http://lucene.472066.n3.nabble.com/autoGeneratePhraseQueries-sort-of-silently-set-to-false-tc3770638.html
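
For anyone trying to document this, the difference described in the issue
quoted below can be shown with a toy model.  This is only a sketch, not
Solr's query parser: the function names and the "split on any
non-alphanumeric character" rule are assumptions made for illustration,
and the rule is cruder than what StandardTokenizer or WordDelimiterFilter
actually do.

```python
import re


def split_subterms(term):
    # Hypothetical splitter: break on any run of non-alphanumeric characters.
    # Real Solr analysis chains are more selective than this.
    return [part for part in re.split(r"[^0-9A-Za-z]+", term) if part]


def build_query(term, auto_generate_phrase_queries):
    # Toy stand-in for what happens when one query term yields several
    # sub-terms during analysis.
    parts = split_subterms(term)
    if len(parts) <= 1:
        return parts[0] if parts else term
    if auto_generate_phrase_queries:
        # Sub-terms are kept together as a phrase.
        return '"' + " ".join(parts) + '"'
    # Sub-terms are combined as independent optional clauses.
    return " OR ".join(parts)


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False),
              "rather than", build_query(term, True))
```

Running it prints one line per example term, with the OR form next to the
phrase form, which is the kind of side-by-side comparison the release
notes could have carried.
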
Tom Burton-West
On Thu, Aug 9, 2012 at 1:25 PM, Yonik Seeley (JIRA) wrote:


>
>     [
> https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432003#comment-13432003]
>
> Yonik Seeley commented on SOLR-3723:
> ------------------------------------
>
> bq. I think apps that want this behaviour should simply use
> text_en_splitting. That's why we have that field type.
>
> We could also create a text_en_pureOr (or whatever name fits better) field
> type that always interpreted a-b as (a OR b) and then apps that want that
> behavior could use that.
>
> But we're also talking about what the best default for English (i.e.
> text_en) in general is.
> The defaults for "text" in general are a different question.  Looking at
> all of the arguments so far, my judgement is still that for text_en,
> interpreting a-team as "a team" is far preferable to (a OR team).
>
>
> > Improve OOTB behavior: English word-splitting should default to
> autoGeneratePhraseQueries=true
> >
> ----------------------------------------------------------------------------------------------
> >
> >                 Key: SOLR-3723
> >                 URL: https://issues.apache.org/jira/browse/SOLR-3723
> >             Project: Solr
> >          Issue Type: Improvement
> >          Components: Schema and Analysis
> >    Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
> >            Reporter: Jack Krupansky
> >
> > Digging through the Jira and revision history, I discovered that back at
> the end of May 2011, a change was made to Solr that fairly significantly
> degrades the OOTB behavior for English Solr queries, namely for
> word-splitting of terms with embedded punctuation, so that they end up, by
> default, doing the OR of the sub-terms, rather than doing the obvious
> phrase query of the sub-terms.
> > Just a couple of examples:
> > 1. CD-ROM => CD OR ROM rather than “CD ROM”
> > 2. 1,000 => 1 OR 000 rather than “1 000” (when using the
> WordDelimiterFilter innocently added to text_general or text_en)
> > 3. out-of-the-box => out OR of OR the OR box rather than “out of the box”
> > 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter
> innocently added to text_general or text_en)
> > 5. docid-001 => docid OR 001 rather than "docid 001"
> > All of those queries will give surprising and unexpected results.
> > Note: The hyphen issue is present in StandardTokenizer, even if WDF is
> not used. Side note: The full behavior of StandardTokenizer should be more
> fully documented on the Analyzers wiki.
> > Back to the history of the change, there was a lot of lively discussion
> on SOLR-2015 - add a config hook for autoGeneratePhraseQueries.
> > And the actual change to default to the behavior described above was
> SOLR-2519 - improve defaults for text_* field types.
> > (Consider the entire discussion in those two issues incorporated here
> for reference. Anyone wishing to participate in discussion on this issue
> would be well-advised to study those two issues first.)
> > I gather that the original motivation was for non-European languages,
> and that even some European languages might search better without
> auto-phrase generation, but the decision to default English terms to NOT
> automatically generate phrase queries and to generate OR queries instead is
> rather surprising and unexpected and outright undesirable, as my examples
> above show.
> > I had been aware of the behavior for quite some time, but I had thought
> it was simply a lingering bug so I paid little attention to it, until I
> stumbled across this autoGeneratePhraseQueries "feature" while looking at
> the query parser code. I can understand the need to disable automatic
> phrase queries for SOME languages, but to disable it by default for English
> seems rather bizarre, as my simple use cases above show.
> > Even if no action is taken on this Jira, I feel that it is important
> that there be a wider awareness of the significant and unexpected impact
> from SOLR-2519, and that what had seemed like buggy behavior was done
> intentionally.
> > Unless there has been a change of heart since SOLR-2015/2519, I guess we
> are stuck with the default TextField behavior, but at least we could
> improve the example schema in several ways:
> > 1. The English text field types should have
> autoGeneratePhraseQueries=true. If a user innocently adds a word delimiter
> to text_en, for example, they need to know that
> autoGeneratePhraseQueries=true is needed. Better to preempt that confusion
> and put the attribute in now. In fact, hyphenated terms fail as I have
> noted above, so the addition is needed even if a WDF is not added.
> > 2. Add commentary about the impact of
> autoGeneratePhraseQueries=true/false - in terms of use case examples, as
> above. Specifically note the ones that will break if the feature is
> disabled.
> > Another, more controversial change will be:
> > 3. Change text_general to autoGeneratePhraseQueries=true so that English
> will be treated reasonably by default. I suspect that most European
> languages will be at least "okay". A comment will note that this field
> attribute should be removed or set to false for non-whitespace languages,
> or that an alternative field type should be used. I suspect that the first
> thing any non-whitespace language application will want to do is pick the
> text field type that has analysis that makes the most sense for them, so I
> see no need to mess up English for no good reason.
> > Make no mistake, #3 is the primary and only real goal of this OOTB
> > improvement. Maybe "text_general" could be kept as is for reference as
> the purported "general" text field type (except that it doesn't work well
> for English, as shown above), and maybe there should be a "text_default"
> that I would propose should be a literal copy of text_en with commentary to
> direct users to the other choices for language.
> > I would note that text_ja already has autoGeneratePhraseQueries=false,
> so I'm not sure why the default in the TextField code had to be changed to
> false. Any languages for which automatic phrase query generation is
> problematic should be attributed similarly. But, now that it is wired into
> the schema defaults, we may be stuck with it.
> > I was rather surprised that SOLR-2519 actually changed the default in
> TextField rather than simply set the attribute as appropriate for the
> various text field types.
> > There are probably also a couple of places in the wikis where the
> surprising behavior should be noted. There is literally no wiki
> documentation for this important feature. There are only two references to
> autoGeneratePhraseQueries, with no discussion of exactly what this feature
> does or what the downside is if it is disabled.
> > In the past, there was no need to document the treatment of embedded
> word delimiters (well, okay, the poor handling for non-whitespace languages
> SHOULD have been documented), but now there is no documentation of the
> degradation of what was a default and implicit feature that a lot of people
> assume should be automatic.
> > And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
> kinds of use cases that unsuspecting users may not realize were BROKEN by
> the commit of SOLR-2519 that is masked under the innocent phrasing of
> "improve defaults for text_* field types". How many users seriously
> understood that queries with embedded dashes and commas would behave
> differently as a result of that change?
> > I am contemplating whether to suggest that the WordDelimiterFilter
> should also be part of the default text field type. Right now, it is hidden
> off in text_en_splitting.
> > I think stemming should also be part of the default English field type.
> The whole point of the "example" schema is to show-off the best of
> Lucene/Solr.
> > I'm not quite ready to propose that English be the default language
> supported by the example schema, but I am 99.999% certain that we should
> focus it on European, Roman, Latin languages. Non-European languages are
> indeed important, and should probably have their own schema. text_general
> was a good idea, but in hindsight it appears to have not been such a great
> idea in light of the word-splitting problems I have highlighted above.
> > Maybe I would propose that text_general be left as is, but that we add
> text_default which is a copy of text_en (which would have WDF and stemming
> added) and fields use text_default as their type. That way, it would be
> clear what is going on and users could sensibly see what needs to happen if
> they wish to switch default languages.
> > After discussion settles, a revised final proposal will be composed. And
> some specific and non-controversial issues may be split into separate Jira
> issues.
>
