[jira] [Commented] (SOLR-3085) Fix the dismax/edismax stopwords mm issue

JIRA Fri, 20 Dec 2013 05:07:19 -0800

    [ https://issues.apache.org/jira/browse/SOLR-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853940#comment-13853940 ]


Jan Høydahl commented on SOLR-3085:
-----------------------------------

bq. Environments without stopwords still have a problem with mm. Consider your 
q=A horse in a stable. With mm=2 we get all kinds of documents, usually all 
documents in the corpus (in and a). Ideally this or another parameter would 
only require horse and stable.

The mm.autoRelax param is designed to tackle one of the most common situations, 
where your qf includes a bunch of "text" fields with stopword removal plus one 
or more "string" fields like "id" or "tags". Take the example of {{qf=title 
body tags}} where title and body remove stopwords but tags does not. With 
mm=100%, the example query would translate to something like

{code}
(DMQ(tags:a) DMQ(title:horse | body:horse | tags:horse) DMQ(tags:in) 
DMQ(tags:a) DMQ(title:stable | body:stable | tags:stable))~5
{code}

Very often in these cases the "tags" field does not contain free text, so 
tags:a and tags:in would not match, and since all five clauses are required we 
always get 0 hits. Relaxing to mm=2, the number of clauses that span all 
fields, would help here.
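
To make the clause counting concrete, here is a minimal Python sketch. It is a toy model, not Solr code: the stopword list, the helper names, and the relaxation rule (count only clauses that survive in every field) are assumptions for illustration, and the actual patch may count clauses differently.

```python
# Toy model of per-field stopword removal in (e)dismax; not Solr code.
# Assumption: title and body share one stopword list, tags is a plain string field.
FIELD_STOPWORDS = {
    "title": {"a", "in"},
    "body": {"a", "in"},
    "tags": set(),
}


def build_clauses(query, field_stopwords):
    """Return one (term, fields) clause per query term that survives in at least one field."""
    clauses = []
    for term in query.lower().split():
        fields = [f for f, stops in field_stopwords.items() if term not in stops]
        if fields:  # a term removed from every field yields no clause at all
            clauses.append((term, fields))
    return clauses


def required_clauses(clauses, field_stopwords, auto_relax):
    """Number of clauses that must match when mm=100%."""
    if not auto_relax:
        return len(clauses)
    # Assumed relaxation rule: only clauses present in every qf field count.
    all_fields = set(field_stopwords)
    return sum(1 for _, fields in clauses if set(fields) == all_fields)


clauses = build_clauses("A horse in a stable", FIELD_STOPWORDS)
strict = required_clauses(clauses, FIELD_STOPWORDS, auto_relax=False)
relaxed = required_clauses(clauses, FIELD_STOPWORDS, auto_relax=True)
print(strict, relaxed)
```

In this model the strict count is 5 (matching the ~5 above) and the relaxed count is 2, since only horse and stable produce a clause in all three fields.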

But for cases where you query multiple English-analyzed text fields with 
different stopword lists, relaxation of mm is not the cure. The cure is rather 
to add the same stopword handling to all those text fieldTypes.

Clearly mm.autoRelax is not a complete solution for all mm issues. For other 
cases we may need other cures. One idea I thought of the other day is a param 
{{mergeStopwords=true}}, which would modify the "query" analysis chain of each 
field in {{qf}} to include the StopFilters of all the {{qf}} fields. I.e. 
if my field A has {{stopwords="a.txt"}} and field B has {{stopwords="b.txt"}}, 
then edismax would add those two stopword filters in a row for both fields, 
much the same way that edismax removes the StopFilter when doing smart stopword 
handling.
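
As a rough illustration of the {{mergeStopwords}} idea (a proposal only, not an existing Solr parameter), the Python sketch below merges the stopword lists of all fields and applies the union to every field. The field names, list contents, and the trivial whitespace "analyzer" are hypothetical stand-ins for a.txt, b.txt, and the real analysis chains.

```python
# Sketch of the proposed mergeStopwords behaviour; not an existing Solr feature.
# Hypothetical stand-ins for the contents of a.txt (field A) and b.txt (field B).
FIELD_STOPWORDS = {
    "A": {"a", "the"},
    "B": {"in", "of"},
}


def merge_stopwords(field_stopwords):
    """Union of all lists, which has the same effect as chaining every StopFilter."""
    merged = set()
    for stops in field_stopwords.values():
        merged |= stops
    return merged


def analyze(query, stopwords):
    """Minimal stand-in for query analysis: lowercase, split on whitespace, drop stopwords."""
    return [t for t in query.lower().split() if t not in stopwords]


merged = merge_stopwords(FIELD_STOPWORDS)
per_field_terms = {
    field: analyze("A horse in a stable", merged) for field in FIELD_STOPWORDS
}
print(per_field_terms)
```

With the merged list every field yields the same terms (horse and stable), so each surviving term produces a clause across all fields and mm counts stay consistent.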

> Fix the dismax/edismax stopwords mm issue
> -----------------------------------------
>
>                 Key: SOLR-3085
>                 URL: https://issues.apache.org/jira/browse/SOLR-3085
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>              Labels: MinimumShouldMatch, dismax, edismax, stopwords
>             Fix For: 5.0, 4.7
>
>         Attachments: SOLR-3085.patch, SOLR-3085.patch, SOLR-3085.patch
>
>
> As discussed here http://search-lucene.com/m/Wr7iz1a95jx and here 
> http://search-lucene.com/m/Yne042qEyCq1 and here 
> http://search-lucene.com/m/RfAp82nSsla DisMax has an issue with stopwords if 
> not all fields used in QF have exactly the same stopword lists.
> The typical solution is to not use stopwords, to harmonize stopword lists 
> across all fields in your QF, or to relax the MM to a lower percentage. 
> Sometimes these are not acceptable workarounds, and we should find a better 
> solution.



