[jira] [Commented] (SOLR-10423) ShingleFilter causes overly restrictive queries to be produced

Steve Rowe (JIRA) Tue, 04 Apr 2017 14:46:01 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955912#comment-15955912
 ]


Steve Rowe commented on SOLR-10423:
-----------------------------------

I think the fix for this problem is to expose 
{{QueryBuilder.setEnableGraphQueries()}} on Solr field types, in the same way 
that the {{autoGeneratePhraseQueries}} option is now.

Before 6.5, the first version of Solr to include the {{sow=false}} option, 
it wasn't possible to construct multi-term queries using ShingleFilter, because 
Solr's query parser always split on whitespace before performing analysis, so 
the analyzer only ever saw one term at a time.
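
To illustrate why per-term analysis defeats shingling, here is a minimal 
plain-Java sketch. It does not use Lucene: {{ShingleSketch}} and its 
{{shingles()}} helper are hypothetical names that only imitate a shingle filter 
with unigram output disabled, under the assumption that a lone token yields no 
shingles at all.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    /**
     * Joins every run of minSize..maxSize adjacent tokens with the separator.
     * Single tokens are never emitted (imitates outputUnigrams=false).
     */
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize, String sep) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(sep, tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> query = List.of("A", "B", "C");

        // Splitting on whitespace first: each term is analyzed alone,
        // so there is never a neighbor to build a shingle with.
        for (String term : query) {
            System.out.println(term + " -> " + shingles(List.of(term), 2, 3, "_"));
        }

        // Analyzing the whole query text at once: shingles can span terms.
        System.out.println("A B C -> " + shingles(query, 2, 3, "_"));
    }
}
```

In this sketch each single-term call returns an empty list, while the 
whole-text call returns A_B, A_B_C and B_C, the same three terms that appear 
in the expected query in the unit test.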

The following Lucene unit test succeeds for me.  I added it to the queryparser 
module's {{TestQueryParser.java}}, after adding a test dependency on the 
analysis-common module; it calls {{QueryBuilder.setEnableGraphQueries(false)}}.  
When I change the test to call {{assertQueryEquals()}} instead (which leaves 
graph queries enabled, the default), the test fails with this assertion error: 
{{Query /A B C/ yielded /(+A_B +B_C) A_B_C/, expecting 
/Synonym(A_B A_B_C) B_C/}}.

{code:java}
  public void testShinglesSplitOnWhitespace() throws Exception {
    // Whitespace tokenizer followed by 2- and 3-token shingles joined by "_", no unigrams.
    Analyzer a = new Analyzer() {
      @Override protected TokenStreamComponents createComponents(String s) {
        Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
        ShingleFilter tokenStream = new ShingleFilter(tokenizer, 2, 3);
        tokenStream.setTokenSeparator("_");
        tokenStream.setOutputUnigrams(false);
        return new TokenStreamComponents(tokenizer, tokenStream);
      }
    };
    boolean oldSplitOnWhitespace = splitOnWhitespace;
    splitOnWhitespace = false;
    try {
      assertQueryEqualsNoGraph("A B C", a, "Synonym(A_B A_B_C) B_C");
    } finally {
      // Restore the previous setting even if the assertion fails.
      splitOnWhitespace = oldSplitOnWhitespace;
    }
  }

  // Parses the query with graph queries disabled and compares its string form to the expected result.
  public void assertQueryEqualsNoGraph(String query, Analyzer a, String result) throws Exception {
    QueryParser parser = getParser(a);
    parser.setEnableGraphQueries(false);
    Query q = parser.parse(query);
    String s = q.toString("field");
    if (!s.equals(result)) {
      fail("Query /" + query + "/ yielded /" + s + "/, expecting /" + result + "/");
    }
  }
{code}
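
To make the difference between the two query strings concrete, the following 
plain-Java sketch evaluates both shapes against a set of indexed shingle terms. 
It is only an illustration of the boolean semantics, assuming top-level clauses 
are optional (default OR operator) and ignoring scoring and {{mm}}; the class 
and method names are hypothetical and nothing here calls Lucene.

```java
import java.util.Set;

public class ShingleQueryShapes {

    /** Shape produced with graph queries enabled: (+A_B +B_C) A_B_C */
    public static boolean matchesGraphForm(Set<String> docTerms) {
        boolean bothBigrams = docTerms.contains("A_B") && docTerms.contains("B_C");
        boolean trigram = docTerms.contains("A_B_C");
        return bothBigrams || trigram;
    }

    /** Shape produced with graph queries disabled: Synonym(A_B A_B_C) B_C */
    public static boolean matchesFlatForm(Set<String> docTerms) {
        boolean synonym = docTerms.contains("A_B") || docTerms.contains("A_B_C");
        return synonym || docTerms.contains("B_C");
    }

    public static void main(String[] args) {
        // A document whose text is "A B" indexes the single shingle A_B.
        Set<String> doc = Set.of("A_B");
        System.out.println("graph form matches: " + matchesGraphForm(doc));
        System.out.println("flat form matches:  " + matchesFlatForm(doc));
    }
}
```

Under these assumptions the graph form rejects the "A B" document because it 
requires both bigrams (or the full trigram), while the flat form accepts it. 
This mirrors the reported case where "one plus one" is not found for the query 
"one plus one four".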

> ShingleFilter causes overly restrictive queries to be produced
> --------------------------------------------------------------
>
>                 Key: SOLR-10423
>                 URL: https://issues.apache.org/jira/browse/SOLR-10423
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Steve Rowe
>
> When {{sow=false}} and {{ShingleFilter}} is included in the query analyzer, 
> {{QueryBuilder}} produces queries that inappropriately require sequential 
> terms.  E.g. the query "A B C" produces {{(+A_B +B_C) A_B_C}} when the query 
> analyzer includes {{<filter class="solr.ShingleFilterFactory" 
> maxShingleSize="3" outputUnigrams="false" tokenSeparator="_"/>}}.
> Aman Deep Singh reported this problem on the solr-user list. From 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201703.mbox/%3ccanegtx9bwbpwqc-cxieac7qsas7x2tgzovomy5ztiagco1p...@mail.gmail.com%3e]:
> {quote}
> I was trying to use the shingle filter, but it was not creating the query as
> desired.
> My schema is
> {noformat}
> <fieldType name="cust_shingle" class="solr.TextField" 
> positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ShingleFilterFactory" outputUnigrams="false" 
> maxShingleSize="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> <field name="nameShingle" type="cust_shingle" indexed="true" stored="true"/>
> {noformat}
> my solr query is
> {noformat}
> http://localhost:8983/solr/productCollection/select?
>  defType=edismax
> &debugQuery=true
> &q=one%20plus%20one%20four
> &qf=nameShingle
> &sow=false
> &wt=xml
> {noformat}
> and it was creating the parsed query as
> {noformat}
> <str name="parsedquery">
> (+(DisjunctionMaxQuery(((+nameShingle:one plus +nameShingle:plus one
> +nameShingle:one four))) DisjunctionMaxQuery(((+nameShingle:one plus
> +nameShingle:plus one four))) DisjunctionMaxQuery(((+nameShingle:one plus one 
> +nameShingle:one four))) DisjunctionMaxQuery((nameShingle:one plus one 
> four)))~1)/no_coord
> </str>
> <str name="parsedquery_toString">
> *+((((+nameShingle:one plus +nameShingle:plus one +nameShingle:one four))
> ((+nameShingle:one plus +nameShingle:plus one four)) ((+nameShingle:one
> plus one +nameShingle:one four)) (nameShingle:one plus one four))~1)*
> </str>
> {noformat}
> So the token creation is fine, but the query uses the boolean + operator, 
> which causes the problem: if I have a document named "one plus one", it 
> should match according to the shingles, since its tokens will be 
> ("one plus", "one plus one", "plus one").
> I have tried using q.op and played around with mm as well, but nothing
> gives me the correct response.
> Any idea how I can fetch that document even if the document is missing any
> token?
> My expected response is to get the document "one plus one" even when the 
> user query has additional terms, like "one plus one two" and so on.
> {quote}



