[jira] [Commented] (SOLR-11968) Multi-words query time synonyms

Jim Ferenczi (JIRA) Fri, 23 Feb 2018 01:38:21 -0800

    [ https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374145#comment-16374145 ]


Jim Ferenczi commented on SOLR-11968:
-------------------------------------

bq. Jim Ferenczi, sorry, based on feedback from Robert over on LUCENE-4065 (see 
my comments there), I no longer see how your "twd/the walking dead" example 
represents graph corruption. Can you say more about why you call it corruption?

[~steve_rowe] what I mean by corruption is that we need to "infer" the holes 
when we build the graph from the token stream. In some cases we are able to 
reconstruct the correct graph (see how TokenStreamToAutomaton handles dead 
states), but there are cases where it is not possible. Here is another 
example:

{code:java|title=TestStopFilterFactory.java}
public void testLeadingStopwordSynonymGraph() throws Exception {
  SynonymMap.Builder builder = new SynonymMap.Builder(true);
  builder.add(new CharsRef("twd"), new CharsRef("the\u0000walking\u0000dead"), true);
  builder.add(new CharsRef("twd"), new CharsRef("the\u0000man"), false);
  final SynonymMap synonymMap = builder.build();

  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      MockTokenizer tokenizer = new MockTokenizer();
      TokenStream stream = new SynonymGraphFilter(tokenizer, synonymMap, true);
      stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("the")));
      return new TokenStreamComponents(tokenizer, stream);
    }
  };

  TokenStream tokenStream = analyzer.tokenStream("field", "twd");
  assertTokenStreamContents(tokenStream,
      new String[] { "twd", "walking", "dead", "man" },
      null, null,
      new int[] { 1, 1, 1, 1 }, // posinc
      new int[] { 4, 1, 2, 1 }, // poslen
      null);
}
{code}

In this case "walking" and "man" appear on the same path, so the graph contains 
"twd", "walking dead" and "walking man".
The token stream is not corrupted, but the graph is wrong and I don't see how we 
can "fix" it outside of the stop filter.
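
To make the "holes" concrete, here is a minimal sketch that turns the posinc/poslen arrays from the test above into (from, to) node pairs and lists the start nodes that no token arrives at. It is plain Java with no Lucene dependency; the class name TokenGraphHoles and the method findHoles are hypothetical names used only for illustration, and the sketch does not try to reproduce what TokenStreamToAutomaton actually does when it bridges such holes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene.
public class TokenGraphHoles {

  /**
   * Returns, in ascending order, the start positions that no token arrives at.
   * Position 0 is treated as the start of the graph and is never a hole.
   */
  static List<Integer> findHoles(int[] posInc, int[] posLen) {
    Set<Integer> arrivals = new HashSet<>();
    arrivals.add(0);
    Set<Integer> departures = new TreeSet<>();
    int pos = -1; // same convention as Lucene: the first posInc of 1 gives position 0
    for (int i = 0; i < posInc.length; i++) {
      pos += posInc[i];
      departures.add(pos);            // node the token leaves from
      arrivals.add(pos + posLen[i]);  // node the token arrives at
    }
    List<Integer> holes = new ArrayList<>();
    for (int from : departures) {
      if (!arrivals.contains(from)) {
        holes.add(from);
      }
    }
    return holes;
  }

  public static void main(String[] args) {
    String[] terms = { "twd", "walking", "dead", "man" };
    int[] posInc = { 1, 1, 1, 1 };
    int[] posLen = { 4, 1, 2, 1 };

    int pos = -1;
    for (int i = 0; i < terms.length; i++) {
      pos += posInc[i];
      System.out.println(terms[i] + ": " + pos + " -> " + (pos + posLen[i]));
    }
    System.out.println("holes: " + findHoles(posInc, posLen));
  }
}
```

With these arrays, "walking" starts at node 1 and "man" starts at node 3, and no token arrives at either node. A consumer has to guess which earlier node each of them connects to, and that guess is what determines the paths in the resulting graph.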

> Multi-words query time synonyms
> -------------------------------
>
>                 Key: SOLR-11968
>                 URL: https://issues.apache.org/jira/browse/SOLR-11968
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers, Schema and Analysis
>    Affects Versions: master (8.0), 6.6.2
>         Environment: Centos 7.x
>            Reporter: Dominique Béjean
>            Assignee: Steve Rowe
>            Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the 
> SynonymGraphFilterFactory filter, as explained in this article
>  
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>   
>  My field type is :
> {code:java}
> <fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ElisionFilterFactory" ignoreCase="true" 
>              articles="lang/contractions_fr.txt"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.ASCIIFoldingFilterFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        <filter class="solr.FrenchMinimalStemFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ElisionFilterFactory" ignoreCase="true" 
>              articles="lang/contractions_fr.txt"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
>              ignoreCase="true" expand="true"/>
>        <filter class="solr.ASCIIFoldingFilterFactory"/>
>        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>        <filter class="solr.FrenchMinimalStemFilterFactory"/>
>      </analyzer>
>    </fieldType>{code}
>  
>  synonyms.txt contains the line :
> {code:java}
> om, olympique de marseille{code}
>  
>  stopwords.txt contains the word 
> {code:java}
> de{code}
>  
>  The order of words in my query has an impact on the generated query in 
> edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
>  &sow=false
>  &qq=...{code}
> With "qq=om maillot" or "qq=olympique de marseille maillot", I can see the 
> synonym expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil +name_text_gp:maillot) name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> With "qq=maillot om" or "qq=maillot olympique de marseille", I see the 
> same generated query for both
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
>  "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the 
> synonym expansion is ignored, but the second one shows it is not ignored and 
> only the synonym term is used.
>   
>  When I test the analysis for the field type, the synonyms are correctly 
> expanded for both expressions
> {code:java}
> om maillot  
>  maillot om
>  olympique de marseille maillot
>  maillot olympique de marseille{code}
> The resulting outputs always include the following terms (obviously not always 
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>  
>  So, I suspect an issue with the edismax query parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
