Re: SynonymGraphFilter followed by StopFilter

Andrea Gazzarini Thu, 26 Jul 2018 08:42:17 -0700

Hi Walter,

many thanks for the response. Without any constraints at all, I would agree with you; from your message I clearly understand that your experience is greater than mine. My 2 cents inline below:

> Move the synonym filter to the index analyzer chain. That provides better performance and avoids some surprising relevance behavior. With synonyms at query time, you’ll see different idf for terms in the synonym set, with the rare variant scoring higher. That is probably the opposite of what is expected.

Unfortunately, moving the synonym filter to the index analyzer is not an option: the project I'm working on has a huge index, and the synonyms list is something that (at least at this stage) changes frequently; re-indexing everything from scratch each time a change occurs is a big problem. On the other hand, the IDF issue you mention hasn't produced many unwanted effects, at least until now. But I got the point, thanks for the hint.
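
To make the idf point concrete, here is a minimal sketch. It assumes the BM25 idf formula used by Lucene's BM25Similarity; the index size and the document frequencies are made-up numbers, not measurements from any real index.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical index size and document frequencies for two members of one
# synonym set: a common variant and a rare one.
DOC_COUNT = 1_000_000
DOC_FREQ = {"warranty": 50_000, "oow": 500}

idf = {term: bm25_idf(df, DOC_COUNT) for term, df in DOC_FREQ.items()}
for term, value in idf.items():
    print(f"{term}: idf={value:.3f}")
```

With query-time expansion each variant is scored with its own document frequency, so under these assumed numbers a document containing only the rare "oow" gets a higher idf contribution than one containing the common variant.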

> Also, phrase synonyms just don’t work at query time because the terms are parsed into individual tokens by the query parser, not the tokenizer.

Here I don't get you: using the SynonymGraphFilter + SplitOnWhiteSpace=false + AutoGeneratePhraseQueries I get the synonym phrasing working correctly (see the first example in my email).

> Don’t use stop words. Just remove that line. Removing stop words is a performance and space hack that was useful in the 1960’s, but causes problems now. I’ve never used stop word removal and I started in search with Infoseek in 1996. Stop word removal is like a binary idf, ignoring common words. Since we have idf, we can give a lower score to common words and keep them in the index.

And this is, as I see it, something that has animated long discussions about using or avoiding stopwords. I will check your suggestion and what it would mean to apply that approach to my project, but in the meantime I think, also looking at the JIRA issues Alan pointed to in his answer, that the issue is there and it's real; I mean, it is something that doesn't work as expected (my use case, as far as I understand, is just an example, because the problem is broader and is related to the FilteringTokenFilter).


Thanks again,
Andrea

On 26/07/18 16:59, Walter Underwood wrote:

Move the synonym filter to the index analyzer chain. That provides better performance and avoids some surprising relevance behavior. With synonyms at query time, you’ll see different idf for terms in the synonym set, with the rare variant scoring higher. That is probably the opposite of what is expected.
Also, phrase synonyms just don’t work at query time because the terms are parsed into individual tokens by the query parser, not the tokenizer.
Don’t use stop words. Just remove that line. Removing stop words is a performance and space hack that was useful in the 1960’s, but causes problems now. I’ve never used stop word removal and I started in search with Infoseek in 1996. Stop word removal is like a binary idf, ignoring common words. Since we have idf, we can give a lower score to common words and keep them in the index.
Do those two things and it should work as you expect.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)
On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <[email protected]> wrote:
Hi Alan, thanks for the response and thank you very much for the pointers


On 26/07/18 12:16, Alan Woodward wrote:
Hi Andrea,
This is a long-standing issue: see https://issues.apache.org/jira/browse/LUCENE-4065 and https://issues.apache.org/jira/browse/LUCENE-8250 for discussion. I don’t think we’ve reached a consensus on how to fix it yet, but more examples are good.
Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens will work, because then you’ll generate queries that always fail - they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed because it’s removed by the StopFilter at index time.
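
The failure mode described above can be sketched with a toy positional index. This is a simplified model, not Lucene code: it assumes the index analyzer also removes stop words while keeping the positions of the surviving terms, and the sample document and helper names are hypothetical.

```python
def index_positions(text, stopwords=frozenset()):
    """Map each kept term to its positions; removed stop words leave a gap."""
    positions = {}
    for pos, term in enumerate(text.lower().split()):
        if term not in stopwords:
            positions.setdefault(term, []).append(pos)
    return positions

def phrase_matches(positions, phrase):
    """Exact phrase match; a None entry stands for a hole (any term).

    Assumes the first entry of `phrase` is a real term, not a hole.
    """
    for start in positions.get(phrase[0], []):
        if all(term is None or start + offset in positions.get(term, [])
               for offset, term in enumerate(phrase)):
            return True
    return False

doc = "my tv is out of warranty"  # hypothetical document
with_stop = index_positions(doc, stopwords={"of"})
without_stop = index_positions(doc)

print(phrase_matches(with_stop, ["out", "of", "warranty"]))     # False
print(phrase_matches(with_stop, ["out", None, "warranty"]))     # True
print(phrase_matches(without_stop, ["out", "of", "warranty"]))  # True
```

In this model a phrase that still contains 'of' can only match when the index keeps 'of' as well.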
- Alan
On 26 Jul 2018, at 08:04, Andrea Gazzarini <[email protected]> wrote:
Hi,
I have the following field type definition:
<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
     <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
     </analyzer>
</fieldtype>
Where synonyms and stopwords are defined as follows:

synonyms = out of warranty,oow
stopwords = of

Running the following query:

q=my tv went out *of* warranty something *of*

I get wrong results, with the following explain:
title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? warranty something"))*
That is, the synonym is correctly detected and I see that the graph information is correctly reported in the positionLength, but it seems to be wrongly interpreted by the QueryParser. I guess the reason is the "of" removal operated by the StopFilter, which
  * removes the "of" term within the phrase (I wouldn't want that)
  * creates a "hole" in the span defined by the "oow" term, which
    has been marked as a synonym with a positionLength = 3,
    therefore including the next available term (something).
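
The two effects listed above can be reproduced with a small simulation. It is a simplified model rather than the actual Lucene classes: tokens carry only a term, a position increment and a position length, and `to_arcs` assumes that the query builder advances by at most one position per token when it builds the graph (so holes left by removed tokens are not counted). That assumption, and all function names, are illustrative.

```python
from typing import NamedTuple

class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment
    pos_len: int  # position length

def stop_filter(tokens, stopwords):
    """Drop stop words, adding their increment to the next kept token."""
    kept, pending = [], 0
    for tok in tokens:
        if tok.term in stopwords:
            pending += tok.pos_inc
        else:
            kept.append(tok._replace(pos_inc=tok.pos_inc + pending))
            pending = 0
    return kept

def to_arcs(tokens):
    """Assumed graph model: advance by at most 1 per token, ignoring holes."""
    arcs, pos = [], -1
    for tok in tokens:
        pos += min(1, tok.pos_inc)
        arcs.append((pos, pos + tok.pos_len, tok))
    return arcs

def terms_under(arcs, synonym):
    """Tokens whose arcs lie inside the span covered by the synonym."""
    start, end = next((s, e) for s, e, t in arcs if t.term == synonym)
    return [t for s, e, t in arcs
            if start <= s and e <= end and t.term != synonym]

def render_phrase(path):
    """Join terms, printing '?' for each skipped position."""
    parts = []
    for i, tok in enumerate(path):
        if i > 0:
            parts.extend(["?"] * (tok.pos_inc - 1))
        parts.append(tok.term)
    return " ".join(parts)

# Assumed output of the synonym filter for the query in this email:
# "oow" is stacked on "out" and spans three positions.
tokens = [
    Token("my", 1, 1), Token("tv", 1, 1), Token("went", 1, 1),
    Token("oow", 1, 3), Token("out", 0, 1), Token("of", 1, 1),
    Token("warranty", 1, 1), Token("something", 1, 1), Token("of", 1, 1),
]

before = render_phrase(terms_under(to_arcs(tokens), "oow"))
after = render_phrase(terms_under(to_arcs(stop_filter(tokens, {"of"})), "oow"))
print(before)  # out of warranty
print(after)   # out ? warranty something
```

Under this model, removing "of" shortens the path that runs under "oow" by one token while "oow" keeps positionLength = 3, so the span swallows "something".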
I tried to change the StopFilter in order to ignore stopwords that are marked as SYNONYM or that are part of a previous synonym span, and it works: it correctly produces the following query:
title:my title:tv title:went *(title:oow PhraseQuery(title:"out of warranty"))* title:something
So I'd like to ask your opinion about this. Am I missing something? Do you think it's better to open a JIRA issue? If the solution is a graph-aware stop filter, do you think it's better to change the existing filter or to subclass it?
Best,
Andrea
