Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Steve Rowe Wed, 07 Feb 2018 05:10:01 -0800

Thanks Webster,

I created https://issues.apache.org/jira/browse/SOLR-11955 to work on this.


--
Steve
www.lucidworks.com

> On Feb 6, 2018, at 2:47 PM, Webster Homer <webster.ho...@sial.com> wrote:
> 
> I noticed that in some of the current example schemas that are shipped with
> Solr, there is a fieldtype, text_en_splitting, that feeds the output
> of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
> this isn't supported, the example should probably be updated or removed.
> 
> On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe <sar...@gmail.com> wrote:
> 
>> Hi Александр,
>> 
>>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> 
>>> There should be no problem with using them together.
>> 
>> I believe Shawn is wrong.
>> 
>> From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
>> 
>>> NOTE: this cannot consume an incoming graph; results will be undefined.
>> 
>> Unfortunately, the ref guide entry for Synonym Graph Filter
>> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
>> doesn’t include a warning about this, but it should, like the warning on
>> Word Delimiter Graph Filter
>> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
>> 
>>> Note: although this filter produces correct token graphs, it cannot
>>> consume an input token graph correctly.
>> 
>> (I’ve just committed a change to the ref guide source to add this also on
>> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
>> included in the ref guide for Solr 7.3.)
>> 
>> In short, the combination of the two filters is not supported, because
>> WDGF produces a token graph, which SGF cannot correctly interpret.
>> 
>> Other filters also have this issue; see e.g.
>> <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter. This
>> issue has gotten some attention recently, and hopefully it will inspire
>> fixes elsewhere.
>> 
>> Patches welcome!
>> 
>> --
>> Steve
>> www.lucidworks.com
>> 
>> 
>>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> 
>>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
>>>> 
>>>> Hi, I have a question about the usage of SynonymGraphFilterFactory
>>>> and WordDelimiterGraphFilterFactory. Can they be used together?
>>>> 
>>> 
>>> There should be no problem with using them together.  But it is always
>>> possible that the behavior will surprise you, while working 100% as
>>> designed.
>>> 
>>>> I have a Solr field type configured as follows:
>>>> 
>>>> <fieldtype name="fulltext_en" class="solr.TextField"
>>>>            autoGeneratePhraseQueries="true">
>>>>  <analyzer type="index">
>>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
>>>>            generateWordParts="1" generateNumberParts="1"
>>>>            splitOnNumerics="1"
>>>>            catenateWords="1" catenateNumbers="1" catenateAll="0"
>>>>            preserveOriginal="1" protected="protwords_en.txt"/>
>>>>    <filter class="solr.FlattenGraphFilterFactory"/>
>>>>  </analyzer>
>>>>  <analyzer type="query">
>>>>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>    <filter class="solr.WordDelimiterGraphFilterFactory"
>>>>            generateWordParts="1" generateNumberParts="1"
>>>>            splitOnNumerics="1"
>>>>            catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>            preserveOriginal="1" protected="protwords_en.txt"/>
>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>    <filter class="solr.SynonymGraphFilterFactory"
>>>>            synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
>>>>  </analyzer>
>>>> </fieldtype>
>>>> 
>>>> So at query time it uses SynonymGraphFilterFactory after
>>>> WordDelimiterGraphFilterFactory.
>>>> Synonyms are configured as follows:
>>>> b=>b,boron
>>>> 2=>ii,2
>>>> 
>>>> In the Solr analysis tool, the terms after SGF are shown at
>>>> positions 3 and 4. Is that correct? I thought they should have
>>>> positions 1 and 2.
>>>> 
>>> 
>>> What matters is the *relative* positions.  The exact position number
>>> doesn't matter much.  Something new that the Graph implementations use
>>> is the position length.  That feature is necessary for multi-term
>>> synonyms to function correctly in phrase queries.
>>> 
>>> In your analysis screenshot, WDGF creates three tokens.  The two tokens
>>> created by splitting the input are at positions 1 and 2, which I think
>>> is 100% as expected.  It also sets the positionLength of the first term
>>> to 2, probably because it has split that term into 2 additional terms.
>>> 
>>> Then the SGF takes those last two terms and expands them.  Each of the
>>> synonyms is at the same position as the original term, and the relative
>>> positions of the two synonym pairs have not changed -- the second one is
>>> still one higher than the first.  I think the reason that SGF moves the
>>> positions two higher is because the positionLength on the "b2" term is
>>> 2, previously set by WDGF.  Someone with more knowledge about the Graph
>>> implementations may have to speak up as to whether this behavior is
>>> correct.
>>> 
>>> Because the relative positions of the split terms don't change when SGF
>>> runs, I think this is probably working as designed.
>>> 
>>> Thanks,
>>> Shawn
>> 
>> 
> 
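
As an illustration of the position arithmetic discussed in the quoted
thread, here is a minimal Python sketch. It is not Lucene code: the Token
class and both helper functions are hypothetical, and the position
increments are assumed from Shawn's description of the analysis output for
the input "b2" (the preserved original spanning two positions, with the
split parts at positions 1 and 2).

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions this token spans


def absolute_positions(tokens):
    """Return (term, start, end) triples, accumulating position increments."""
    result = []
    pos = 0
    for tok in tokens:
        pos += tok.pos_inc
        result.append((tok.term, pos, pos + tok.pos_len))
    return result


def is_graph(tokens):
    """Treat a stream as a graph if any token spans more than one position."""
    return any(tok.pos_len > 1 for tok in tokens)


# Assumed WDGF output for "b2" with preserveOriginal="1":
# the original token spans both split parts.
wdgf_output = [
    Token("b2", pos_inc=1, pos_len=2),
    Token("b", pos_inc=0, pos_len=1),
    Token("2", pos_inc=1, pos_len=1),
]

for term, start, end in absolute_positions(wdgf_output):
    print(f"{term}: start={start} end={end}")
print("graph input for the next filter:", is_graph(wdgf_output))
```

Under these assumptions the stream leaving WDGF already contains a token
with position length 2, which is the kind of incoming graph that the
SynonymGraphFilter javadoc says it cannot consume.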
