[ 
https://issues.apache.org/jira/browse/LUCENE-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973460#comment-16973460
 ] 

ASF subversion and git services commented on LUCENE-9030:
---------------------------------------------------------

Commit eeea9fe2c7447a9f748d0881712c19328b21621c in lucene-solr's branch 
refs/heads/branch_8x from Christoph Büscher
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=eeea9fe ]

LUCENE-9030: Fix different Solr- and WordnetSynonymParser behaviour (#981)

This fixes an issue where sets of equivalent synonyms in the Wordnet format are
parsed and added to the SynonymMap in a way that leads to the original input
token being typed as SYNONYM instead of "word". In addition, the original token
doesn't appear first in the token stream output, as it does for equivalent
Solr-formatted synonym files.
Currently the WordnetSynonymParser adds all combinations of input/output pairs
of a synset entry to the synonym map, while the SolrSynonymParser excludes
those where the input and output terms are the same. This change adds the same
behaviour to WordnetSynonymParser and adds tests showing that the two formats
now output the same token order and types.
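
The difference between the two pairing strategies described above can be
illustrated with a minimal, Lucene-free sketch. The class SynsetPairs and its
expand method are hypothetical names invented for this illustration; they are
not part of the Lucene API and do not reproduce the actual parser code. The
sketch only models which input/output pairs end up in the map when identity
pairs are kept or skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: models which input/output pairs a synset
// expands to. It does not use or reproduce Lucene's parser classes.
public class SynsetPairs {

    // Returns every "input -> output" pair for the given synset. When
    // includeIdentity is false, pairs whose input equals their output
    // (e.g. "forest -> forest") are skipped.
    public static List<String> expand(List<String> synset, boolean includeIdentity) {
        List<String> pairs = new ArrayList<>();
        for (String input : synset) {
            for (String output : synset) {
                if (includeIdentity || !input.equals(output)) {
                    pairs.add(input + " -> " + output);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> synset = Arrays.asList("woods", "wood", "forest");
        // All combinations, including identity pairs: 3 * 3 = 9 entries.
        System.out.println(expand(synset, true));
        // Identity pairs excluded: 3 * 2 = 6 entries.
        System.out.println(expand(synset, false));
    }
}
```

With identity pairs excluded, "forest" never maps to itself, so in this model
the original term is not re-emitted as one of its own synonyms.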


> Solr- and WordnetSynonymParser behaviour differs
> ------------------------------------------------
>
>                 Key: LUCENE-9030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9030
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.2
>            Reporter: Christoph Büscher
>            Assignee: Alan Woodward
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Equivalent synonyms are showing up with different token types and ordering 
> depending on whether the Solr format or the Wordnet format is used. A synonym 
> set like
> "woods, wood, forest" in Solr format leads to the following token stream 
> (term and type) when analyzing the term "forest":  
> "forest"/word, "woods"/SYNONYM, "wood"/SYNONYM
>  
> The following set in Wordnet format should give the same output (all terms 
> are in the same synset), however all tokens are of type SYNONYM here and the 
> original input token "forest" isn't the first one:
> synonyms.txt:
> {code:java}
> s(100000001,1,'woods',n,1,0)
> s(100000001,2,'wood',n,1,0)
> s(100000001,3,'forest',n,1,0){code}
> Token stream (term/type) when analyzing the term "forest":
> "woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM
> I don't think this is intentional, and it is confusing (especially because the 
> "original" input token type gets lost). I saw that the way the synsets are 
> added to the SynonymMap differs between the respective parsers, and I have a 
> PR that changes this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

