[ https://issues.apache.org/jira/browse/LUCENE-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973396#comment-16973396 ]
Alan Woodward commented on LUCENE-9030:
---------------------------------------

Thanks for opening this fix, [~cbuescher] - I'm just running precommit now and will merge it in once that check passes.

> Solr- and WordnetSynonymParser behaviour differs
> ------------------------------------------------
>
>                 Key: LUCENE-9030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9030
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.2
>            Reporter: Christoph Büscher
>            Assignee: Alan Woodward
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Equivalent synonyms show up with different token types and ordering
> depending on whether the Solr format or the Wordnet format is used. A synonym
> set like "woods, wood, forest" in Solr format leads to the following token
> stream (term/type) when analyzing the term "forest":
>
> "forest"/word, "woods"/SYNONYM, "wood"/SYNONYM
>
> The following set in Wordnet format should give the same output (all terms
> are in the same synset); however, all tokens are of type SYNONYM here and the
> original input token "forest" isn't the first one:
>
> synonyms.txt:
> {code:java}
> s(100000001,1,'woods',n,1,0)
> s(100000001,2,'wood',n,1,0)
> s(100000001,3,'forest',n,1,0){code}
>
> Token stream (term/type) when analyzing the term "forest":
>
> "woods"/SYNONYM, "wood"/SYNONYM, "forest"/SYNONYM
>
> I don't think this is intentional, and it is confusing (especially because the
> "original" input token type gets lost). I saw that the way the synsets are
> added to the SynonymMap in the respective parsers differs and have a PR that
> changes this.
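
To make the two token streams in the report easier to compare, here is a minimal, self-contained sketch of the difference. It is a toy model only: it does not use Lucene, and the class `SynonymExpansionSketch`, the method `expand`, and its `includeOrig` flag are hypothetical names chosen for illustration, not the parsers' actual code. It assumes the only difference is whether the input term is kept as a "word" token ahead of its synonyms or replaced by the whole synset in file order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SynonymExpansionSketch {

    /**
     * Hypothetical helper (not a Lucene API): returns "term/type" strings
     * for a single input term matched against one synset.
     *
     * includeOrig = true  models the output reported for the Solr format:
     *   the input stays first with type "word", the other terms follow as SYNONYM.
     * includeOrig = false models the output reported for the Wordnet format:
     *   every term of the synset is emitted as SYNONYM, in file order.
     */
    static List<String> expand(String input, List<String> synset, boolean includeOrig) {
        List<String> tokens = new ArrayList<>();
        if (!synset.contains(input)) {
            // No synonym rule applies, so the input passes through unchanged.
            tokens.add(input + "/word");
            return tokens;
        }
        if (includeOrig) {
            tokens.add(input + "/word");
        }
        for (String term : synset) {
            if (includeOrig && term.equals(input)) {
                continue; // already emitted as the original token
            }
            tokens.add(term + "/SYNONYM");
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> synset = Arrays.asList("woods", "wood", "forest");
        // prints [forest/word, woods/SYNONYM, wood/SYNONYM]
        System.out.println(expand("forest", synset, true));
        // prints [woods/SYNONYM, wood/SYNONYM, forest/SYNONYM]
        System.out.println(expand("forest", synset, false));
    }
}
```

The first call reproduces the stream quoted for the Solr format and the second the stream quoted for the Wordnet format, which is the discrepancy the issue describes.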