Robert Muir created LUCENE-6400:
-----------------------------------
Summary: SynonymParser should encode 'expand' correctly.
Key: LUCENE-6400
URL: https://issues.apache.org/jira/browse/LUCENE-6400
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir
Today SolrSynonymParser encodes a rule like "A, B, C" with 'expand=true' like
this:
A -> A, B, C (includeOrig=false)
B -> B, A, C (includeOrig=false)
C -> C, A, B (includeOrig=false)
This gives buggy output: synfilter sees every rule as a replacement, so all the
terms come out with type SYNONYM, positionLength isn't supported, etc. It also
wastes space in the FST, since the original term is stored again as an output
when includeOrig is just one bit.
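To make the difference concrete, here is a minimal, self-contained sketch of the two encodings. It does not use the Lucene API: the `Rule` class and both method names are hypothetical, and the second encoding (keep the original token, store only the other terms) is an assumption inferred from the remark that includeOrig is one bit, not a copy of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandEncodingSketch {

    /** Hypothetical model of one synonym rule; not a Lucene class. */
    public static final class Rule {
        final String input;
        final List<String> outputs;
        final boolean includeOrig;

        Rule(String input, List<String> outputs, boolean includeOrig) {
            this.input = input;
            this.outputs = outputs;
            this.includeOrig = includeOrig;
        }

        @Override
        public String toString() {
            return input + " -> " + String.join(", ", outputs)
                    + " (includeOrig=" + includeOrig + ")";
        }
    }

    /** Encoding described in the issue: input repeated as an output, includeOrig=false. */
    public static List<Rule> encodeAsReplacement(List<String> group) {
        List<Rule> rules = new ArrayList<>();
        for (String input : group) {
            List<String> outputs = new ArrayList<>();
            outputs.add(input); // the original term is stored again as an output
            for (String other : group) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, false));
        }
        return rules;
    }

    /** Assumed alternative: only the other terms are outputs, includeOrig=true. */
    public static List<Rule> encodeKeepingOriginal(List<String> group) {
        List<Rule> rules = new ArrayList<>();
        for (String input : group) {
            List<String> outputs = new ArrayList<>();
            for (String other : group) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, true));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("A", "B", "C");
        for (Rule r : encodeAsReplacement(group)) {
            System.out.println(r);
        }
        for (Rule r : encodeKeepingOriginal(group)) {
            System.out.println(r);
        }
    }
}
```

In this model the second encoding stores one fewer output per rule and leaves the original token in place, so it would keep its own type rather than being re-emitted as a synonym.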
Example with "spiderman, spider man" and analysis on 'spider man':
Trunk:
term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
You can see this is confusing: all the words have type SYNONYM, because spider
and man got deleted and totally replaced by new terms (which happen to have
the same text).
Patch:
term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
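The effect of positionLength in the two dumps can be checked with a little position arithmetic. The sketch below is an illustration only: `Token` and `spans` are hypothetical names, not Lucene classes, and it assumes the usual convention that a token starts at the running sum of position increments (starting from -1) and ends at start + positionLength.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionLengthSketch {

    /** Hypothetical stand-in for the attributes shown in the dumps above. */
    public static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        public Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** Returns "term:start->end" for each token, in stream order. */
    public static List<String> spans(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1; // the first token's increment of 1 moves this to 0
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(t.term + ":" + position + "->" + (position + t.positionLength));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> trunk = Arrays.asList(
                new Token("spider", 1, 1),
                new Token("spiderman", 0, 1),
                new Token("man", 1, 1));
        List<Token> patch = Arrays.asList(
                new Token("spider", 1, 1),
                new Token("spiderman", 0, 2),
                new Token("man", 1, 1));
        System.out.println("trunk: " + spans(trunk));
        System.out.println("patch: " + spans(patch));
    }
}
```

Under these assumptions, spiderman on trunk spans 0->1, the same as spider alone, while with the patch it spans 0->2 and so ends where man ends.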