[
https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian Ribas updated LUCENE-6400:
------------------------------
Attachment: PositionLenghtAndType-unittests.patch
I wrote some unit tests to get into the code, and saw a behavior that I
think is not right.
For the same example of "spiderman, spider man", an analysis on 'spiderman'
gives:
term=spiderman,positionIncrement=1,*positionLength=1*,type=word
term=spider,positionIncrement=1,positionLength=1,type=SYNONYM
term=man,positionIncrement=1,positionLength=1,type=SYNONYM
To be coherent, I thought it should be:
term=spiderman,positionIncrement=1,*positionLength=2*,type=word
term=spider,positionIncrement=1,positionLength=1,type=SYNONYM
term=man,positionIncrement=1,positionLength=1,type=SYNONYM
Things get more complicated when the synonyms differ more widely in
word count, as in this example (from the Elasticsearch documentation:
http://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html):
"usa,united states,u s a,united states of america"
Analysis of the longest synonym, 'united states of america', works fine,
but analysis of a text containing a shorter one, such as 'the united states
is wealthy', still yields a sausage.
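To make the sausage-versus-graph distinction concrete, here is a small dependency-free Python model (not Lucene code; the Token tuple and the helper names are made up for illustration). It turns positionIncrement/positionLength pairs into arcs and lists every path through the result, using the "Trunk" and "Patch" outputs for 'spider man' quoted in the issue description below.

```python
from collections import namedtuple

# Hypothetical stand-in for the attributes printed in the issue; not Lucene's API.
Token = namedtuple("Token", "term pos_inc pos_len")


def arcs(tokens):
    """Turn each token into a (term, from_node, to_node) arc."""
    pos = -1
    out = []
    for t in tokens:
        pos += t.pos_inc                      # positionIncrement moves the start node
        out.append((t.term, pos, pos + t.pos_len))  # positionLength sets the end node
    return out


def is_sausage(tokens):
    """'Sausage' here means no token spans more than one position."""
    return all(t.pos_len == 1 for t in tokens)


def paths(tokens):
    """Enumerate every term sequence from node 0 to the last node."""
    graph = arcs(tokens)
    end = max(to for _, _, to in graph)

    def walk(node):
        if node == end:
            return [[]]
        return [[term] + rest
                for term, frm, to in graph if frm == node
                for rest in walk(to)]

    return [" ".join(p) for p in walk(0)]


# Values copied from the "Trunk" and "Patch" outputs in the issue description.
trunk = [Token("spider", 1, 1), Token("spiderman", 0, 1), Token("man", 1, 1)]
patch = [Token("spider", 1, 1), Token("spiderman", 0, 2), Token("man", 1, 1)]

print(is_sausage(trunk), paths(trunk))
print(is_sausage(patch), paths(patch))
```

In this model the trunk tokens admit the path "spiderman man", which no input text contained, while the patch tokens admit only "spider man" and "spiderman".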
I attached a patch with the changes plus unit tests that exemplify these
situations. The tests currently pass, but the results I think are correct
are in comments just under the ones I think are wrong. Use it if it is useful,
and discard it if not, of course.
I'm not sure I'll be able to do it, but I'm looking into how to handle
positionLength to have a better graph.
> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
> Key: LUCENE-6400
> URL: https://issues.apache.org/jira/browse/LUCENE-6400
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-6400.patch, PositionLenghtAndType-unittests.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true'
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (synfilter sees it all as replacements, and
> makes all the terms with type synonym, positionLength isn't supported, etc.)
> and it wastes space in the FST (includeOrig is just one bit).
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because
> spider and man got deleted and totally replaced by new terms (which happen
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
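As a rough illustration of the encoding described in the issue, the sketch below models the two ways of storing an expand=true rule in plain Python (no Lucene calls; the function names are hypothetical). The first function reproduces the A -> A, B, C (includeOrig=false) form listed above. The second is only my reading of what "encode correctly" could mean: store just the other terms and set includeOrig=true. I have not checked that against the attached patch.

```python
def encode_trunk(synonyms):
    """Each term maps to itself plus all others, with includeOrig=False."""
    return {s: ([s] + [o for o in synonyms if o != s], False) for s in synonyms}


def encode_with_include_orig(synonyms):
    """Assumed alternative: each term maps only to the others, includeOrig=True."""
    return {s: ([o for o in synonyms if o != s], True) for s in synonyms}


def stored_outputs(rules):
    """Count the output terms stored across all rules."""
    return sum(len(outputs) for outputs, _ in rules.values())


rule = ["A", "B", "C"]
print(encode_trunk(rule))
print(encode_with_include_orig(rule))
print(stored_outputs(encode_trunk(rule)), stored_outputs(encode_with_include_orig(rule)))
```

Under this assumption the original token is kept rather than replaced, so it could retain type=word, and each rule stores one output fewer.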