[
https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498554#comment-14498554
]
Michael McCandless commented on LUCENE-6400:
--------------------------------------------
bq. Yeah, that problem is disappointing, but a difficult problem. Definitely
one that needs to be fixed. I get the impression from Mike (who is the expert
on it), that it requires changes to the tokenstream api so that it can be done
safely.
Fixing SynFilter to be able to "make positions" is really important.
It's somewhat tricky but not impossible, because PosIncAtt + PosLengthAtt are
sufficient for expressing any graph and changing any incoming graph to another
graph (with enough buffering). I don't think we need changes to TokenStream
API, only to the SynFilter impl.
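
To make the "sufficient for expressing any graph" point concrete, here is a minimal sketch in plain Java (no Lucene dependency). It decodes positionIncrement/positionLength pairs into arcs between position nodes. The class and method names are hypothetical and only for illustration; the token values are taken from the Trunk and Patch outputs in the issue description quoted below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: turns per-token
// (positionIncrement, positionLength) pairs into "from->to:term" arcs.
final class TokenGraphSketch {
    static List<String> arcs(String[] terms, int[] posInc, int[] posLen) {
        List<String> out = new ArrayList<>();
        int pos = -1; // position before the first token
        for (int i = 0; i < terms.length; i++) {
            pos += posInc[i];         // node this token leaves from
            int to = pos + posLen[i]; // node this token arrives at
            out.add(pos + "->" + to + ":" + terms[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"spider", "spiderman", "man"};
        int[] posInc = {1, 0, 1};
        // positionLength values from the "Patch" output
        System.out.println(arcs(terms, posInc, new int[] {1, 2, 1}));
        // positionLength values from the "Trunk" output
        System.out.println(arcs(terms, posInc, new int[] {1, 1, 1}));
    }
}
```

With the Patch values, spiderman becomes the arc 0->2 and runs parallel to spider (0->1) followed by man (1->2). With the Trunk values it is 0->1, so the graph wrongly contains the path "spiderman man".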
What makes fixing SynFilter tricky is noting when a new position was created
and then fixing any syns that had "spanned" that new position to also increase
their position lengths, I think? And it may require more buffering than syn
filter does now...
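
One possible reading of that bookkeeping, as a rough sketch only: this is not how SynFilter works today, and the names are hypothetical. It assumes the tokens are already buffered with absolute start positions. When position p is split in two, every token that starts after p shifts right by one, and every token that spans p gets a longer position length.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: adjusts buffered tokens when a new
// position is created by splitting position p into two positions.
final class SplitPositionSketch {
    // from[i] = absolute start position of token i, len[i] = its positionLength.
    static void splitPosition(int[] from, int[] len, int p) {
        for (int i = 0; i < from.length; i++) {
            if (from[i] > p) {
                from[i]++;   // token starts after the split: shift it right
            } else if (from[i] + len[i] > p) {
                len[i]++;    // token spans the split: it now covers one more position
            }
            // tokens that end at or before p are left unchanged
        }
    }

    public static void main(String[] args) {
        // Assumed input: "spiderman" at position 0, followed by "rocks" at position 1.
        int[] from = {0, 1};
        int[] len = {1, 1};
        splitPosition(from, len, 0); // make room for "spider" + "man" under "spiderman"
        System.out.println(Arrays.toString(from) + " " + Arrays.toString(len));
    }
}
```

After the split, "spiderman" covers positions 0 and 1, so "spider" and "man" could be added at positions 0 and 1 with length 1 each, and "rocks" moves to position 2. Every token that could span the split has to be buffered before it is emitted, which is where the extra buffering would come from.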
> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
> Key: LUCENE-6400
> URL: https://issues.apache.org/jira/browse/LUCENE-6400
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch,
> LUCENE-6400.patch, PositionLenghtAndType-unittests.patch,
> unittests-expand-and-parse.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true'
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (synfilter sees it all as replacements, and
> makes all the terms with type synonym, positionLength isn't supported, etc.)
> and it wastes space in the FST (includeOrig is just one bit).
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because
> spider and man got deleted and totally replaced by new terms (which happen
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*