[ 
https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486205#comment-14486205
 ] 

Robert Muir commented on LUCENE-6400:
-------------------------------------

Thanks Ian. What you see is a limitation of SynonymFilter (unrelated to this 
parser). SynonymFilter doesn't "introduce additional positions" except for a 
trailer at the end as a special case. Otherwise it "sausages" by interleaving 
phrases together. Changing this is much more complicated. 

So your "spiderman" case will not behave correctly, but it's unrelated to my 
patch here. The parser does the right thing... 

> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
>                 Key: LUCENE-6400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6400
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-6400.patch, PositionLenghtAndType-unittests.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true' 
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (SynonymFilter sees it all as replacements and 
> marks all the terms with type SYNONYM, positionLength isn't supported, etc.) 
> and it wastes space in the FST (includeOrig is just one bit). 
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because 
> spider and man got deleted and totally replaced by new terms (which happen 
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
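
To make the difference between the two encodings concrete, here is a rough, standalone sketch. It does not use the Lucene API: the class ExpandEncodingSketch, the Rule type, and both encode methods are hypothetical names made up for illustration. The trunk encoding follows the A, B, C listing in the description above. The patched encoding (each term maps only to the other terms, with includeOrig=true) is an assumption inferred from the remark that includeOrig is just one bit, not a copy of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExpandEncodingSketch {

    // Hypothetical stand-in for one synonym rule: its outputs plus the includeOrig flag.
    static final class Rule {
        final List<String> outputs;
        final boolean includeOrig;

        Rule(List<String> outputs, boolean includeOrig) {
            this.outputs = outputs;
            this.includeOrig = includeOrig;
        }

        @Override
        public String toString() {
            return String.join(", ", outputs) + " (includeOrig=" + includeOrig + ")";
        }
    }

    // Trunk-style encoding as listed in the issue: every term is replaced by
    // itself followed by all the other terms, with includeOrig=false.
    static Map<String, Rule> encodeTrunk(List<String> terms) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            outputs.add(input);
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.put(input, new Rule(outputs, false));
        }
        return rules;
    }

    // Assumed patched encoding: every term maps only to the other terms and
    // the original token is kept via includeOrig=true.
    static Map<String, Rule> encodePatched(List<String> terms) {
        Map<String, Rule> rules = new LinkedHashMap<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.put(input, new Rule(outputs, true));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("A", "B", "C");
        System.out.println("Trunk:");
        encodeTrunk(terms).forEach((in, rule) -> System.out.println(in + " -> " + rule));
        System.out.println("Patch (assumed):");
        encodePatched(terms).forEach((in, rule) -> System.out.println(in + " -> " + rule));
    }
}
```

In the sketch, the patched form stores one fewer output per rule and leaves the original token untouched, which would be consistent with spider and man keeping type=word in the patched analysis output above.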



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
