[ 
https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6400:
---------------------------------------
    Attachment: LUCENE-6400.patch

OK, a bit more simplifying: I don't create outputs in the 2a case, I moved 
the inputs/outputs declarations down to where they are used, and I just pass 
true / false for expand since we are already in the if clauses...

> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
>                 Key: LUCENE-6400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6400
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch, 
> LUCENE-6400.patch, PositionLenghtAndType-unittests.patch, 
> unittests-expand-and-parse.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true' 
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives somewhat buggy output (the synonym filter sees them all as 
> replacements and gives every term type SYNONYM, positionLength isn't 
> supported, etc.) and it wastes space in the FST (includeOrig is just one bit). 
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM because 
> spider and man were deleted and totally replaced by new terms (which happen 
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
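
To make the two encodings in the description concrete, here is a rough sketch in plain Java with no Lucene dependency. The names SynonymExpandSketch, Rule, encodeAsReplacement and encodeWithIncludeOrig are hypothetical and exist only for illustration. The second encoding is an assumption about what the fix aims for, inferred from the remark that includeOrig is just one bit; it is not copied from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SynonymExpandSketch {

    /** Hypothetical stand-in for one encoded rule; this is not a Lucene class. */
    static final class Rule {
        final String input;
        final List<String> outputs;
        final boolean includeOrig;

        Rule(String input, List<String> outputs, boolean includeOrig) {
            this.input = input;
            this.outputs = outputs;
            this.includeOrig = includeOrig;
        }

        @Override
        public String toString() {
            return input + " -> " + String.join(", ", outputs)
                    + " (includeOrig=" + includeOrig + ")";
        }
    }

    /**
     * Encoding described in the issue as today's behaviour: every term maps to
     * the full list (itself first) with includeOrig=false, so the original
     * token is treated as replaced.
     */
    static List<Rule> encodeAsReplacement(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            outputs.add(input);
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, false));
        }
        return rules;
    }

    /**
     * Assumed alternative: every term maps only to the other terms, and the
     * single includeOrig bit records that the original token is kept.
     */
    static List<Rule> encodeWithIncludeOrig(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, true));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("A", "B", "C");
        System.out.println("Replacement-style encoding:");
        encodeAsReplacement(terms).forEach(System.out::println);
        System.out.println("includeOrig-style encoding (assumed):");
        encodeWithIncludeOrig(terms).forEach(System.out::println);
    }
}
```

In the second form each rule stores one fewer output, and the original token can keep its own type instead of being re-emitted as a synonym, which is consistent with the type=word lines in the "Patch" output quoted above.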



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
