[jira] [Commented] (LUCENE-6400) SynonymParser should encode 'expand' correctly.

Michael McCandless (JIRA) Thu, 16 Apr 2015 12:27:18 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498554#comment-14498554 ]


Michael McCandless commented on LUCENE-6400:
--------------------------------------------

bq. Yeah, that problem is disappointing, but a difficult problem. Definitely 
one that needs to be fixed. I get the impression from Mike (who is the expert 
on it), that it requires changes to the tokenstream api so that it can be done 
safely.

Fixing SynFilter to be able to "make positions" is really important.

It's somewhat tricky but not impossible, because PosIncAtt + PosLengthAtt are 
sufficient for expressing any graph and changing any incoming graph to another 
graph (with enough buffering).  I don't think we need changes to TokenStream 
API, only to the SynFilter impl.
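
To make the "sufficient for expressing any graph" point concrete, here is a minimal sketch of how the two attributes decode into graph arcs. It does not use Lucene itself: Tok and TokenGraph are hypothetical stand-ins, and the only assumption carried over from Lucene is that positions start at -1, so a first token with positionIncrement=1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenGraph {
    /** Decodes a token sequence into arcs written as "from -[term]-> to". */
    static List<String> arcs(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // assumption: a first token with posInc=1 lands on position 0
        for (Tok t : tokens) {
            pos += t.posInc;                 // positionIncrement picks the start node
            int end = pos + t.posLen;        // positionLength picks the end node
            out.add(pos + " -[" + t.term + "]-> " + end);
        }
        return out;
    }

    public static void main(String[] args) {
        // The "Patch" output quoted in the issue description.
        List<Tok> patched = Arrays.asList(
            new Tok("spider", 1, 1),
            new Tok("spiderman", 0, 2),
            new Tok("man", 1, 1));
        arcs(patched).forEach(System.out::println);
        // prints:
        // 0 -[spider]-> 1
        // 0 -[spiderman]-> 2
        // 1 -[man]-> 2
    }
}

/** Hypothetical stand-in for a token; only the two graph attributes are modeled. */
final class Tok {
    final String term;
    final int posInc; // positionIncrement, relative to the previous token
    final int posLen; // positionLength, how many positions the token spans

    Tok(String term, int posInc, int posLen) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
    }
}
```

With the trunk values (spiderman at positionLength=1) the same decoding gives 0 -[spiderman]-> 1, i.e. spiderman sits next to spider only instead of spanning both spider and man.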

What makes fixing SynFilter tricky is noticing when a new position was created 
and then also increasing the position lengths of any syns that had "spanned" 
that new position, I think?  And it may require more buffering than SynFilter 
does now...
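
A rough sketch of that bookkeeping, under assumptions of my own rather than anything in the SynFilter code: tokens are held as buffered arcs with absolute from/to nodes, and "making a position" means inserting new nodes immediately before an existing node k. Arc and PositionInserter are hypothetical names, and the dns / lookup input is an invented example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionInserter {
    /**
     * Inserts `count` new nodes immediately before old node `k`.
     * Arcs entirely before k are untouched, arcs starting at or after k shift,
     * and arcs that start before k but end at or after k grow by `count`.
     */
    static List<Arc> insertBefore(List<Arc> arcs, int k, int count) {
        List<Arc> out = new ArrayList<>();
        for (Arc a : arcs) {
            int from = a.from >= k ? a.from + count : a.from;
            int to = a.to >= k ? a.to + count : a.to;
            out.add(new Arc(a.term, from, to));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input "dns lookup"; a three-word synonym for "dns"
        // needs two extra positions between node 0 and node 1.
        List<Arc> buffered = Arrays.asList(new Arc("dns", 0, 1), new Arc("lookup", 1, 2));
        for (Arc a : insertBefore(buffered, 1, 2)) {
            System.out.println(a.term + " " + a.from + "->" + a.to
                + " posLength=" + a.posLength());
        }
        // prints:
        // dns 0->3 posLength=3
        // lookup 3->4 posLength=1
    }
}

/** Hypothetical buffered token, stored as an arc between absolute positions. */
final class Arc {
    final String term;
    final int from;
    final int to;

    Arc(String term, int from, int to) {
        this.term = term;
        this.from = from;
        this.to = to;
    }

    int posLength() {
        return to - from;
    }
}
```

The sketch rewrites every buffered arc on each insertion, which is why the amount of buffering matters: any token still able to span the new position has to be held back until the insertion is known.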

> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
>                 Key: LUCENE-6400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6400
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch, 
> LUCENE-6400.patch, PositionLenghtAndType-unittests.patch, 
> unittests-expand-and-parse.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true' 
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (SynFilter sees it all as replacements, and 
> gives all the terms type SYNONYM, positionLength isn't supported, etc.) 
> and it wastes space in the FST (includeOrig is just one bit). 
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because 
> spider and man got deleted and totally replaced by new terms (which happen 
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
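
For reference, the two ways of encoding an expand=true rule can be sketched as follows. The first method reproduces the A, B, C table from the issue description. The second is only my reading of what "encode 'expand' correctly" implies (keep the original via the includeOrig bit and list only the other terms); it is an assumption, not taken from the attached patches. Rule and ExpandEncoding are hypothetical names, not SolrSynonymParser API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExpandEncoding {
    /** Encoding described for trunk: each term is replaced by itself plus the others. */
    static List<Rule> asReplacements(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            outputs.add(input); // the original is re-emitted as if it were a synonym
            for (String t : terms) {
                if (!t.equals(input)) outputs.add(t);
            }
            rules.add(new Rule(input, outputs, false));
        }
        return rules;
    }

    /** Assumed alternative: keep the original token and add only the other terms. */
    static List<Rule> asExpansions(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            for (String t : terms) {
                if (!t.equals(input)) outputs.add(t);
            }
            rules.add(new Rule(input, outputs, true));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("A", "B", "C");
        asReplacements(terms).forEach(System.out::println);
        asExpansions(terms).forEach(System.out::println);
    }
}

/** Hypothetical rule record: one input term, its outputs, and the includeOrig bit. */
final class Rule {
    final String input;
    final List<String> outputs;
    final boolean includeOrig;

    Rule(String input, List<String> outputs, boolean includeOrig) {
        this.input = input;
        this.outputs = outputs;
        this.includeOrig = includeOrig;
    }

    @Override
    public String toString() {
        return input + " -> " + String.join(", ", outputs)
            + " (includeOrig=" + includeOrig + ")";
    }
}
```

Under the second encoding each rule stores one fewer output, and the original token keeps its own type instead of being re-emitted as a synonym.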



