[ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373727#comment-16373727
 ] 

Robert Muir commented on LUCENE-4065:
-------------------------------------

Well, I think that's a separate, "new" issue related to positionLength. Keep in 
mind this JIRA issue was opened before positionLength even existed. I think 
it's totally separate from the whole idea of giving the user a gaps option.

This is what the token stream looks like before StopFilter sees it. When you 
delete "the", it just transfers the position increment of 1 to "twd". It 
doesn't currently look at positionLength at all.

{noformat}
SynonymGraphFilter->term=the,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
SynonymGraphFilter->term=twd,positionIncrement=0,positionLength=3,type=word,termFrequency=1
SynonymGraphFilter->term=walking,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
SynonymGraphFilter->term=dead,positionIncrement=1,positionLength=1,type=SYNONYM,termFrequency=1
{noformat}
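
The transfer described above can be sketched in plain Java without any Lucene dependency. This is only a simplified model under stated assumptions: the `Token` class, `removeStopWords`, and the class name are hypothetical stand-ins for the real attributes and filter, not Lucene's API. It shows the removed token's position increment being added to the next kept token while positionLength is copied through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosIncTransferSketch {
    /** Hypothetical stand-in for a token's term, positionIncrement and positionLength. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + posInc + ",posLen=" + posLen + ")";
        }
    }

    /** Drops stop words and adds each dropped token's increment to the next kept token. */
    static List<Token> removeStopWords(List<Token> in, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // increments of removed tokens not yet handed on
        for (Token t : in) {
            if (stopWords.contains(t.term)) {
                skipped += t.posInc;
                continue;
            }
            // positionLength is copied as-is; this model never inspects it
            out.add(new Token(t.term, t.posInc + skipped, t.posLen));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> in = Arrays.asList(
                new Token("the", 1, 1),
                new Token("twd", 0, 3),
                new Token("walking", 1, 1),
                new Token("dead", 1, 1));
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        for (Token t : removeStopWords(in, stop)) {
            System.out.println(t);
        }
    }
}
```

In this model "twd" leaves the filter with posInc=1 and its positionLength of 3 unchanged.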

To be honest, it's unclear whether StopFilter is really the culprit. It's 
definitely funky the way that SynonymGraphFilter makes the original word "twd" 
a "synonym" (posInc=0)... I think if it didn't do that, you wouldn't have that 
problem in this case. But I don't know if it's a general solution to your problem.





> FilteringTokenFilter should never corrupt the tokenstream graph
> ---------------------------------------------------------------
>
>                 Key: LUCENE-4065
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4065
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like StopFilter have an option (true/false) to enable 
> position increments.
> If it's true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If it's false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural use cases for false, where you don't want gaps 
> because you want phrase queries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and don't totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (where we at least fixed this case to 
> not begin the tokenstream with posinc=0).
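
The difference between the two settings described in the issue can be illustrated with a small plain-Java model. This is a sketch under assumptions, not Lucene code: `Token`, `filter`, `positions`, and the `enablePositionIncrements` flag name are hypothetical stand-ins, and the "false" branch simply leaves increments untouched. The input adds a leading token "a" so that the effect of moving a stacked token onto another word becomes visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GapModeSketch {
    /** Hypothetical stand-in for a token's term and positionIncrement. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Removes stop words; carries removed increments forward only when the flag is true. */
    static List<Token> filter(List<Token> in, Set<String> stop, boolean enablePositionIncrements) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stop.contains(t.term)) {
                if (enablePositionIncrements) {
                    skipped += t.posInc;
                }
                continue;
            }
            out.add(new Token(t.term, t.posInc + skipped));
            skipped = 0;
        }
        return out;
    }

    /** Absolute positions: a running sum of increments, with the first position at 0. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> in = Arrays.asList(
                new Token("a", 1), new Token("the", 1), new Token("twd", 0),
                new Token("walking", 1), new Token("dead", 1));
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println("true:  " + positions(filter(in, stop, true)));
        System.out.println("false: " + positions(filter(in, stop, false)));
    }
}
```

With the flag true, "twd" keeps its original position 1. With the flag false, "twd" keeps posInc=0 and lands on position 0, stacked on "a", which is the kind of corruption the issue describes.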


