[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

Steve Rowe (JIRA) Wed, 21 Feb 2018 13:33:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372046#comment-16372046
 ]


Steve Rowe commented on LUCENE-4065:
------------------------------------

[~rcmuir] commented over on 
[https://issues.apache.org/jira/browse/SOLR-11968?focusedCommentId=16370916&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16370916|SOLR-11968]
 about this issue:

{quote}
I think the issue is still valid, it's a little more complex now because of 
positionLength (means more buffering when you see posLength > 1, because you'll 
need to adjust if you remove something in its path), but the idea is the same: 
give the user a choice between "insert mode" and "replace mode". But this new 
"insert mode" would actually work correctly, correcting posLengths before and 
posIncs after as appropriate, similar to how your editor might have to 
recompute some line breaks/word wrapping and so on.

If you have baseball (length=2), base(length=1), ball(length=1), and you delete 
"base" in this case, you need to change baseball's length to 1 before you omit 
it, because you deleted base. That's the "buffering before" that would be 
required for posLength. And you still need the same buffering described on the 
issue for posInc=0 that might occur after the fact, so you don't wrongly 
transfer synonyms to different words entirely.

It would be slower than "replace mode" that we have today, but only because of 
the buffering, and I think it's pretty contained, but I haven't fully thought it 
through or tried to write any code.
{quote}

 +1, though I find the nomenclature confusing; in your proposed "insert mode", 
token deletions would not leave any trace of the deleted tokens -- in posinc 
and poslen -- right?  (I get that you mean "insert mode" and "replace mode" as 
a metaphor for editor operations.)  Isn't the issue just whether to leave 
gaps (as indicated by posinc and poslen) where deleted tokens were?
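
To make the baseball example from the quote concrete, here is a minimal
sketch in plain Java. It is not Lucene code: the class PosLengthRepairSketch,
the Tok type and removeWithoutGaps are hypothetical names, and it works on a
whole token list at once instead of buffering inside a TokenFilter. It also
assumes one particular reading of the rule: a graph node is merged into its
predecessor when every token that ended there was removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model (not Lucene API): tokens carry only term, posInc and posLen. */
final class PosLengthRepairSketch {

    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
        @Override public String toString() {
            return term + "(inc=" + posInc + ",len=" + posLen + ")";
        }
    }

    /**
     * Removes tokens without leaving gaps. Assumes a well-formed input:
     * first posInc >= 1, every posInc >= 0, every posLen >= 1.
     */
    static List<Tok> removeWithoutGaps(List<Tok> in, Set<String> remove) {
        int n = in.size();
        int[] start = new int[n];
        int pos = -1;
        int maxNode = 0;
        for (int i = 0; i < n; i++) {
            pos += in.get(i).posInc;              // absolute start node
            start[i] = pos;
            maxNode = Math.max(maxNode, pos + in.get(i).posLen);
        }
        // For each node, record whether a removed or a surviving token ends there.
        boolean[] lostEdge = new boolean[maxNode + 1];
        boolean[] keptEdge = new boolean[maxNode + 1];
        for (int i = 0; i < n; i++) {
            int end = start[i] + in.get(i).posLen;
            if (remove.contains(in.get(i).term)) lostEdge[end] = true;
            else keptEdge[end] = true;
        }
        // A node that only removed tokens led to is merged into its predecessor.
        int[] renumbered = new int[maxNode + 1];
        int merged = 0;
        for (int node = 0; node <= maxNode; node++) {
            if (lostEdge[node] && !keptEdge[node]) merged++;
            renumbered[node] = node - merged;
        }
        List<Tok> out = new ArrayList<>();
        int last = -1;
        for (int i = 0; i < n; i++) {
            Tok t = in.get(i);
            if (remove.contains(t.term)) continue;
            int s = renumbered[start[i]];
            int e = renumbered[start[i] + t.posLen];
            out.add(new Tok(t.term, s - last, e - s)); // corrected posInc and posLen
            last = s;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = Arrays.asList(
            new Tok("baseball", 1, 2), new Tok("base", 0, 1), new Tok("ball", 1, 1));
        System.out.println(removeWithoutGaps(in, Collections.singleton("base")));
        // prints [baseball(inc=1,len=1), ball(inc=0,len=1)]
    }
}
```

Deleting "base" shortens "baseball" to length 1 and leaves no gap in either
posinc or poslen. A streaming filter cannot see the later removal when it
first reads "baseball", which is why the quote talks about buffering.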

> FilteringTokenFilter should never corrupt the tokenstream graph
> ---------------------------------------------------------------
>
>                 Key: LUCENE-4065
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4065
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-4065_test.patch
>
>
> Currently removers like stopfilter have an option (true/false) to enable 
> position increments.
> If it's true: it both inserts gaps where necessary AND propagates gaps down 
> the stream.
> If it's false: it does neither, which can totally mess up the tokenstream 
> graph (e.g. move synonyms to another word).
> There are totally valid natural usecases for false, where you don't want gaps 
> because you want phrasequeries to act as if the word was never actually there.
> But 'not inserting gaps' is separate from proper propagation of existing gaps.
> So I think we should provide an option (either fix 'false' or make it an 
> enum), where you still get a legit tokenstream and don't totally screw it up, 
> but you simply omit gaps.
> See LUCENE-3848 for more information (where we at least fixed this case to 
> not begin the tokenstream with posinc=0)
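
The three behaviours in the description (true, false, and the proposed
"legit stream without gaps") can be compared on a toy model. The sketch below
is plain Java, not Lucene code; GapModesSketch, Tok, Mode and filter are
hypothetical names. It processes a whole list instead of a stream, and its
DROP_INCREMENTS mode leaves out the first-token fix from LUCENE-3848. The
OMIT_GAPS rule is an assumption: a position disappears only when every token
at that position was removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not Lucene API): tokens carry only a term and a posInc. */
final class GapModesSketch {

    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    /** Hypothetical names for the behaviours discussed on the issue. */
    enum Mode { KEEP_GAPS, DROP_INCREMENTS, OMIT_GAPS }

    /** Assumes a well-formed input: first posInc >= 1, every posInc >= 0. */
    static List<Tok> filter(List<Tok> in, Set<String> remove, Mode mode) {
        List<Tok> out = new ArrayList<>();
        if (mode == Mode.DROP_INCREMENTS) {
            // "false": survivors keep their own increment, removed ones just vanish.
            for (Tok t : in) if (!remove.contains(t.term)) out.add(t);
            return out;
        }
        // Absolute position of every token; the first position is 0.
        int[] abs = new int[in.size()];
        int pos = -1;
        for (int i = 0; i < in.size(); i++) { pos += in.get(i).posInc; abs[i] = pos; }
        boolean[] hasToken = new boolean[pos + 1];
        boolean[] hasSurvivor = new boolean[pos + 1];
        for (int i = 0; i < in.size(); i++) {
            hasToken[abs[i]] = true;
            if (!remove.contains(in.get(i).term)) hasSurvivor[abs[i]] = true;
        }
        // newPos[p] = p minus the number of dropped positions before p.
        // KEEP_GAPS drops nothing; OMIT_GAPS drops positions whose tokens were all removed.
        int[] newPos = new int[pos + 1];
        int dropped = 0;
        for (int p = 0; p <= pos; p++) {
            newPos[p] = p - dropped;
            if (mode == Mode.OMIT_GAPS && hasToken[p] && !hasSurvivor[p]) dropped++;
        }
        int last = -1;
        for (int i = 0; i < in.size(); i++) {
            Tok t = in.get(i);
            if (remove.contains(t.term)) continue;
            int np = newPos[abs[i]];
            out.add(new Tok(t.term, np - last)); // increment relative to last survivor
            last = np;
        }
        return out;
    }

    public static void main(String[] args) {
        // "television" is a synonym stacked on "tv" (posInc=0).
        List<Tok> in = Arrays.asList(new Tok("big", 1), new Tok("tv", 1),
            new Tok("television", 0), new Tok("the", 1), new Tok("remote", 1));
        Set<String> remove = new HashSet<>(Arrays.asList("tv", "the"));
        System.out.println(filter(in, remove, Mode.KEEP_GAPS));
        // prints [big(+1), television(+1), remote(+2)]
        System.out.println(filter(in, remove, Mode.DROP_INCREMENTS));
        // prints [big(+1), television(+0), remote(+1)]
        System.out.println(filter(in, remove, Mode.OMIT_GAPS));
        // prints [big(+1), television(+1), remote(+1)]
    }
}
```

With DROP_INCREMENTS, "television" ends up stacked on "big", which is the
"move synonyms to another word" corruption. OMIT_GAPS keeps "television" in
its own position and still closes the hole left by "the".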



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

