[jira] [Comment Edited] (LUCENE-8273) Add a ConditionalTokenFilter

Steve Rowe (JIRA) Mon, 14 May 2018 18:51:22 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475166#comment-16475166
 ]


Steve Rowe edited comment on LUCENE-8273 at 5/15/18 1:50 AM:
-------------------------------------------------------------

I stumbled on what looks like a {{ProtectedTermFilter}} bug. It shows up when 
the wrapped filter is a filtering token filter (one that removes tokens, such 
as {{LengthFilter}}) and the content to be analyzed contains at least one 
non-protected term before a protected term; in that case protection fails and 
the protected term is dropped:

{code:java|title=TestProtectedTermFilter.java}
  public void testWrappedFilteringTokenFilter() throws IOException {
    CharArraySet protectedTerms = new CharArraySet(5, true);
    protectedTerms.add("foobar");
    TokenStream stream = whitespaceMockTokenizer("foobar abc");
    stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
    assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // succeeds

    stream = whitespaceMockTokenizer("wuthering foobar abc");
    stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
    assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // fails @ term 0: Actual: abc
  }
{code}

I haven't yet figured out what the problem is.  Alan, do you understand what's 
happening here?
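
For reference, the behavior the test expects can be modeled in a few lines of plain Java. This is only a sketch of the intended semantics, not Lucene code: {{ProtectedTermModel}} and its {{filter}} method are hypothetical names, the length predicate is a stand-in for {{LengthFilter(in, 1, 4)}}, and the case-insensitive matching of the {{CharArraySet}} is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Plain-Java model of the expected semantics; this is NOT Lucene code.
class ProtectedTermModel {

  // Hypothetical helper: a protected term bypasses the wrapped filter,
  // every other term is kept only if the wrapped filter accepts it.
  static List<String> filter(List<String> terms,
                             Set<String> protectedTerms,
                             Predicate<String> wrappedAccepts) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      if (protectedTerms.contains(term) || wrappedAccepts.test(term)) {
        out.add(term);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> protectedTerms = Set.of("foobar");
    // Stand-in for LengthFilter(in, 1, 4): accept terms of length 1 to 4.
    Predicate<String> lengthFilter = t -> t.length() >= 1 && t.length() <= 4;

    // Both inputs should yield the same two terms.
    System.out.println(filter(List.of("foobar", "abc"), protectedTerms, lengthFilter));
    System.out.println(filter(List.of("wuthering", "foobar", "abc"), protectedTerms, lengthFilter));
  }
}
```

Under this model a removed term ahead of the protected one has no effect on whether the protected term survives, which is what the second assertion in the test checks.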


was (Author: steve_rowe):
I stumbled on what looks like a {{ProtectedTermFilter}} bug when a wrapped 
filter is a filtering token filter, and the content to be analyzed contains at 
least one non-protected term prior to a protected term; in this case protection 
fails:

{code:java|title=TestProtectedTerm.java}
  public void testWrappedFilteringTokenFilter() throws IOException {
    CharArraySet protectedTerms = new CharArraySet(5, true);
    protectedTerms.add("foobar");
    TokenStream stream = whitespaceMockTokenizer("foobar abc");
    stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
    assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // succeeds

    stream = whitespaceMockTokenizer("wuthering foobar abc");
    stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
    assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // fails @ term 0: Actual: abc
  }
{code}

I haven't yet figured out what the problem is.  Alan, do you understand what's 
happening here?

> Add a ConditionalTokenFilter
> ----------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.
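
The idea in the issue description can be illustrated with a small plain-Java model. It is a sketch under stated assumptions rather than the API proposed in the patches: {{ConditionalFilterModel}} and {{apply}} are hypothetical names, terms are plain strings instead of a TokenStream, and the hyphen splitter is only a crude stand-in for WordDelimiterFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of conditional filtering; this is NOT the Lucene API.
class ConditionalFilterModel {

  // Hypothetical helper: run the wrapped filter only on terms that match
  // the condition; all other terms bypass it unchanged.
  static List<String> apply(List<String> terms,
                            Predicate<String> shouldFilter,
                            Function<String, List<String>> wrapped) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      if (shouldFilter.test(term)) {
        out.addAll(wrapped.apply(term));
      } else {
        out.add(term);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Condition from the issue description: the term contains a hyphen.
    Predicate<String> hasHyphen = t -> t.contains("-");
    // Crude stand-in for WordDelimiterFilter: split on hyphens only.
    Function<String, List<String>> splitOnHyphen = t -> Arrays.asList(t.split("-"));

    System.out.println(apply(List.of("wi-fi", "Lucene7"), hasHyphen, splitOnHyphen));
  }
}
```

Here "wi-fi" goes through the wrapped filter and is split, while "Lucene7" never reaches it and is emitted as-is.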



