[ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475166#comment-16475166 ]
Steve Rowe edited comment on LUCENE-8273 at 5/15/18 1:50 AM:
-------------------------------------------------------------
I stumbled on what looks like a {{ProtectedTermFilter}} bug. It shows up when the
wrapped filter is a filtering token filter (one that removes tokens) and the
content to be analyzed contains at least one non-protected term before a
protected term. In that case protection fails and the protected term is dropped:
{code:java|title=TestProtectedTermFilter.java}
public void testWrappedFilteringTokenFilter() throws IOException {
  CharArraySet protectedTerms = new CharArraySet(5, true);
  protectedTerms.add("foobar");

  TokenStream stream = whitespaceMockTokenizer("foobar abc");
  stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
  assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // succeeds

  stream = whitespaceMockTokenizer("wuthering foobar abc");
  stream = new ProtectedTermFilter(protectedTerms, stream, in -> new LengthFilter(in, 1, 4));
  assertTokenStreamContents(stream, new String[]{ "foobar", "abc" }); // fails @ term 0: Actual: abc
}
{code}
I haven't yet figured out what the problem is. Alan, do you understand what's
happening here?
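
For reference, the result the test expects can be modeled in plain Java without any Lucene classes. This is only a sketch of the intended semantics (a term survives if it is protected or if its length is within the filter's bounds), not the {{ProtectedTermFilter}} implementation; the class and method names are hypothetical, and the protected set is assumed to hold lowercase entries to mimic the ignore-case {{CharArraySet}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical plain-Java model of the expected behavior; it does not use Lucene.
class ProtectedLengthModel {

  // Keeps a term if it is protected (compared case-insensitively) or if its
  // length lies within [min, max]; everything else is dropped.
  static List<String> analyze(String text, Set<String> protectedTerms, int min, int max) {
    List<String> out = new ArrayList<>();
    for (String term : text.trim().split("\\s+")) {
      boolean isProtected = protectedTerms.contains(term.toLowerCase(Locale.ROOT));
      boolean lengthOk = term.length() >= min && term.length() <= max;
      if (isProtected || lengthOk) {
        out.add(term);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> protectedTerms = Set.of("foobar");
    System.out.println(analyze("foobar abc", protectedTerms, 1, 4));           // [foobar, abc]
    System.out.println(analyze("wuthering foobar abc", protectedTerms, 1, 4)); // [foobar, abc]
  }
}
```

In this model "wuthering" is removed by the length check while "foobar" is kept because it is protected, which is the outcome the second assertion in the test asks for.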
> Add a ConditionalTokenFilter
> ----------------------------
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-part2.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter
> in such a way that it could optionally be bypassed based on the current state
> of the TokenStream. This could be used to, for example, only apply
> WordDelimiterFilter to terms that contain hyphens.
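
The idea in the issue description can be illustrated with a small plain-Java model. It is a sketch under stated assumptions, not the proposed Lucene API: terms are plain strings, the wrapped step is a function from one term to a list of terms, and the names {{ConditionalModel}}, {{apply}} and {{splitOnHyphen}} are hypothetical. Splitting on hyphens stands in for a word-delimiter step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical plain-Java model of conditional filtering; not Lucene's API.
class ConditionalModel {

  // Applies `wrapped` only to terms matching `shouldFilter`;
  // all other terms bypass it and pass through unchanged.
  static List<String> apply(List<String> terms,
                            Predicate<String> shouldFilter,
                            Function<String, List<String>> wrapped) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      if (shouldFilter.test(term)) {
        out.addAll(wrapped.apply(term));
      } else {
        out.add(term);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Stand-in for a word-delimiter step: split a term on hyphens.
    Function<String, List<String>> splitOnHyphen = t -> Arrays.asList(t.split("-"));
    List<String> result = apply(List.of("wi-fi", "router"), t -> t.contains("-"), splitOnHyphen);
    System.out.println(result); // [wi, fi, router]
  }
}
```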