[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

Steve Rowe (JIRA) Fri, 11 May 2018 18:55:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472873#comment-16472873
 ]


Steve Rowe commented on LUCENE-8273:
------------------------------------

I've belatedly reviewed the changes committed under this issue - awesome 
additions!

I noticed a few small things that could be improved, and in putting together a 
patch I found what I think is a bug/omission in the design: 
{{TermExclusionFilterFactory}} can't be used outside of {{CustomAnalyzer}} 
because {{ConditionalTokenFilterFactory.inform()}} expects 
{{setInnerFilters()}} to have already been called, but 
{{TermExclusionFilterFactory}} doesn't have the ability to do that. As a 
result, the protected word list can't be loaded. AFAICT this will prevent 
{{TermExclusionFilterFactory}} from being used in Solr, e.g.

My idea to address the problem is to allow {{TermExclusionFilterFactory}} to 
accept (but not require) as args wrapped filters and those filters' args, and 
then call {{setInnerFilters()}} from the ctor. I've added a factory test suite 
that appears to validate the idea. See javadocs in the patch for interface 
details.

Because of the way I produced it, the patch (forthcoming) includes several 
other changes, some of which may be controversial. I'd be happy to factor out 
any of these changes if you don't agree with all of them, [~romseygeek]. The 
other changes, in no particular order:
 * I think {{TermExclusionFilter}} should be renamed to {{ProtectedTermFilter}} 
(or similar) - forcing devs&users to deal with *both* "protected" and 
"excluded" seems pointless to me; it should be either one or the other 
everywhere.
 * {{ProtectedTermFilter}} ctor should {{require()}} the protected terms file 
param rather than {{get()}} it, since it is in fact required.
 * Added {{TestConditionalTokenFilter.testMultipleConditionalFilters()}}
 * The boolean sense of {{ConditionalTokenFilter.shouldFilter()}} is backward 
in the class javadocs: it says that shouldFilter:false will execute the wrapped 
filters, but AFAICT it's the opposite.
 * Some other minor javadoc improvements.

If the approach is okay, there is a little bit more work to do: some additional 
testing (more tests for graceful handling of bad input, and a Solr test), and 
adding Solr ref guide doc.  I haven't run precommit or all tests yet, so there 
may be some outstanding problems.

> Add a ConditionalTokenFilter
> ----------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

Reply via email to