[
https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869297#action_12869297
]
Michael McCandless commented on LUCENE-2470:
--------------------------------------------
This is a great idea!
It'd give us much more composability in the analysis pipeline, since
individual filters (Shingle, Stem) would be fully independent, ie not
aware that they are being invoked from the BranchingFilter.
I think we should allow the conditional to switch between
sub-pipelines? EG I could make a stage that detects proper names
(say)... and if the token is not a proper name, it'll run through a
LowercaseFilter then StopFilter, else it passes through. So the
conditional would switch between full sub-pipelines.
We should also allow for 1 -> many sub-pipelines, eg you conditionally
invoke an ngram filter. Or many -> many, eg you conditionally invoke a
shingle filter.
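
A rough sketch of the 1 -> many case, again in plain Java with made-up names
(ConditionalNgramSketch, bigrams(), analyze() are assumptions, not existing
Lucene APIs). Tokens that match the condition are expanded into character
bigrams; all others pass through as-is. The many -> many (shingle) case is not
shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ConditionalNgramSketch {

    /**
     * Character bigrams of a token, counted in Java chars (not code points).
     * Tokens shorter than two chars are returned unchanged.
     */
    static List<String> bigrams(String token) {
        List<String> out = new ArrayList<>();
        if (token.length() < 2) {
            out.add(token);
            return out;
        }
        for (int i = 0; i + 2 <= token.length(); i++) {
            out.add(token.substring(i, i + 2));
        }
        return out;
    }

    /** One input token becomes many output tokens only when the condition holds. */
    static List<String> analyze(List<String> tokens, Predicate<String> condition) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (condition.test(token)) {
                out.addAll(bigrams(token));
            } else {
                out.add(token);
            }
        }
        return out;
    }
}
```

For example, analyze(Arrays.asList("ab", "wxyz"), t -> t.length() > 2) gives
[ab, wx, xy, yz].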
I think upgrading the analysis pipeline to write-once attr bindings
(LUCENE-2450) would make this BranchingFilter easier to implement.
With write-once bindings, there's full visibility on which attrs a
Stage writes to (changes). So this BranchingStage could easily
introspect to see which attrs its subs write to, invoke them as the
conditions require, and if none of the conditions apply, copy over the
necessary attrs.
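
Roughly what I have in mind, as a toy model only: this is not the LUCENE-2450
design, and Stage, writes(), Case, BranchingStage and ToyStemmer are all
hypothetical names. Attributes are modeled as a plain map, and each stage
declares up front which attrs it writes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class WriteOnceBranchSketch {

    /** Hypothetical stage that declares which attributes it writes. */
    interface Stage {
        Set<String> writes();

        /** Reads from the input attrs and returns only the attrs this stage writes. */
        Map<String, Object> process(Map<String, Object> in);
    }

    /** A condition paired with the sub-stage to invoke when it holds. */
    static final class Case {
        final Predicate<Map<String, Object>> condition;
        final Stage stage;

        Case(Predicate<Map<String, Object>> condition, Stage stage) {
            this.condition = condition;
            this.stage = stage;
        }
    }

    static final class BranchingStage implements Stage {
        private final List<Case> cases;
        private final Set<String> writes = new LinkedHashSet<>();

        BranchingStage(List<Case> cases) {
            this.cases = cases;
            // Introspect the subs: this stage writes the union of what they write.
            for (Case c : cases) {
                writes.addAll(c.stage.writes());
            }
        }

        @Override
        public Set<String> writes() {
            return writes;
        }

        @Override
        public Map<String, Object> process(Map<String, Object> in) {
            Map<String, Object> out = new HashMap<>();
            // Default: copy over every attr this stage is responsible for.
            for (String attr : writes) {
                if (in.containsKey(attr)) {
                    out.put(attr, in.get(attr));
                }
            }
            // The first matching sub overrides the attrs it writes.
            for (Case c : cases) {
                if (c.condition.test(in)) {
                    out.putAll(c.stage.process(in));
                    break;
                }
            }
            return out;
        }
    }

    /** Toy stemmer: strips one trailing "s" from the "term" attr. */
    static final class ToyStemmer implements Stage {
        @Override
        public Set<String> writes() {
            return Collections.singleton("term");
        }

        @Override
        public Map<String, Object> process(Map<String, Object> in) {
            String term = (String) in.get("term");
            String stemmed = term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
            return Collections.<String, Object>singletonMap("term", stemmed);
        }
    }

    /** Stems only when the term is longer than 4 chars; otherwise the term is copied over. */
    static String demo(String term) {
        Stage branching = new BranchingStage(Arrays.asList(
                new Case(attrs -> ((String) attrs.get("term")).length() > 4, new ToyStemmer())));
        Map<String, Object> in = new HashMap<>();
        in.put("term", term);
        return (String) branching.process(in).get("term");
    }
}
```

Here demo("filters") returns "filter", while demo("gas") returns "gas" because
no condition applies and the term attr is simply copied over.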
> Add conditional branching/merging to Lucene's analysis pipeline
> ---------------------------------------------------------------
>
> Key: LUCENE-2470
> URL: https://issues.apache.org/jira/browse/LUCENE-2470
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Analysis
> Affects Versions: 4.0
> Reporter: Steven Rowe
> Priority: Minor
>
> Captured from a #lucene brainstorming session with Robert Muir:
> Lucene's analysis pipeline would be more flexible if it were possible to
> apply filter(s) to only part of an input stream's tokens, under
> user-specifiable conditions (e.g. when a given token attribute has a
> particular value) in a way that did not place that responsibility on
> individual filters.
> Two use cases:
> 1. StandardAnalyzer could handle ideographic characters directly, in the
>    same way as CJKTokenizer (which generates bigrams), if it could call
>    ShingleFilter only when TypeAttribute=<CJK>, or when Robert's new
>    ScriptAttribute=<Ideographic>.
> 2. Stemming might make sense for some stemmer/domain combinations only when
>    token length exceeds some threshold. For example, a user could configure
>    an analyzer to stem only when CharTermAttribute length is greater than 4
>    characters.
> One potential way to achieve this conditional branching facility is with a
> new kind of filter that can be configured with one or more following filters
> and condition(s) under which the filter should be engaged. This could be
> called BranchingFilter.
> I think a MergingFilter, the inverse of BranchingFilter, is necessary in the
> current pipeline architecture, to have a single pipeline endpoint. A
> MergingFilter might be useful in its own right, e.g. to collect document data
> from multiple sources. Perhaps a conditional merging facility would be
> useful as well.