Add conditional branching/merging to Lucene's analysis pipeline
---------------------------------------------------------------
Key: LUCENE-2470
URL: https://issues.apache.org/jira/browse/LUCENE-2470
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Affects Versions: 4.0
Reporter: Steven Rowe
Priority: Minor
Captured from a #lucene brainstorming session with Robert Muir:
Lucene's analysis pipeline would be more flexible if it were possible to apply
filter(s) to only part of an input stream's tokens, under user-specifiable
conditions (e.g. when a given token attribute has a particular value) in a way
that did not place that responsibility on individual filters.
Two use cases:
# StandardAnalyzer could directly handle ideographic characters in the same way
as CJKTokenizer, which generates bigrams, if it could call ShingleFilter only
when TypeAttribute=<CJK>, or when Robert's new ScriptAttribute=<Ideographic>.
# Stemming might make sense for some stemmer/domain combinations only when
token length exceeds some threshold. For example, a user could configure an
analyzer to stem only when CharTermAttribute length is greater than 4
characters.
One potential way to achieve this conditional branching facility is with a new
kind of filter that can be configured with one or more following filters and
condition(s) under which the filter should be engaged. This could be called
BranchingFilter.
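
To make the idea concrete, here is a rough, standalone sketch of the branching behaviour. It deliberately does not use Lucene's TokenStream/TokenFilter API: tokens are modelled as a plain list, and SketchToken, BranchingFilterSketch, branch, bigrams and toyStem are all hypothetical names invented for illustration. It also assumes that consecutive matching tokens should be handed to the branch as one run (so that a shingle-like filter can see neighbours) and that non-matching tokens pass through untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for a token carrying a term and a type attribute.
final class SketchToken {
    final String term;
    final String type;
    SketchToken(String term, String type) { this.term = term; this.type = type; }
}

final class BranchingFilterSketch {

    // Sends each run of consecutive tokens matching `condition` through
    // `subPipeline`; all other tokens are copied to the output unchanged.
    static List<SketchToken> branch(List<SketchToken> input,
                                    Predicate<SketchToken> condition,
                                    Function<List<SketchToken>, List<SketchToken>> subPipeline) {
        List<SketchToken> out = new ArrayList<>();
        List<SketchToken> run = new ArrayList<>();
        for (SketchToken t : input) {
            if (condition.test(t)) {
                run.add(t);
            } else {
                if (!run.isEmpty()) {
                    out.addAll(subPipeline.apply(run));
                    run = new ArrayList<>();
                }
                out.add(t);
            }
        }
        if (!run.isEmpty()) {
            out.addAll(subPipeline.apply(run));
        }
        return out;
    }

    // Toy bigram generator standing in for a shingle-style filter:
    // joins each adjacent pair in the run; a single token is kept as-is.
    static List<SketchToken> bigrams(List<SketchToken> run) {
        if (run.size() < 2) {
            return new ArrayList<>(run);
        }
        List<SketchToken> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new SketchToken(run.get(i).term + run.get(i + 1).term, run.get(i).type));
        }
        return out;
    }

    // Toy "stemmer" that only strips one trailing "s"; not a real stemming algorithm.
    static List<SketchToken> toyStem(List<SketchToken> run) {
        List<SketchToken> out = new ArrayList<>();
        for (SketchToken t : run) {
            String s = t.term;
            out.add(s.endsWith("s") ? new SketchToken(s.substring(0, s.length() - 1), t.type) : t);
        }
        return out;
    }

    // Helper for inspecting results: extracts just the term text.
    static List<String> terms(List<SketchToken> tokens) {
        List<String> out = new ArrayList<>();
        for (SketchToken t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Use case 1: bigrams only for tokens typed <CJK>
        // (three ideographs, written as Unicode escapes).
        List<SketchToken> mixed = Arrays.asList(
                new SketchToken("lucene", "<ALPHANUM>"),
                new SketchToken("\u4e2d", "<CJK>"),
                new SketchToken("\u6587", "<CJK>"),
                new SketchToken("\u5b57", "<CJK>"),
                new SketchToken("rocks", "<ALPHANUM>"));
        System.out.println(terms(branch(mixed, t -> t.type.equals("<CJK>"),
                BranchingFilterSketch::bigrams)));

        // Use case 2: stem only when the term is longer than 4 characters.
        List<SketchToken> words = Arrays.asList(
                new SketchToken("gas", "<ALPHANUM>"),
                new SketchToken("houses", "<ALPHANUM>"),
                new SketchToken("cats", "<ALPHANUM>"));
        System.out.println(terms(branch(words, t -> t.term.length() > 4,
                BranchingFilterSketch::toyStem)));  // prints [gas, house, cats]
    }
}
```

Note that in this model the condition lives entirely in the caller's predicate, so neither bigrams nor toyStem needs to know when it applies. A real implementation would have to work incrementally on a TokenStream and preserve offsets and position increments, which this sketch ignores.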
I think a MergingFilter, the inverse of BranchingFilter, is necessary in the
current pipeline architecture, to have a single pipeline endpoint. A
MergingFilter might be useful in its own right, e.g. to collect document data
from multiple sources. Perhaps a conditional merging facility would be useful
as well.
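
As a similarly rough illustration of the merging side, the sketch below combines two token sequences into one. Nothing here is Lucene API: OffsetToken, MergingFilterSketch and merge are hypothetical names, and ordering the merged output by start offset (with ties going to the first stream) is an assumption made for the example, not something this proposal specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a token with a term and a start offset.
final class OffsetToken {
    final String term;
    final int startOffset;
    OffsetToken(String term, int startOffset) {
        this.term = term;
        this.startOffset = startOffset;
    }
}

final class MergingFilterSketch {

    // Two-way merge of streams that are each already sorted by start offset.
    // On equal offsets the token from `first` is emitted before `second`.
    static List<OffsetToken> merge(List<OffsetToken> first, List<OffsetToken> second) {
        List<OffsetToken> out = new ArrayList<>(first.size() + second.size());
        int i = 0;
        int j = 0;
        while (i < first.size() && j < second.size()) {
            if (first.get(i).startOffset <= second.get(j).startOffset) {
                out.add(first.get(i++));
            } else {
                out.add(second.get(j++));
            }
        }
        while (i < first.size()) {
            out.add(first.get(i++));
        }
        while (j < second.size()) {
            out.add(second.get(j++));
        }
        return out;
    }

    // Helper for inspecting results: extracts just the term text.
    static List<String> terms(List<OffsetToken> tokens) {
        List<String> out = new ArrayList<>();
        for (OffsetToken t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<OffsetToken> a = Arrays.asList(new OffsetToken("quick", 0), new OffsetToken("fox", 12));
        List<OffsetToken> b = Arrays.asList(new OffsetToken("brown", 6), new OffsetToken("jumps", 16));
        System.out.println(terms(merge(a, b)));  // prints [quick, brown, fox, jumps]
    }
}
```

A conditional variant could take a predicate deciding which tokens from each input are admitted; how position increments should be reconciled across inputs is left open here.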