Mike Sokolov created LUCENE-8240:
------------------------------------
Summary: Support different analysis per field instance
Key: LUCENE-8240
URL: https://issues.apache.org/jira/browse/LUCENE-8240
Project: Lucene - Core
Issue Type: Wish
Components: modules/analysis
Reporter: Mike Sokolov
Attachments: SubFieldAnalyzer.java
The simplest change to support this would be to make TokenStreamComponents.setReader()
public. Another alternative would be to provide a SubFieldAnalyzer along the
lines of what is attached, although for reasons given below I think this
implementation is a little hacky and would ideally be supported in a different
way before making *that* part of a public Lucene API.
Exposing this method would allow a third-party extension to access it in order
to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer (attached, for
reference) that applies different analysis to different instances of a field.
This supports a big "catch-all" field whose instances each receive different
(index-time) text processing. We implement that by creating a
TokenStreamComponents that wraps separate per-subfield components and switches
among them when setReader() is called.
Why setReader()? This is the only part of the API where we can inject this
notion of subfields. setReader() is called with a Reader for each field
instance, and we supply a special Reader that identifies its subfield.
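As a rough illustration of the switching idea, here is a minimal, Lucene-free sketch. It is not the attached SubFieldAnalyzer.java: the names SubFieldReader, Components, WhitespaceComponents and SwitchingComponents are all hypothetical, and Components is only a stand-in for TokenStreamComponents so that the example runs on the JDK alone.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Map;

/** Hypothetical Reader that carries the name of the subfield it belongs to. */
class SubFieldReader extends FilterReader {
    private final String subField;

    SubFieldReader(String subField, Reader in) {
        super(in);
        this.subField = subField;
    }

    String subField() {
        return subField;
    }
}

/** Stand-in for TokenStreamComponents (assumption): takes a Reader, yields tokens. */
interface Components {
    void setReader(Reader reader);

    String[] tokens();
}

/** Toy analysis chain: splits on whitespace and optionally lowercases. */
class WhitespaceComponents implements Components {
    private final boolean lowercase;
    private Reader reader;

    WhitespaceComponents(boolean lowercase) {
        this.lowercase = lowercase;
    }

    @Override
    public void setReader(Reader reader) {
        this.reader = reader;
    }

    @Override
    public String[] tokens() {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = sb.toString().trim();
        if (lowercase) {
            text = text.toLowerCase(Locale.ROOT);
        }
        return text.isEmpty() ? new String[0] : text.split("\\s+");
    }
}

/** Wraps per-subfield components and switches among them in setReader(). */
class SwitchingComponents implements Components {
    private final Map<String, Components> bySubField;
    private final Components fallback;
    private Components current;

    SwitchingComponents(Map<String, Components> bySubField, Components fallback) {
        this.bySubField = bySubField;
        this.fallback = fallback;
    }

    @Override
    public void setReader(Reader reader) {
        // The Reader itself tells us which subfield this field instance belongs to.
        Components chosen = fallback;
        if (reader instanceof SubFieldReader) {
            String name = ((SubFieldReader) reader).subField();
            chosen = bySubField.getOrDefault(name, fallback);
        }
        current = chosen;
        current.setReader(reader);
    }

    @Override
    public String[] tokens() {
        if (current == null) {
            throw new IllegalStateException("setReader() has not been called");
        }
        return current.tokens();
    }
}
```

In this model, a plain Reader that carries no subfield falls back to a default chain; whether the real implementation should do the same is a design choice this sketch does not settle.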
This is a bit hacky; ideally subfields would be first-class citizens in the
Analyzer API, so that, e.g., there would be methods like
Analyzer.createComponents(String fieldName, String subFieldName), etc. However,
this seems like a pretty big change for an experimental feature, so it seems
like an OK tradeoff to live with the Reader-per-subfield hack for now.
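To make the shape of that first-class alternative concrete, here is a small sketch of a lookup keyed on both the field name and the subfield name. Only the createComponents(String fieldName, String subFieldName) signature comes from the text above. The SubFieldAwareRegistry class, its register() method and the default-factory fallback are hypothetical and exist in no Lucene release.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical registry in which the subfield is part of the lookup key. */
class SubFieldAwareRegistry<C> {
    private final Map<String, Supplier<C>> factories = new HashMap<>();
    private final Supplier<C> defaultFactory;

    SubFieldAwareRegistry(Supplier<C> defaultFactory) {
        this.defaultFactory = defaultFactory;
    }

    /** Registers the analysis chain factory for one (field, subfield) pair. */
    void register(String fieldName, String subFieldName, Supplier<C> factory) {
        factories.put(key(fieldName, subFieldName), factory);
    }

    /** Mirrors the shape of the proposed createComponents(fieldName, subFieldName). */
    C createComponents(String fieldName, String subFieldName) {
        return factories.getOrDefault(key(fieldName, subFieldName), defaultFactory).get();
    }

    // NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
    private static String key(String fieldName, String subFieldName) {
        return fieldName + '\u0000' + subFieldName;
    }
}
```

With an API of this shape the subfield would travel as an explicit argument, so nothing would need to be smuggled through the Reader.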
Currently, SubFieldAnalyzer has to live in the org.apache.lucene.analysis
package in order to call TokenStreamComponents.setReader (on a separate
instance) and satisfy Java's access rules, which is awkward.