Mike Sokolov created LUCENE-8240:
------------------------------------

             Summary: Support different analysis per field instance
                 Key: LUCENE-8240
                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
             Project: Lucene - Core
          Issue Type: Wish
          Components: modules/analysis
            Reporter: Mike Sokolov
         Attachments: SubFieldAnalyzer.java

The simplest change for this would be to make TokenStreamComponents.setReader() 
public. Another alternative would be to provide a SubFieldAnalyzer along the 
lines of what is attached, although for reasons given below I think this 
implementation is a little hacky and would ideally be supported in a different 
way before making *that* part of a public Lucene API.

Exposing this method would allow a third-party extension to access it in order 
to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer (attached, for 
reference) that applies different analysis to different instances of a field. 
This supports a big "catch-all" field whose instances each get different 
(index-time) text processing. We implement that by creating a 
TokenStreamComponents that wraps separate per-subfield components and switches 
among them when setReader() is called.

Why setReader()? This is the only part of the API where we can inject this 
notion of subfields. setReader() is called with a Reader for each field 
instance, and we supply a special Reader that identifies its subfield.
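For readers without the attachment, here is a rough sketch of the switching 
idea. It is not the attached SubFieldAnalyzer and it does not use Lucene 
classes: Components is a simplified stand-in for TokenStreamComponents, and 
SubFieldReader, SwitchingComponents and analyze() are hypothetical names made 
up for illustration only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical Reader that carries the name of the subfield it belongs to. */
final class SubFieldReader extends StringReader {
    final String subField;

    SubFieldReader(String subField, String text) {
        super(text);
        this.subField = subField;
    }
}

/** Simplified stand-in for TokenStreamComponents; only the setReader() hook is modeled. */
class Components {
    private final UnaryOperator<String> analysis;
    private Reader reader;

    Components(UnaryOperator<String> analysis) {
        this.analysis = analysis;
    }

    // Non-public, mirroring the access problem described in this issue.
    protected void setReader(Reader reader) {
        this.reader = reader;
    }

    /** Reads the whole input and applies this component's "analysis" to it. */
    String analyze() {
        StringBuilder sb = new StringBuilder();
        try {
            for (int c = reader.read(); c != -1; c = reader.read()) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return analysis.apply(sb.toString());
    }
}

/** Wraps per-subfield components and switches among them in setReader(). */
final class SwitchingComponents extends Components {
    private final Map<String, Components> bySubField;
    private final Components fallback;
    private Components current;

    SwitchingComponents(Map<String, Components> bySubField, Components fallback) {
        super(UnaryOperator.identity());
        this.bySubField = bySubField;
        this.fallback = fallback;
        this.current = fallback;
    }

    @Override
    protected void setReader(Reader reader) {
        // The Reader itself tells us which subfield this field instance is.
        current = fallback;
        if (reader instanceof SubFieldReader) {
            String name = ((SubFieldReader) reader).subField;
            current = bySubField.getOrDefault(name, fallback);
        }
        // Cross-instance call to a protected method: this compiles only
        // because both classes sit in the same package.
        current.setReader(reader);
    }

    @Override
    String analyze() {
        return current.analyze();
    }
}
```

With a lower-casing Components registered for a "title" subfield and an 
identity fallback, a SubFieldReader("title", ...) would be routed to the 
lower-casing delegate, while a plain Reader would fall through to the fallback.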

This is a bit hacky – ideally subfields would be first-class citizens in the 
Analyzer API, so that, e.g., there would be methods like 
Analyzer.createComponents(String fieldName, String subFieldName), etc. However, 
this seems like a pretty big change for an experimental feature, so it seems 
like an OK tradeoff to live with the Reader-per-subfield hack for now.

Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis 
package in order to call TokenStreamComponents.setReader() (on a separate 
instance) and satisfy Java's access-control rules, which is awkward.


