[ https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426969#comment-16426969 ]
Mike Sokolov commented on LUCENE-8240:
--------------------------------------
Yes, a custom SynonymFilter would do it, but the logic in SynonymGraphFilter
today is pretty complex, and to "wrap" it we would need to copy the class and
take over maintaining a fork, since it can't be subclassed. I try to avoid
forking code when I can: what if someone makes a nice enhancement to it in the
future?

Another thing I tried was a SwitchTokenFilter that uses a SubFieldAttribute to
indicate the sub-field and pulls from different upstream SynonymGraphFilters,
all sharing the same source TokenStream. I eventually got this working, but it
was tricky: e.g., you need to reset() all the components only once, and you
need to inject a subfield-switch token as a distinct meta-event in the token
stream, with no characters or position change, in order to give the
SwitchTokenFilter a chance to draw the *next* token from the correct stream.

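The switching idea above can be modeled roughly as follows. This is only a plain-Java sketch and does not use Lucene: Token, Stream, SharedSource, MapFilter and SwitchFilter are hypothetical stand-ins, and MapFilter does no lookahead or buffering, which is exactly the part that makes real SynonymGraphFilters harder to switch between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class SwitchTokenSketch {

    /** Either a word, or a switch marker with empty text (no characters, no position change). */
    public static final class Token {
        public final String text;
        public final String switchTo; // non-null only for a switch marker

        private Token(String text, String switchTo) { this.text = text; this.switchTo = switchTo; }
        public static Token word(String text) { return new Token(text, null); }
        public static Token marker(String subField) { return new Token("", subField); }
        public boolean isSwitch() { return switchTo != null; }
    }

    /** Minimal pull-style stream (hypothetical); next() returns null when exhausted. */
    public interface Stream {
        void reset();
        Token next();
    }

    /** The shared source. Repeated reset() calls are ignored, so it is rewound only once. */
    public static final class SharedSource implements Stream {
        private final List<Token> tokens;
        private int pos;
        private boolean resetDone;

        public SharedSource(List<Token> tokens) { this.tokens = tokens; }
        @Override public void reset() { if (!resetDone) { pos = 0; resetDone = true; } }
        @Override public Token next() { return pos < tokens.size() ? tokens.get(pos++) : null; }
    }

    /** Stand-in for one upstream filter: rewrites words, passes markers through untouched. */
    public static final class MapFilter implements Stream {
        private final Stream in;
        private final UnaryOperator<String> fn;

        public MapFilter(Stream in, UnaryOperator<String> fn) { this.in = in; this.fn = fn; }
        @Override public void reset() { in.reset(); }
        @Override public Token next() {
            Token t = in.next();
            return (t == null || t.isSwitch()) ? t : Token.word(fn.apply(t.text));
        }
    }

    /** Pulls from the upstream selected by the most recent switch marker. */
    public static final class SwitchFilter implements Stream {
        private final Map<String, Stream> upstreams;
        private Stream current;

        public SwitchFilter(Map<String, Stream> upstreams) { this.upstreams = upstreams; }

        @Override public void reset() {
            for (Stream s : upstreams.values()) s.reset(); // the source itself rewinds only once
            current = upstreams.values().iterator().next(); // any upstream can deliver the first marker
        }

        @Override public Token next() {
            while (true) {
                Token t = current.next();
                if (t == null || !t.isSwitch()) return t;
                Stream s = upstreams.get(t.switchTo);
                if (s == null) throw new IllegalStateException("unknown subfield: " + t.switchTo);
                current = s; // the marker is consumed here and never emitted downstream
            }
        }
    }

    /** Runs two subfields through two different chains over one shared source. */
    public static List<String> demo() {
        SharedSource source = new SharedSource(Arrays.asList(
            Token.marker("title"), Token.word("Big"), Token.word("Fish"),
            Token.marker("body"), Token.word("Big"), Token.word("Fish")));
        Map<String, Stream> upstreams = new LinkedHashMap<>();
        upstreams.put("title", new MapFilter(source, s -> s));
        upstreams.put("body", new MapFilter(source, s -> s.toLowerCase(Locale.ROOT)));
        SwitchFilter sw = new SwitchFilter(upstreams);
        sw.reset();
        List<String> out = new ArrayList<>();
        for (Token t = sw.next(); t != null; t = sw.next()) out.add(t.text);
        return out;
    }
}
```

In this model the marker travels through whichever upstream is currently active, so the switch takes effect before the next word is drawn. An upstream that reads ahead would already have consumed tokens meant for another subfield, which is one reason the real thing is trickier.
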
{quote}knowing the text content and the analyzer is not enough to know how a
field got analyzed
{quote}
I'm not sure I understand this concern. When do we need to know this? It is
true that in our case we *do* care, not so much about which analysis was used
as about which sub-field some tokens belong to, because we use that
information for scoring. So we store a positional mapping to enable that, but
it isn't necessary to support the analysis.

> Support different analysis per field instance
> ---------------------------------------------
>
> Key: LUCENE-8240
> URL: https://issues.apache.org/jira/browse/LUCENE-8240
> Project: Lucene - Core
> Issue Type: Wish
> Components: modules/analysis
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: SubFieldAnalyzer.java
>
>
> The simplest change for this would be to make
> TokenStreamComponents.setReader() public. Another alternative would be to
> provide a SubFieldAnalyzer along the lines of what is attached, although for
> reasons given below I think this implementation is a little hacky and would
> ideally be supported in a different way before making *that* part of a public
> Lucene API.
> Exposing this method would allow a third-party extension to access it in
> order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer
> (attached, for reference) that applies different analysis to different
> instances of a field. This supports a big "catch-all" field whose subfields
> each get different (index-time) text processing. The way we implement that is by
> creating a TokenStreamComponents that wraps separate per-subfield components
> and switches among them when setReader() is called.
> Why setReader()? This is the only part of the API where we can inject this
> notion of subfields. setReader() is called with a Reader for each field
> instance, and we supply a special Reader that identifies its subfield.
> This is a bit hacky; ideally subfields would be first-class citizens in the
> Analyzer API, so e.g. there would be methods like
> Analyzer.createComponents(String fieldName, String subFieldName), etc.
> However this seems like a pretty big change for an experimental feature, so
> it seems like an OK tradeoff to live with the Reader-per-subfield hack for
> now.
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis
> package in order to call TokenStreamComponents.setReader (on a separate
> instance) and propitiate Java's access rules, which is awkward.
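
The Reader-per-subfield approach described in the quoted issue can be sketched roughly as below. This is not the attached SubFieldAnalyzer and it does not touch Lucene classes: SubFieldReader, Chain and SwitchingComponents are hypothetical stand-ins that only show how a Reader subclass can carry the subfield name into a setReader()-style call, which then selects the analysis chain.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SubFieldReaderSketch {

    /** Hypothetical Reader that identifies the subfield its text belongs to. */
    public static final class SubFieldReader extends FilterReader {
        public final String subField;

        public SubFieldReader(String subField, Reader in) {
            super(in);
            this.subField = subField;
        }
    }

    /** Stand-in for one per-subfield analysis chain: text in, tokens out. */
    public interface Chain {
        List<String> analyze(String text);
    }

    /** Stand-in for the wrapping components: setReader() picks the chain to use. */
    public static final class SwitchingComponents {
        private final Map<String, Chain> chains;
        private final Chain fallback;
        private Chain current;
        private Reader reader;

        public SwitchingComponents(Map<String, Chain> chains, Chain fallback) {
            this.chains = chains;
            this.fallback = fallback;
            this.current = fallback;
        }

        /** Called once per field instance; the Reader's type tells us the subfield. */
        public void setReader(Reader reader) {
            this.reader = reader;
            if (reader instanceof SubFieldReader) {
                current = chains.getOrDefault(((SubFieldReader) reader).subField, fallback);
            } else {
                current = fallback; // plain Readers get the default chain
            }
        }

        /** Reads all text from the current Reader and runs the selected chain. */
        public List<String> tokens() {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            try {
                for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                    sb.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return current.analyze(sb.toString());
        }
    }

    /** Splits on whitespace, keeping case. */
    public static List<String> whitespace(String text) {
        String t = text.trim();
        return t.isEmpty() ? new ArrayList<String>() : Arrays.asList(t.split("\\s+"));
    }

    /** Lowercases, then splits on whitespace. */
    public static List<String> lowercased(String text) {
        return whitespace(text.toLowerCase(Locale.ROOT));
    }

    /** Example wiring: "title" keeps case, "body" is lowercased. */
    public static SwitchingComponents example() {
        Map<String, Chain> chains = new HashMap<>();
        chains.put("title", SubFieldReaderSketch::whitespace);
        chains.put("body", SubFieldReaderSketch::lowercased);
        return new SwitchingComponents(chains, SubFieldReaderSketch::whitespace);
    }

    /** Convenience: set the reader, then collect the tokens. */
    public static List<String> analyze(SwitchingComponents c, Reader r) {
        c.setReader(r);
        return c.tokens();
    }
}
```

With this wiring, analyzing a SubFieldReader tagged "body" goes through the lowercasing chain, while one tagged "title" keeps its case. The choice is made entirely inside setReader(), which mirrors why that method is the hook the issue asks to expose.
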