[ 
https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426854#comment-16426854
 ] 

Mike Sokolov edited comment on LUCENE-8240 at 4/5/18 12:48 PM:
---------------------------------------------------------------

Right, you can use this to distinguish index-time differences only. In our case
it's used to apply different synonyms to different subfields, and then we don't
apply synonyms at query time. There are other possible uses, I think; e.g., you
could apply language-specific tokenization at index time to different
subfields, but you might not know the intended language at query time, so you
would have to use a more general tokenizer there. I don't know how useful that
would be in practice (I haven't tried it); you'd be making different
assumptions about the user's text at query time than about your documents.
Maybe a better example would be WordDelimiter: you might want to apply token
splitting and recombination of number parts to a subfield that holds a part
number, but not to other subfields, and then at query time do only the
splitting. That setup is usually asymmetric already.
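
To make the part-number case concrete, here is a rough plain-Java sketch of
asymmetric analysis. It does not use Lucene or WordDelimiter itself; the class
PartNumberSketch and the methods splitOnly and splitAndCatenate are
hypothetical names invented for illustration, and the splitting rule (break on
any non-alphanumeric character) is an assumption, much simpler than what a
real filter does.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a toy model of split-plus-recombine at index time, split-only at query time. */
public class PartNumberSketch {

    /** Assumed query-time analysis: split on runs of non-alphanumeric characters. */
    static List<String> splitOnly(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Assumed index-time analysis for the part-number subfield: the parts plus their concatenation. */
    static List<String> splitAndCatenate(String text) {
        List<String> tokens = splitOnly(text);
        if (tokens.size() > 1) {
            tokens.add(String.join("", tokens));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitAndCatenate("1234-56")); // index-time tokens
        System.out.println(splitOnly("1234-56"));        // query typed with the separator
        System.out.println(splitOnly("123456"));         // query typed without it
    }
}
```

In this toy model, a query typed with or without the separator produces only
tokens that the index-time chain also emitted for "1234-56", which is the point
of recombining at index time and only splitting at query time.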



> Support different analysis per field instance
> ---------------------------------------------
>
>                 Key: LUCENE-8240
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: SubFieldAnalyzer.java
>
>
> The simplest change for this would be to make
> TokenStreamComponents.setReader() public. An alternative would be to
> provide a SubFieldAnalyzer along the lines of what is attached, although for
> reasons given below I think this implementation is a little hacky and would
> ideally be supported in a different way before making *that* part of a public
> Lucene API.
>
> Exposing this method would allow a third-party extension to call it in
> order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer
> (attached, for reference) that applies different analysis to different
> instances of a field. This supports a big "catch-all" field whose instances
> get different (index-time) text processing. We implement that by creating a
> TokenStreamComponents that wraps separate per-subfield components and
> switches among them when setReader() is called.
>
> Why setReader()? This is the only part of the API where we can inject this
> notion of subfields. setReader() is called with a Reader for each field
> instance, and we supply a special Reader that identifies its subfield.
>
> This is a bit hacky; ideally subfields would be first-class citizens in the
> Analyzer API, so e.g. there would be methods like
> Analyzer.createComponents(String fieldName, String subFieldName), etc.
> However, this seems like a pretty big change for an experimental feature, so
> it seems like an OK tradeoff to live with the Reader-per-subfield hack for
> now.
>
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis
> package in order to call TokenStreamComponents.setReader (on a separate
> instance) and satisfy Java's access rules, which is awkward.
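
The Reader-per-subfield mechanism described above can be modeled roughly as
follows. This is a self-contained plain-Java sketch, not the attached
SubFieldAnalyzer and not Lucene code: SubFieldSketch, SubFieldReader and
SwitchingComponents are hypothetical stand-ins, and each "analysis chain" is
reduced to a String-to-String function as a simplifying assumption. Only the
shape of the idea is meant to carry over: a Reader that names its subfield,
and a wrapper that picks the matching chain when setReader() is called.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: none of these types are Lucene classes. */
public class SubFieldSketch {

    /** Hypothetical Reader wrapper that carries the name of its subfield. */
    static final class SubFieldReader extends FilterReader {
        final String subField;

        SubFieldReader(String subField, Reader in) {
            super(in);
            this.subField = subField;
        }
    }

    /** Hypothetical stand-in for components that wrap one analysis chain per subfield. */
    static final class SwitchingComponents {
        private final Map<String, Function<String, String>> perSubField = new HashMap<>();
        private final Function<String, String> fallback;
        private Function<String, String> current;
        private Reader reader;

        SwitchingComponents(Function<String, String> fallback) {
            this.fallback = fallback;
            this.current = fallback;
        }

        void register(String subField, Function<String, String> analysis) {
            perSubField.put(subField, analysis);
        }

        /** Called once per field instance; the Reader itself says which chain to use. */
        void setReader(Reader reader) {
            this.reader = reader;
            if (reader instanceof SubFieldReader) {
                String name = ((SubFieldReader) reader).subField;
                current = perSubField.getOrDefault(name, fallback);
            } else {
                current = fallback;
            }
        }

        /** Drains the current Reader and applies the chain selected in setReader(). */
        String analyze() {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[256];
            try {
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return current.apply(text.toString());
        }
    }

    /** Convenience helper: hand the Reader to the components, then analyze it. */
    static String run(SwitchingComponents components, Reader reader) {
        components.setReader(reader);
        return components.analyze();
    }

    public static void main(String[] args) {
        SwitchingComponents components =
                new SwitchingComponents(s -> s.toLowerCase(Locale.ROOT));
        components.register("partNumber", s -> s.replace("-", ""));

        // Two instances of the same catch-all field, tagged with different subfields.
        System.out.println(run(components, new SubFieldReader("partNumber", new StringReader("AB-123"))));
        System.out.println(run(components, new SubFieldReader("title", new StringReader("Widget"))));
    }
}
```

The sketch also shows why the approach feels like a workaround: the subfield
name travels inside the Reader rather than through a method parameter, so a
caller that passes an ordinary Reader silently gets the fallback chain.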



