[jira] [Commented] (LUCENE-8240) Make TokenStreamComponents.setReader public

Mike Sokolov (JIRA) Fri, 06 Apr 2018 12:53:13 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428854#comment-16428854 ]


Mike Sokolov commented on LUCENE-8240:
--------------------------------------

Well, I don't have much more to say, but perhaps this background from our use 
case will sway you :) We did try breaking up our large catch-all field into 
separate fields, since that is more natural for Lucene than having these 
sub-fields. However, we have so many of them (100s) that our queries performed 
poorly because of the zillions of term queries we had to generate. In the end, 
smooshing all these little fields together into one big one, with this 
switchable analyzer, was the best tradeoff.

> Make TokenStreamComponents.setReader public
> -------------------------------------------
>
>                 Key: LUCENE-8240
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: SubFieldAnalyzer.java
>
>
> The simplest change for this would be to make 
> TokenStreamComponents.setReader() public. Another alternative would be to 
> provide a SubFieldAnalyzer along the lines of what is attached, although for 
> reasons given below I think this implementation is a little hacky and would 
> ideally be supported in a different way before making *that* part of a public 
> Lucene API.
> Exposing this method would allow a third-party extension to access it in 
> order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer 
> (attached, for reference) that applies different analysis to different 
> instances of a field. This supports a big "catch-all" field whose subfields 
> each get different (index-time) text processing. We implement that by 
> creating a TokenStreamComponents that wraps separate per-subfield components 
> and switches among them when setReader() is called.
> Why setReader()? This is the only part of the API where we can inject this 
> notion of subfields. setReader() is called with a Reader for each field 
> instance, and we supply a special Reader that identifies its subfield.
> This is a bit hacky; ideally subfields would be first-class citizens in the 
> Analyzer API, with methods such as 
> Analyzer.createComponents(String fieldName, String subFieldName). 
> However, this seems like a pretty big change for an experimental feature, so 
> it seems like an OK tradeoff to live with the Reader-per-subfield hack for 
> now.
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis 
> package in order to call TokenStreamComponents.setReader (on a separate 
> instance) and propitiate Java's access rules, which is awkward.
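
For readers without the attachment, the Reader-per-subfield switching described above can be sketched in plain Java. This is only an illustration of the pattern, not the attached SubFieldAnalyzer and not Lucene code: SubFieldReader, Components, WhitespaceComponents and SwitchingComponents are all hypothetical stand-ins invented here, and the whitespace "analysis" is a placeholder for real per-subfield analysis chains.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the "special Reader that identifies its subfield".
class SubFieldReader extends FilterReader {
    final String subField;

    SubFieldReader(String subField, Reader in) {
        super(in);
        this.subField = subField;
    }
}

// Hypothetical stand-in for per-subfield analysis components (not Lucene's class).
interface Components {
    void setReader(Reader reader);
    List<String> tokens();
}

// Placeholder analysis: split on whitespace, optionally lowercasing each token.
class WhitespaceComponents implements Components {
    private final boolean lowercase;
    private List<String> tokens = new ArrayList<>();

    WhitespaceComponents(boolean lowercase) {
        this.lowercase = lowercase;
    }

    @Override
    public void setReader(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        tokens = new ArrayList<>();
        String trimmed = text.toString().trim();
        if (trimmed.isEmpty()) {
            return;
        }
        for (String t : trimmed.split("\\s+")) {
            tokens.add(lowercase ? t.toLowerCase(Locale.ROOT) : t);
        }
    }

    @Override
    public List<String> tokens() {
        return tokens;
    }
}

// The wrapper: picks the per-subfield delegate each time setReader() is called.
class SwitchingComponents implements Components {
    private final Map<String, Components> bySubField;
    private Components current;

    SwitchingComponents(Map<String, Components> bySubField) {
        this.bySubField = new HashMap<>(bySubField);
    }

    @Override
    public void setReader(Reader reader) {
        if (!(reader instanceof SubFieldReader)) {
            throw new IllegalArgumentException("expected a SubFieldReader");
        }
        String name = ((SubFieldReader) reader).subField;
        Components delegate = bySubField.get(name);
        if (delegate == null) {
            throw new IllegalArgumentException("unknown subfield: " + name);
        }
        delegate.setReader(reader); // the call the wrapper needs access to
        current = delegate;
    }

    @Override
    public List<String> tokens() {
        return current == null ? new ArrayList<>() : current.tokens();
    }
}

class SubFieldDemo {
    public static void main(String[] args) {
        Map<String, Components> perSubField = new HashMap<>();
        perSubField.put("title", new WhitespaceComponents(true));
        perSubField.put("sku", new WhitespaceComponents(false));
        SwitchingComponents components = new SwitchingComponents(perSubField);

        // Two instances of the same catch-all field, tagged with different subfields.
        components.setReader(new SubFieldReader("title", new StringReader("Hello World")));
        System.out.println(components.tokens());
        components.setReader(new SubFieldReader("sku", new StringReader("AB-12 Cd")));
        System.out.println(components.tokens());
    }
}
```

In this sketch the wrapper has to call setReader() on a delegate instance, which is the step that runs into the visibility problem described in the issue when the delegate is a real TokenStreamComponents.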



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

