[ 
https://issues.apache.org/jira/browse/LUCENE-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052513#comment-14052513
 ] 

Uwe Schindler commented on LUCENE-5803:
---------------------------------------

Samuel García (https://twitter.com/samuelgmartinez) said on Twitter:
bq. @kimchy @thetaph1 we have a solr server cluster with 150 cores and about 20 
indexed fields. We lost 1.5gb due to these zz_buffer tlocal.

This patch will improve the situation, but not as much as in Elasticsearch. 
The difference is that Solr separates cores completely (even with different 
classloaders). Each core has its own schema with its own field types, and 
every field type has its own analyzer. Those are combined in a 
PerFieldAnalyzer-like wrapper. If Solr allowed "field types" to be defined 
globally (across cores), the analyzers could be shared. But with current Solr, 
each core gets its own zz_buffer tlocal. The improvement in Solr due to this 
patch is the following: In the past we had a separate threadlocal *per field 
name*, because the AnalyzerWrapper had a per-field reuse strategy. With this 
patch we now have a global reuse strategy *per* FieldType. So if you define a 
field type once and reuse it for 20 fields, you have only one cached 
TokenStream, not 20. This is because we now delegate to the underlying Analyzer 
(the one from the field type), which has GLOBAL_REUSE_STRATEGY.
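
To illustrate the "one cached TokenStream, not 20" point, here is a minimal 
sketch in plain Java. It is a toy model, not Lucene or Solr code: the class 
and method names (ReuseSketch, FieldTypeAnalyzer, perFieldNameEntries, 
perFieldTypeEntries) are made up for this example, and thread locals are left 
out so that only the number of cache entries held by one thread is counted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are part of Lucene or Solr.
final class ReuseSketch {

    // Hypothetical stand-in for the analyzer of one field type. It has a
    // single cache slot, which mimics a global reuse strategy.
    static final class FieldTypeAnalyzer {
        private Object cached;

        Object components() {
            if (cached == null) {
                cached = new Object(); // created once, then reused
            }
            return cached;
        }

        int cachedCount() {
            return cached == null ? 0 : 1;
        }
    }

    // Old behaviour: the wrapper keeps its own cache keyed by field name,
    // so every distinct field name adds one entry.
    static int perFieldNameEntries(List<String> fields) {
        Map<String, Object> wrapperCache = new HashMap<>();
        for (String field : fields) {
            wrapperCache.computeIfAbsent(field, k -> new Object());
        }
        return wrapperCache.size();
    }

    // New behaviour: the wrapper caches nothing and every field of the same
    // type ends up in the single slot of the field type's analyzer.
    static int perFieldTypeEntries(List<String> fields, FieldTypeAnalyzer type) {
        for (String field : fields) {
            type.components();
        }
        return type.cachedCount();
    }
}
```

With 20 field names that all share one field type, the first method reports 
20 entries and the second reports 1. In a real server these numbers would 
still be multiplied by the number of threads and cores.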

> Add another AnalyzerWrapper class that does not have its own cache, so 
> delegate-only wrappers don't create thread local resources several times
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5803
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5803
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5803.patch, LUCENE-5803.patch, LUCENE-5803.patch, 
> LUCENE-5803.patch, LUCENE-5803.patch
>
>
> This is a followup issue for the following Elasticsearch issue: 
> https://github.com/elasticsearch/elasticsearch/pull/6714
> Basically the problem is the following:
> - Elasticsearch has a pool of Analyzers that are used for analysis in several 
> indexes
> - Each index uses a different PerFieldAnalyzerWrapper
> PerFieldAnalyzerWrapper uses PER_FIELD_REUSE_STRATEGY. Because of this it 
> caches the tokenstreams for every field. If there are many fields, this adds 
> up to a lot. In addition, the underlying analyzers may also cache 
> tokenstreams, and other PerFieldAnalyzerWrappers do the same, although the 
> delegate Analyzer can always return the same components.
> We should add code similar to Elasticsearch's directly to Lucene: If the 
> delegating Analyzer just delegates per field or just wraps CharFilters around 
> the Reader, there is no need to cache the TokenStreamComponents a second time 
> in the delegating Analyzer. This is only needed if the delegating Analyzer 
> adds additional TokenFilters (like ShingleAnalyzerWrapper).
> We should name this new class DelegatingAnalyzerWrapper extends 
> AnalyzerWrapper. The wrapComponents method must be final, because we are not 
> allowed to add additional TokenFilters, but unlike ES, we don't need to 
> disallow wrapping with CharFilters.
> Internally this class uses a private ReuseStrategy that just delegates to the 
> underlying analyzer. It does not matter here whether the strategy of the 
> delegate is global or per field; this is private to the delegate.
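
The delegating reuse strategy described in the issue can be sketched roughly 
as follows. This is a toy model in plain Java, not the patch itself: 
SimpleAnalyzer and DelegatingWrapper are hypothetical names, the cache is a 
plain map instead of a thread local, and components are bare Objects. It only 
shows the idea that the wrapper stores nothing and routes every cache lookup 
and store to the analyzer it delegates to.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: names echo the issue text but this is not Lucene code.
final class DelegatingSketch {

    // Hypothetical analyzer that owns its cache (keyed per field here; a
    // single global slot would work the same way from the wrapper's view).
    static final class SimpleAnalyzer {
        private final Map<String, Object> cache = new HashMap<>();
        int created = 0; // how many components were actually built

        Object getReusable(String field) {
            return cache.get(field);
        }

        void setReusable(String field, Object components) {
            cache.put(field, components);
        }

        Object create(String field) {
            created++;
            return new Object();
        }
    }

    // Wrapper that picks a delegate per field and has no cache of its own.
    static final class DelegatingWrapper {
        private final Map<String, SimpleAnalyzer> perField;
        private final SimpleAnalyzer fallback;

        DelegatingWrapper(Map<String, SimpleAnalyzer> perField, SimpleAnalyzer fallback) {
            this.perField = perField;
            this.fallback = fallback;
        }

        SimpleAnalyzer wrapped(String field) {
            return perField.getOrDefault(field, fallback);
        }

        Object tokenStreamComponents(String field) {
            SimpleAnalyzer delegate = wrapped(field);
            // Look up in the delegate's cache, never in the wrapper.
            Object components = delegate.getReusable(field);
            if (components == null) {
                components = delegate.create(field);
                // Store in the delegate's cache, never in the wrapper.
                delegate.setReusable(field, components);
            }
            return components;
        }
    }
}
```

Two wrappers that share one SimpleAnalyzer get back the same components for 
the same field, and the analyzer builds them only once, which is the effect 
the issue is after for indexes that share a pool of analyzers.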



--
This message was sent by Atlassian JIRA
(v6.2#6252)

