RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Uwe Schindler Thu, 23 Jul 2015 23:56:18 -0700

In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of 
AnalyzerWrapper.


 

DelegatingAnalyzerWrapper has its own „ReuseStrategy“. The per-field one is 
passed here and is used as fallback only (for incompatible configurations, e.g. 
when one of the per-field configs wraps with a filter or charfilter, but this 
does not happen in Solr for fields). See the patch: 
https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch

 

It is very important that you use the PER_FIELD one as fallback strategy, 
because otherwise it would break stuff like AnalysisRequestHandler (because 
this one wraps). The Analyzer works per field, so any unknown delegate must be 
cached per field. 

 

The idea of LUCENE-5803 is to also delegate the “caching”. If the SolrAnalyzer 
does not wrap components, it can also delegate the caching. The wrapper’s reuse 
strategy is then unused. The delegate, FieldType’s Analyzer, uses 
TokenizerChain as Analyzer, which is GLOBAL_REUSE, so each FieldType caches 
globally, no matter how many field instances there are.
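
To make the difference between the two strategies concrete, here is a minimal 
sketch in plain Java. It is a toy model, not Lucene's code: the class and method 
names (ReuseModel, cachedComponents) are hypothetical, a HashMap stands in for 
the analyzer's per-thread storage, and it assumes all fields share one field 
type, as the ~20 text_general fields in the original report do.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, no Lucene classes involved.
final class ReuseModel {

    /** Counts the component objects one analyzer would cache across all threads. */
    static long cachedComponents(boolean perField, int threads, String[] fields) {
        long created = 0;
        for (int t = 0; t < threads; t++) {
            // One cache per thread, standing in for the analyzer's thread-local storage.
            Map<String, Object> threadCache = new HashMap<>();
            for (String field : fields) {
                // Per-field reuse keys on the field name; global reuse uses one shared key.
                String key = perField ? field : "";
                if (!threadCache.containsKey(key)) {
                    threadCache.put(key, new Object());
                    created++;
                }
            }
        }
        return created;
    }

    public static void main(String[] args) {
        // Numbers taken from the report: 64 cores, 700 threads, 20 fields of one type.
        String[] fields = new String[20];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = "text_" + i;
        }
        int cores = 64;
        int threads = 700;
        long perFieldTotal = cores * cachedComponents(true, threads, fields);
        long globalTotal = cores * cachedComponents(false, threads, fields);
        System.out.println("per-field: " + perFieldTotal + ", global: " + globalTotal);
    }
}
```

With those numbers the model yields 896000 cached objects for per-field reuse 
and 44800 for global reuse per field type, so under these assumptions the 
delegated caching cuts the count by the number of fields sharing the type.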

 

So all is fine, it is just 4.7 where this optimization is not used.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [email protected]

 

From: Shai Erera [mailto:[email protected]] 
Sent: Friday, July 24, 2015 8:39 AM
To: [email protected]
Subject: Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

 

Thanks Shalin, but I reviewed the code in trunk, and it still passes PER_FIELD. 
I can double check but I'm pretty sure that's what I saw. 

Shai

On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <[email protected]> wrote:

Uwe fixed this in 4.10 with LUCENE-5803. Now we use
GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
create field types per node instead of per core for more savings.

On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
> Hi
>
> I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> usage by IndexSchema. This Solr in particular has one collection with 64
> shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> ~20 of which share the same field type (text_general), and the deployment
> serves around 700 concurrent users (peak), with a thread pool limit of 1000.
>
> They've tried reducing the thread-pool size, but the load is high, and the
> server keeps up fine with both the load and a thread pool of that size.
>
> What surprised me is that they report obscene numbers they see in the heap:
> 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
> coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> thought that a TokenStreamComponents can be (and is) reused for all fields
> in a document. And so even if we hold a ThreadLocal per
> TokenStreamComponents, we should see 1000 of them at the most - per
> Analyzer. And as I said, the analyzed fields are of type text_general, and
> the rest of the fields are numeric, DV, String, Bool etc. (aka
> not-analyzed).
>
> Reviewing IndexSchema it holds two instances: SolrIndexAnalyzer (extends
> DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>
> 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
> could be that they served fewer than 700 users when the heap dump was taken).
>
> And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
> of heap space!
>
> Per Analyzer's constructor (which takes ReuseStrategy):
>
>   /**
>    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>    * <p>
>    * NOTE: if you just want to reuse on a per-field basis, it's easier to
>    * use a subclass of {@link AnalyzerWrapper} such as
>    * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>    * PerFieldAnalyerWrapper</a> instead.
>    */
>
> However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>
>   /**
>    * Creates a new AnalyzerWrapper with the given reuse strategy.
>    * <p>If you want to wrap a single delegate Analyzer you can probably
>    * reuse its strategy when instantiating this subclass:
>    * {@code super(delegate.getReuseStrategy());}.
>    * <p>If you choose different analyzers per field, use
>    * {@link #PER_FIELD_REUSE_STRATEGY}.
>    * @see #getReuseStrategy()
>    */
>
> Maybe it is correct for AW, but not for DelegatingAW?
>
> From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> since SolrIndexAnalyzer returns different Analyzers for different fields
> (per their field-type). But all fields that share the same Analyzer instance
> should be safe reusing its TokenStreamComponents, since we never process
> fields in parallel?
>
> To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
> for different fields), but it's the only piece of the puzzle that confuses
> me, since I trust whoever wrote this class to understand this stuff better
> than I do ...
>
> What do you think?
>
> Shai



--
Regards,
Shalin Shekhar Mangar.

