RE: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera Fri, 24 Jul 2015 00:56:38 -0700

Ok thanks Uwe, this makes sense now!

Shai
On Jul 24, 2015 9:55 AM, "Uwe Schindler" <[email protected]> wrote:


> In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of
> AnalyzerWrapper.
>
>
>
> DelegatingAnalyzerWrapper has its own „ReuseStrategy“. The per-field one is
> passed here and is used as a fallback only (for incompatible configurations,
> e.g. when one of the per-field configs wraps with a filter or charfilter –
> but this does not happen in Solr for fields). See the patch:
> https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch
>
>
>
> It is very important that you use the PER_FIELD one as fallback strategy,
> because otherwise it would break stuff like AnalysisRequestHandler (because
> this one wraps). The Analyzer works per field, so any unknown delegate must
> be cached per field.
>
>
>
> The idea of LUCENE-5803 is to also delegate the “caching”. If the
> SolrAnalyzer does not wrap components, it can also delegate the caching.
> The wrapper’s reuse strategy is then unused. The delegate, FieldType’s
> Analyzer, uses TokenizerChain as Analyzer, which is GLOBAL_REUSE, so each
> FieldType caches globally, no matter how many field instances use it.
>
>
>
> So all is fine, it is just 4.7 where this optimization is not used.
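
The difference between caching in the wrapper and delegating the caching can be illustrated with a small toy model. This is only a sketch: ToyAnalyzer, PerFieldCachingWrapper and DelegatingWrapper are hypothetical names invented for illustration, not the Lucene or Solr classes, and the model is single-threaded, whereas the real cache is kept per thread.

```java
import java.util.HashMap;
import java.util.Map;

final class ReuseModel {

    /** Hypothetical stand-in for a field type's analyzer with one reusable slot (global reuse). */
    static final class ToyAnalyzer {
        int created = 0;        // number of component objects built so far
        private Object cached;  // the single reusable slot

        Object createComponents() {
            created++;
            return new Object();
        }

        Object reusableComponents() {
            if (cached == null) {
                cached = createComponents();
            }
            return cached;
        }
    }

    /** Wrapper that caches by field name and ignores the delegate's own cache. */
    static final class PerFieldCachingWrapper {
        private final Map<String, ToyAnalyzer> byField;
        private final Map<String, Object> cache = new HashMap<>();

        PerFieldCachingWrapper(Map<String, ToyAnalyzer> byField) {
            this.byField = byField;
        }

        Object componentsFor(String field) {
            return cache.computeIfAbsent(field, f -> byField.get(f).createComponents());
        }
    }

    /** Wrapper that leaves the caching to the delegate analyzer. */
    static final class DelegatingWrapper {
        private final Map<String, ToyAnalyzer> byField;

        DelegatingWrapper(Map<String, ToyAnalyzer> byField) {
            this.byField = byField;
        }

        Object componentsFor(String field) {
            return byField.get(field).reusableComponents();
        }
    }

    /** Maps fieldCount fields to one shared analyzer and counts the components it builds. */
    static int countCreated(boolean delegateCaching, int fieldCount) {
        ToyAnalyzer textGeneral = new ToyAnalyzer();
        Map<String, ToyAnalyzer> byField = new HashMap<>();
        for (int i = 0; i < fieldCount; i++) {
            byField.put("field" + i, textGeneral);
        }
        PerFieldCachingWrapper perField = new PerFieldCachingWrapper(byField);
        DelegatingWrapper delegating = new DelegatingWrapper(byField);
        // Two passes over all fields, so that reuse has a chance to kick in.
        for (int pass = 0; pass < 2; pass++) {
            for (String field : byField.keySet()) {
                if (delegateCaching) {
                    delegating.componentsFor(field);
                } else {
                    perField.componentsFor(field);
                }
            }
        }
        return textGeneral.created;
    }

    public static void main(String[] args) {
        System.out.println("per-field caching: " + countCreated(false, 20)); // 20
        System.out.println("delegated caching: " + countCreated(true, 20));  // 1
    }
}
```

In this model, 20 fields sharing one analyzer cost 20 component objects when the wrapper caches per field, and one when the caching is delegated. Multiplied by threads and cores, that is roughly the gap discussed in this thread.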
>
>
>
> Uwe
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: [email protected]
>
>
>
> *From:* Shai Erera [mailto:[email protected]]
> *Sent:* Friday, July 24, 2015 8:39 AM
> *To:* [email protected]
> *Subject:* Re: Why do SolrIndex/QueryAnalyzers use
> PER_FIELD_REUSE_STRATEGY
>
>
>
> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
>
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <[email protected]>
> wrote:
>
> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
> create field types per node instead of per core for more savings.
>
> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
> > Hi
> >
> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> > usage by IndexSchema. This Solr in particular has one collection with 64
> > shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> > ~20 of them of the same field type (text_general), and the deployment is
> > serving around 700 concurrent users (peak), with a thread pool limit of 1000.
> >
> > They have tried reducing the thread-pool size, but the load is high, and
> > the server keeps up fine with both the load and a thread pool of that size.
> >
> > What surprised me is the obscene numbers they report seeing in the heap:
> > 680K (!!) objects of TokenStreamComponents, each holding a buffer of 8KB
> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> > thought that a TokenStreamComponents can be (and is) reused for all fields
> > in a document. And so even if we hold a ThreadLocal per
> > TokenStreamComponents, we should see 1000 of them at the most - per
> > Analyzer. And as I said, the analyzed fields are of type text_general, and
> > the rest of the fields are numeric, DV, String, Bool etc. (aka
> > not-analyzed).
> >
> > Reviewing IndexSchema, it holds two instances: SolrIndexAnalyzer (extends
> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
> >
> > 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
> > could be they served fewer than 700 users when the heap dump was taken).
> >
> > And if each such instance holds a zzBuffer of size 8KB, this amounts to
> > more than 7GB of heap space!
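
The estimate above can be checked with a few lines of Java. This is a back-of-the-envelope sketch only: the class and method names are made up for illustration, the 8KB figure is taken as 8192 bytes, and the "one cached instance per field type" case assumes the ~20 analyzed fields share a single field type, as described in this message.

```java
final class HeapEstimate {

    /** Upper bound on cached TokenStreamComponents: one per core, per thread, per cache key. */
    static long components(int cores, int threads, int cacheKeys) {
        return (long) cores * threads * cacheKeys;
    }

    /** Heap taken by the tokenizer buffers alone, in bytes. */
    static long bufferBytes(long components, int bytesPerBuffer) {
        return components * bytesPerBuffer;
    }

    public static void main(String[] args) {
        int bufferSize = 8 * 1024; // assumed: 8KB zzBuffer per instance

        // Caching per field name: 20 analyzed fields means 20 cache keys.
        long perField = components(64, 700, 20);
        // Caching per field type: the same 20 fields share one cache key.
        long perType = components(64, 700, 1);

        System.out.println("per field: " + perField + " instances, "
                + bufferBytes(perField, bufferSize) + " bytes");
        System.out.println("per type:  " + perType + " instances, "
                + bufferBytes(perType, bufferSize) + " bytes");
    }
}
```

With these inputs the per-field case comes to 896,000 instances and about 7.3GB of buffers, while one cached instance per field type would need 44,800 instances and about 367MB.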
> >
> > Per Analyzer's constructor (which takes ReuseStrategy):
> >
> >   /**
> >    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
> >    * <p>
> >    * NOTE: if you just want to reuse on a per-field basis, it's easier to
> >    * use a subclass of {@link AnalyzerWrapper} such as
> >    * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
> >    * PerFieldAnalyerWrapper</a> instead.
> >    */
> >
> > However, AnalyzerWrapper's documentation somewhat contradicts it (I
> > think):
> >
> >   /**
> >    * Creates a new AnalyzerWrapper with the given reuse strategy.
> >    * <p>If you want to wrap a single delegate Analyzer you can probably
> >    * reuse its strategy when instantiating this subclass:
> >    * {@code super(delegate.getReuseStrategy());}.
> >    * <p>If you choose different analyzers per field, use
> >    * {@link #PER_FIELD_REUSE_STRATEGY}.
> >    * @see #getReuseStrategy()
> >    */
> >
> > Maybe it is correct for AW, but not for DelegatingAW?
> >
> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> > since SolrIndexAnalyzer returns different Analyzers for different fields
> > (per their field-type). But all fields that share the same Analyzer
> > instance should be safe reusing its TokenStreamComponents, since we never
> > process fields in parallel?
> >
> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
> > instances for different fields), but it's the only piece of the puzzle
> > that confuses me, since I trust whoever wrote this class to understand
> > this stuff better than I do ...
> >
> > What do you think?
> >
> > Shai
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>
