Uwe fixed this in 4.10 with LUCENE-5803. Now we use
GLOBAL_REUSE_STRATEGY on a per-field-type basis. One of my TODOs is to
create field types per node instead of per core, for more savings.
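
Roughly, the shape after that change looks like the sketch below. This is
from memory, not the actual IndexSchema code, and the field-type maps are
made up for illustration. The important part is that the schema analyzer is
a DelegatingAnalyzerWrapper, so reuse is handled by the wrapped
per-field-type analyzers (which use GLOBAL_REUSE_STRATEGY by default), and
the number of cached TokenStreamComponents is bounded by threads x field
types instead of threads x fields:

  import java.util.Map;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;

  class PerFieldTypeAnalyzer extends DelegatingAnalyzerWrapper {
    private final Map<String, Analyzer> analyzerByFieldType; // e.g. "text_general" -> one shared Analyzer
    private final Map<String, String> fieldTypeByFieldName;  // field name -> field type name
    private final Analyzer defaultAnalyzer;

    PerFieldTypeAnalyzer(Map<String, Analyzer> analyzerByFieldType,
                         Map<String, String> fieldTypeByFieldName,
                         Analyzer defaultAnalyzer) {
      super(PER_FIELD_REUSE_STRATEGY); // only a fallback; reuse is delegated to the wrapped analyzers
      this.analyzerByFieldType = analyzerByFieldType;
      this.fieldTypeByFieldName = fieldTypeByFieldName;
      this.defaultAnalyzer = defaultAnalyzer;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
      // All fields of the same type share one Analyzer instance, and that
      // instance caches one TokenStreamComponents per thread.
      Analyzer a = analyzerByFieldType.get(fieldTypeByFieldName.get(fieldName));
      return a != null ? a : defaultAnalyzer;
    }
  }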

On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
> Hi
>
> I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> usage by IndexSchema. This Solr in particular has one collection with 64
> shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> ~20 of which are of the same field type (text_general), and the node is
> serving around 700 concurrent users (peak) with a thread pool limit of 1000.
>
> Reducing the thread-pool size is something they've tried, but the load is
> high and the server keeps up fine with that load and with a thread pool of
> that size.
>
> What surprised me is the obscene numbers they report seeing in the heap:
> 680K (!!) TokenStreamComponents objects, each holding an 8KB buffer coming
> from StandardTokenizerImpl.zzBuffer. That surprised me because I thought a
> TokenStreamComponents can be (and is) reused for all the fields in a
> document. So even if we hold one TokenStreamComponents per thread (via a
> ThreadLocal), we should see 1000 of them at most - per Analyzer. And as I
> said, the analyzed fields are of type text_general, and the rest of the
> fields are numeric, DV, String, Bool etc. (i.e. not analyzed).
>
> Reviewing IndexSchema, it holds two analyzer instances: SolrIndexAnalyzer
> (extends DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>
> 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
> could be they served fewer than 700 users when the heap dump was taken).
>
> And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
> of heap space!
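>
> From what I remember of the implementation (so take the exact code with a
> grain of salt), the per-field strategy stores roughly a per-thread map keyed
> by field name. That is exactly why the count multiplies by the number of
> analyzed fields - every (core, thread, field) combination ends up with its
> own TokenStreamComponents, and therefore its own zzBuffer. An illustrative
> sketch, not the real Analyzer.PER_FIELD_REUSE_STRATEGY code:
>
>   import java.util.HashMap;
>   import java.util.Map;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
>
>   final class PerFieldLikeStrategy extends Analyzer.ReuseStrategy {
>     @Override
>     @SuppressWarnings("unchecked")
>     public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
>       // getStoredValue() is backed by a ThreadLocal, so this map is per thread (and per Analyzer).
>       Map<String, TokenStreamComponents> perField =
>           (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
>       return perField == null ? null : perField.get(fieldName);
>     }
>
>     @Override
>     @SuppressWarnings("unchecked")
>     public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
>       Map<String, TokenStreamComponents> perField =
>           (Map<String, TokenStreamComponents>) getStoredValue(analyzer);
>       if (perField == null) {
>         perField = new HashMap<>();
>         setStoredValue(analyzer, perField);
>       }
>       // One entry per field name, per thread, per Analyzer instance.
>       perField.put(fieldName, components);
>     }
>   }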
>
> Per Analyzer's constructor (which takes ReuseStrategy):
>
>   /**
>    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>    * <p>
>    * NOTE: if you just want to reuse on a per-field basis, it's easier to
>    * use a subclass of {@link AnalyzerWrapper} such as
>    * <a
> href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>    * PerFieldAnalyzerWrapper</a> instead.
>    */
>
> However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>
>   /**
>    * Creates a new AnalyzerWrapper with the given reuse strategy.
>    * <p>If you want to wrap a single delegate Analyzer you can probably
>    * reuse its strategy when instantiating this subclass:
>    * {@code super(delegate.getReuseStrategy());}.
>    * <p>If you choose different analyzers per field, use
>    * {@link #PER_FIELD_REUSE_STRATEGY}.
>    * @see #getReuseStrategy()
>    */
>
> Maybe it is correct for AW, but not for DelegatingAW?
>
> From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY,
> since SolrIndexAnalyzer returns different Analyzers for different fields
> (per their field type). And all fields that share the same Analyzer instance
> should be safe reusing its TokenStreamComponents, since a single thread
> never processes fields in parallel?
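>
> For comparison, as far as I can tell the global strategy keeps just one
> TokenStreamComponents per thread per Analyzer instance - roughly like the
> sketch below (again illustrative, not the real Analyzer.GLOBAL_REUSE_STRATEGY
> code). With one Analyzer per field type, that would bound the cache at
> threads x field types:
>
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
>
>   final class GlobalLikeStrategy extends Analyzer.ReuseStrategy {
>     @Override
>     public TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
>       // The field name is ignored: one cached instance per thread, per Analyzer.
>       return (TokenStreamComponents) getStoredValue(analyzer);
>     }
>
>     @Override
>     public void setReusableComponents(Analyzer analyzer, String fieldName, TokenStreamComponents components) {
>       setStoredValue(analyzer, components);
>     }
>   }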
>
> To that end, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
> for different fields), but it's the only piece of the puzzle that confuses
> me, since I trust whoever wrote this class to understand this stuff better
> than I do ...
>
> What do you think?
>
> Shai



-- 
Regards,
Shalin Shekhar Mangar.
