Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera Thu, 23 Jul 2015 14:56:00 -0700

Hi

I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
usage by IndexSchema. This Solr in particular has one collection with 64
shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
~20 of which share the same field type (text_general), and it serves around
700 concurrent users (peak), with a thread pool limit of 1000.


They have tried reducing the thread-pool size, but the load is high, and the
server keeps up fine with both the load and a thread pool of that size.

What surprised me is that they report obscene numbers they see in the heap:
680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
thought that a TokenStreamComponents can be (and is) reused for all fields
in a document. And so even if we hold a ThreadLocal per
TokenStreamComponents, we should see 1000 of them at the most - per
Analyzer. And as I said, the analyzed fields are of type text_general, and
the rest of the fields are numeric, DV, String, Bool etc. (aka
not-analyzed).

Reviewing IndexSchema, I see it holds two analyzer instances: SolrIndexAnalyzer
(extends DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets the ReuseStrategy to
PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:

64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it could
be that they served fewer than 700 users when the heap dump was taken).

And if each such instance holds a zzBuffer of size 8KB, this amounts to
>7GB of heap space!
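
As a rough check on these figures, here is a small Java sketch that multiplies them out. The class and method names are made up for illustration, and it assumes exactly one cached TokenStreamComponents per (core, thread, analyzed field) and one 8KB buffer per instance, which is only an approximation of what the heap dump shows.

```java
// Back-of-the-envelope check of the heap numbers quoted above.
// HeapEstimate is a hypothetical helper, not part of Solr or Lucene.
final class HeapEstimate {
    // Assumes one cached TokenStreamComponents per (core, thread, analyzed field).
    static long components(long cores, long threads, long fields) {
        return cores * threads * fields;
    }

    // Total bytes if every cached instance holds one buffer of bufferBytes.
    static long heapBytes(long componentCount, long bufferBytes) {
        return componentCount * bufferBytes;
    }

    public static void main(String[] args) {
        long bufferBytes = 8L * 1024;               // 8KB zzBuffer, as reported
        long upperBound = components(64, 700, 20);  // per-field reuse, 700 busy threads
        long observed = 680_000L;                   // count seen in the heap dump

        System.out.println(String.format("upper bound: %d instances, %.2f GB",
                upperBound, heapBytes(upperBound, bufferBytes) / 1e9));
        System.out.println(String.format("observed:    %d instances, %.2f GB",
                observed, heapBytes(observed, bufferBytes) / 1e9));
    }
}
```

The observed count of 680K instances corresponds to roughly 5.5GB of buffers, which lines up with the heap usage attributed to IndexSchema.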

Per Analyzer's constructor (which takes ReuseStrategy):

  /**
   * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
   * <p>
   * NOTE: if you just want to reuse on a per-field basis, it's easier to
   * use a subclass of {@link AnalyzerWrapper} such as
   * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
   * PerFieldAnalyerWrapper</a> instead.
   */

However, AnalyzerWrapper's documentation somewhat contradicts it (I think):

  /**
   * Creates a new AnalyzerWrapper with the given reuse strategy.
   * <p>If you want to wrap a single delegate Analyzer you can probably
   * reuse its strategy when instantiating this subclass:
   * {@code super(delegate.getReuseStrategy());}.
   * <p>If you choose different analyzers per field, use
   * {@link #PER_FIELD_REUSE_STRATEGY}.
   * @see #getReuseStrategy()
   */

Maybe it is correct for AW, but not for DelegatingAW?

From what I understand, we should be OK setting GLOBAL_REUSE_STRATEGY,
since SolrIndexAnalyzer returns different Analyzers for different fields
(per their field-type). But all fields that share the same Analyzer
instance should be safe reusing its TokenStreamComponents, since we never
process fields in parallel?
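
To make the difference concrete, here is a simplified model of one thread's cache under the two approaches. This is not Lucene code: ReuseModel, its methods, and the text_N field names are hypothetical, and a plain Object stands in for TokenStreamComponents. It only illustrates the counting argument, assuming fields are analyzed one at a time per thread.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of how many component objects a single thread ends up caching.
// All names here are illustrative; nothing below calls into Lucene or Solr.
final class ReuseModel {
    // Hypothetical field names: text_0 .. text_(n-1).
    static String[] fieldNames(int n) {
        String[] names = new String[n];
        for (int i = 0; i < n; i++) {
            names[i] = "text_" + i;
        }
        return names;
    }

    // Per-field reuse: the thread's cache is keyed by field name,
    // so every distinct field gets its own cached entry.
    static int cachedPerField(String[] fields) {
        Map<String, Object> cache = new HashMap<>();
        for (String field : fields) {
            cache.computeIfAbsent(field, k -> new Object());
        }
        return cache.size();
    }

    // Reuse keyed by the analyzer serving the field: fields that share
    // an analyzer also share one cached entry.
    static int cachedPerAnalyzer(String[] fields, Map<String, String> fieldToAnalyzer) {
        Map<String, Object> cache = new HashMap<>();
        for (String field : fields) {
            cache.computeIfAbsent(fieldToAnalyzer.get(field), k -> new Object());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        String[] fields = fieldNames(20);
        Map<String, String> fieldToAnalyzer = new HashMap<>();
        for (String field : fields) {
            fieldToAnalyzer.put(field, "text_general"); // all 20 share one field type
        }
        System.out.println("per field:    " + cachedPerField(fields));
        System.out.println("per analyzer: " + cachedPerAnalyzer(fields, fieldToAnalyzer));
    }
}
```

In this model, 20 fields of one type cost 20 cached entries per thread when keyed by field name, and a single entry when keyed by the shared analyzer.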

By the same logic, I also feel like PerFieldAnalyzerWrapper shouldn't pass
PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
for different fields), but it's the only piece of the puzzle that confuses
me, since I trust whoever wrote this class to understand this stuff better
than I do ...

What do you think?

Shai
