Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY

Shai Erera Thu, 23 Jul 2015 23:44:01 -0700

One more thing I want to double check: after I sent the email I thought
about it some more, and I think PER_FIELD is passed only as a fallback
strategy, and otherwise the wrapper uses the wrapped analyzer's strategy.
Maybe in 4.7, before Uwe fixed things, some Analyzers still returned
PER_FIELD, but now they don't anymore. I will double check that too.


Shai
On Jul 24, 2015 9:39 AM, "Shai Erera" <[email protected]> wrote:

> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <[email protected]>
> wrote:
>
>> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
>> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
>> create field types per node instead of per core for more savings.
>>
>> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
>> > Hi
>> >
>> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
>> > usage by IndexSchema. This Solr in particular has one collection with 64
>> > shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
>> > ~20 of which are of the same field type (text_general). The deployment
>> > serves around 700 concurrent users (peak), with a thread pool limit of
>> > 1000.
>> >
>> > Reducing the thread-pool size is something they've tried, but the load is
>> > high, and the server keeps up fine with that load at the current
>> > thread-pool size.
>> >
>> > What surprised me is that they report obscene numbers they see in the heap:
>> > 680K (!!) objects of TokenStreamComponents, each holds a buffer of 8KB
>> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
>> > thought that a TokenStreamComponents can be (and is) reused for all fields
>> > in a document. And so even if we hold a ThreadLocal per
>> > TokenStreamComponents, we should see 1000 of them at the most - per
>> > Analyzer. And as I said, the analyzed fields are of type text_general, and
>> > the rest of the fields are numeric, DV, String, Bool etc. (aka
>> > not-analyzed).
>> >
>> > Reviewing IndexSchema, I see it holds two instances: SolrIndexAnalyzer
>> > (extends DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
>> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets the ReuseStrategy
>> > to PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the
>> > heap:
>> >
>> > 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
>> > could be they served fewer than 700 users when the heap dump was taken).
>> >
>> > And if each such instance holds a zzBuffer of size 8KB, this amounts
>> > to >7GB of heap space!
>> >
>> > Per Analyzer's constructor (which takes ReuseStrategy):
>> >
>> >   /**
>> >    * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>> >    * <p>
>> >    * NOTE: if you just want to reuse on a per-field basis, it's easier to
>> >    * use a subclass of {@link AnalyzerWrapper} such as
>> >    * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>> >    * PerFieldAnalyzerWrapper</a> instead.
>> >    */
>> >
>> > However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>> >
>> >   /**
>> >    * Creates a new AnalyzerWrapper with the given reuse strategy.
>> >    * <p>If you want to wrap a single delegate Analyzer you can probably
>> >    * reuse its strategy when instantiating this subclass:
>> >    * {@code super(delegate.getReuseStrategy());}.
>> >    * <p>If you choose different analyzers per field, use
>> >    * {@link #PER_FIELD_REUSE_STRATEGY}.
>> >    * @see #getReuseStrategy()
>> >    */
>> >
>> > Maybe it is correct for AW, but not for DelegatingAW?
>> >
>> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
>> > since SolrIndexAnalyzer returns different Analyzers for different fields
>> > (per their field-type). But all fields that share the same Analyzer
>> > instance should be safe reusing its TokenStreamComponents, since we never
>> > process fields in parallel?
>> >
>> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
>> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
>> > instances for different fields), but it's the only piece of the puzzle
>> > that confuses me, since I trust whoever wrote this class to understand
>> > this stuff better than I do ...
>> >
>> > What do you think?
>> >
>> > Shai
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>

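For anyone following the arithmetic in the thread, here is a minimal sketch of
the estimate in Java. It is a toy model and not Lucene code: the class
ReuseStrategyEstimate, the Components stand-in and all method names are
hypothetical. It assumes each cached components instance holds one 8KB buffer
(the figure quoted above), that per-field reuse keeps one instance per field
name per thread, and that global reuse keeps one instance per thread for the
single shared analyzer.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; none of these types are Lucene classes. */
public class ReuseStrategyEstimate {

    /** 8KB buffer per components instance, the figure quoted in the thread. */
    public static final long BUFFER_BYTES = 8L * 1024;

    /** Hypothetical stand-in for one cached TokenStreamComponents. */
    static final class Components {}

    /** Per-field reuse: one thread caches one instance per field name. */
    public static int perFieldInstances(String[] fields) {
        Map<String, Components> cache = new HashMap<>();
        for (String field : fields) {
            cache.computeIfAbsent(field, name -> new Components());
        }
        return cache.size();
    }

    /** Global reuse: one thread keeps a single instance for all fields. */
    public static int globalInstances(String[] fields) {
        Components shared = null;
        int created = 0;
        for (String field : fields) {
            if (shared == null) {       // only the first field creates one
                shared = new Components();
                created++;
            }
        }
        return created;
    }

    /** Instances across all cores and threads. */
    public static long totalInstances(int cores, int threads, int perThread) {
        return (long) cores * threads * perThread;
    }

    /** Heap held by the buffers alone, in bytes. */
    public static long heapBytes(long instances) {
        return instances * BUFFER_BYTES;
    }

    public static void main(String[] args) {
        // 20 analyzed fields sharing one field type, as described in the thread.
        String[] fields = new String[20];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = "text_field_" + i;  // hypothetical field names
        }
        int cores = 64;
        int threads = 700;

        long perField = totalInstances(cores, threads, perFieldInstances(fields));
        long global = totalInstances(cores, threads, globalInstances(fields));

        System.out.printf("per-field: %d instances, %.2f GB%n",
                perField, heapBytes(perField) / 1e9);
        System.out.printf("global:    %d instances, %.2f GB%n",
                global, heapBytes(global) / 1e9);
    }
}
```

Under these assumptions the per-field count is 20 times the global one, which
is the gap the thread is asking about. Real heap usage would also depend on
how many threads have actually analyzed each field and on object overhead
beyond the buffer, neither of which this sketch models.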