Ok thanks Uwe, this makes sense now!

Shai

On Jul 24, 2015 9:55 AM, "Uwe Schindler" <[email protected]> wrote:
> In 4.10, SolrIndexAnalyzer extends DelegatingAnalyzerWrapper instead of
> AnalyzerWrapper.
>
> DelegatingAnalyzerWrapper has its own "ReuseStrategy". The PER_FIELD one
> passed here is used as fallback only (for incompatible configurations, e.g.
> when one of the per-field configs wraps with a filter or charfilter, but
> this does not happen in Solr for fields). See the patch:
> https://issues.apache.org/jira/secure/attachment/12654117/LUCENE-5803.patch
>
> It is very important that you use the PER_FIELD one as fallback strategy,
> because otherwise it would break stuff like AnalysisRequestHandler (because
> this one wraps). The Analyzer works per field, so any unknown delegate must
> be cached per field.
>
> The idea of LUCENE-5803 is to also delegate the "caching". If the
> SolrAnalyzer does not wrap components, it can also delegate the caching.
> The wrapper's reuse strategy is then unused. The delegate, FieldType's
> Analyzer, uses TokenizerChain as Analyzer, which is GLOBAL_REUSE, so each
> FieldType caches globally, no matter how many field instances.
>
> So all is fine, it is just 4.7 where this optimization is not used.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> *From:* Shai Erera [mailto:[email protected]]
> *Sent:* Friday, July 24, 2015 8:39 AM
> *To:* [email protected]
> *Subject:* Re: Why do SolrIndex/QueryAnalyzers use PER_FIELD_REUSE_STRATEGY
>
> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
>
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" <[email protected]> wrote:
>
> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
> create field types per node instead of per core for more savings.
>
> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
> > Hi
> >
> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> > usage by IndexSchema. This Solr in particular has one collection with 64
> > shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> > ~20 of them are of the same field type (text_general), and it is serving
> > around 700 concurrent users (peak), with a thread pool limit of 1000.
> >
> > Reducing the thread-pool size is something they've tried, but the load is
> > high and the server keeps up fine with the load and a thread pool that
> > size.
> >
> > What surprised me is the obscene numbers they report seeing in the heap:
> > 680K (!!) objects of TokenStreamComponents, each holding a buffer of 8KB
> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> > thought that a TokenStreamComponents can be (and is) reused for all fields
> > in a document. And so even if we hold a ThreadLocal per
> > TokenStreamComponents, we should see 1000 of them at the most - per
> > Analyzer. And as I said, the analyzed fields are of type text_general, and
> > the rest of the fields are numeric, DV, String, Bool etc. (aka
> > not-analyzed).
> >
> > Reviewing IndexSchema, it holds two instances: SolrIndexAnalyzer (extends
> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> > SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
> >
> > 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
> > could be they served fewer than 700 users when the heap dump was taken).
> >
> > And if each such instance holds a zzBuffer of size 8KB, this amounts to
> > about 7GB of heap space!
> >
> > Per Analyzer's constructor (which takes ReuseStrategy):
> >
> > /**
> >  * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
> >  * <p>
> >  * NOTE: if you just want to reuse on a per-field basis, it's easier to
> >  * use a subclass of {@link AnalyzerWrapper} such as
> >  * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
> >  * PerFieldAnalyzerWrapper</a> instead.
> >  */
> >
> > However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
> >
> > /**
> >  * Creates a new AnalyzerWrapper with the given reuse strategy.
> >  * <p>If you want to wrap a single delegate Analyzer you can probably
> >  * reuse its strategy when instantiating this subclass:
> >  * {@code super(delegate.getReuseStrategy());}.
> >  * <p>If you choose different analyzers per field, use
> >  * {@link #PER_FIELD_REUSE_STRATEGY}.
> >  * @see #getReuseStrategy()
> >  */
> >
> > Maybe it is correct for AW, but not for DelegatingAW?
> >
> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> > since SolrIndexAnalyzer returns different Analyzers for different fields
> > (per their field-type). But all fields that share the same Analyzer
> > instance should be safe reusing its TokenStreamComponents, since we never
> > process fields in parallel?
> >
> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer
> > instances for different fields), but it's the only piece of the puzzle
> > that confuses me, since I trust whoever wrote this class to understand
> > this stuff better than I do ...
> >
> > What do you think?
> >
> > Shai
>
> --
> Regards,
> Shalin Shekhar Mangar.
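
For anyone finding this thread later, here is a rough sketch of the arithmetic discussed above. It is only a back-of-the-envelope model, not Lucene or Solr code: the class and method names are made up for illustration. It assumes one cached TokenStreamComponents per (core, thread, field) under per-field reuse, versus one per (core, thread, field type) when the field type's analyzer reuses globally, each holding a single 8KB buffer. It also ignores that the index and query analyzers are separate instances.

```java
// Back-of-the-envelope model only; none of these names exist in Lucene or Solr.
final class ReuseStrategySketch {

    // Buffer size quoted in the thread for StandardTokenizerImpl.zzBuffer.
    static final long BUFFER_BYTES = 8L * 1024L;

    /** Per-field reuse: one cached components object per (core, thread, field). */
    static long perFieldComponents(int cores, int threads, int analyzedFields) {
        return (long) cores * threads * analyzedFields;
    }

    /** Global reuse on the field type's analyzer: one per (core, thread, field type). */
    static long perTypeComponents(int cores, int threads, int fieldTypes) {
        return (long) cores * threads * fieldTypes;
    }

    /** Heap held by the tokenizer buffers alone, in bytes. */
    static long bufferBytes(long components) {
        return components * BUFFER_BYTES;
    }

    public static void main(String[] args) {
        // Numbers from the original report: 64 cores, 700 busy threads,
        // 20 analyzed fields that all share one field type (text_general).
        int cores = 64, threads = 700, textFields = 20, textTypes = 1;

        long perField = perFieldComponents(cores, threads, textFields);
        long perType = perTypeComponents(cores, threads, textTypes);

        System.out.println("per-field: " + perField + " components, "
                + bufferBytes(perField) + " bytes of buffers");
        System.out.println("per-type:  " + perType + " components, "
                + bufferBytes(perType) + " bytes of buffers");
    }
}
```

Under these assumptions, the reported numbers give 896,000 components (roughly 7.3 GB of buffers) with per-field reuse and 44,800 components (roughly 367 MB) with per-type reuse, a 20x difference that comes purely from the number of fields sharing text_general.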
