Uwe fixed this in 4.10 with LUCENE-5803. We now use GLOBAL_REUSE_STRATEGY on a per-field-type basis. One of my to-dos is to create field types per node instead of per core, which should save even more.
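
To put rough numbers on the difference, here is a back-of-the-envelope sketch of the arithmetic in your mail. It does not call any Lucene or Solr API; the class and method names are made up for illustration. It assumes each cached TokenStreamComponents pins one 8 KB zzBuffer, that one instance is cached per core, per thread, and per reuse key, and that all 20 text_general fields share a single analyzer once reuse is keyed by field type.

```java
// Illustrative model only: names are hypothetical, no Lucene API is used.
final class ReuseEstimate {
    // Assumption: each cached TokenStreamComponents holds one 8 KB zzBuffer.
    static final long BUFFER_BYTES = 8L * 1024;

    // Number of cached instances: one per core, per thread, per reuse key.
    static long components(long cores, long threads, long reuseKeys) {
        return cores * threads * reuseKeys;
    }

    // Heap pinned by the tokenizer buffers of that many instances.
    static long heapBytes(long components) {
        return components * BUFFER_BYTES;
    }

    public static void main(String[] args) {
        long cores = 64;
        long threads = 700;

        // Reuse keyed by field name: 20 analyzed text_general fields.
        long perField = components(cores, threads, 20);
        // Reuse keyed by field type: assumed to be 1 shared analyzer.
        long perType = components(cores, threads, 1);

        System.out.printf("per field: %d instances, %.2f GB%n",
                perField, heapBytes(perField) / 1e9);
        System.out.printf("per type:  %d instances, %.2f GB%n",
                perType, heapBytes(perType) / 1e9);
    }
}
```

Under those assumptions the per-field case comes to 896,000 instances and the per-type case to 44,800, so the buffers shrink by the number of fields sharing a type. Index and query analyzers are separate, so the real per-type figure may be closer to double that.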

On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera <[email protected]> wrote:
> Hi
>
> I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
> usage by IndexSchema. This Solr in particular has one collection with 64
> shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
> ~20 of which are of the same field type (text_general). The deployment
> serves around 700 concurrent users (peak), with a thread pool limit of 1000.
>
> Reducing the thread-pool size is something they've tried, but the load is
> high and the server keeps up fine with the load, and a thread pool that
> size.
>
> What surprised me is that they report obscene numbers they see in the heap:
> 680K (!!) objects of TokenStreamComponents, each holding a buffer of 8KB
> coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
> thought that a TokenStreamComponents can be (and is) reused for all fields
> in a document. And so even if we hold a ThreadLocal per
> TokenStreamComponents, we should see 1000 of them at the most - per
> Analyzer. And as I said, the analyzed fields are of type text_general, and
> the rest of the fields are numeric, DV, String, Bool etc. (aka
> not-analyzed).
>
> Reviewing IndexSchema, it holds two instances: SolrIndexAnalyzer (extends
> DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
> SolrIndexAnalyzer). SolrIndexAnalyzer's constructor sets ReuseStrategy ==
> PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>
> 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but they
> may have been serving fewer than 700 users when the heap dump was taken).
>
> And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
> of heap space!
>
> Per Analyzer's constructor (which takes ReuseStrategy):
>
> /**
>  * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>  * <p>
>  * NOTE: if you just want to reuse on a per-field basis, it's easier to
>  * use a subclass of {@link AnalyzerWrapper} such as
>  * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>  * PerFieldAnalyzerWrapper</a> instead.
>  */
>
> However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>
> /**
>  * Creates a new AnalyzerWrapper with the given reuse strategy.
>  * <p>If you want to wrap a single delegate Analyzer you can probably
>  * reuse its strategy when instantiating this subclass:
>  * {@code super(delegate.getReuseStrategy());}.
>  * <p>If you choose different analyzers per field, use
>  * {@link #PER_FIELD_REUSE_STRATEGY}.
>  * @see #getReuseStrategy()
>  */
>
> Maybe it is correct for AW, but not for DelegatingAW?
>
> From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
> since SolrIndexAnalyzer returns different Analyzers for different fields
> (per their field-type). But all fields that share the same Analyzer instance
> should be safe reusing its TokenStreamComponents, since we never process
> fields in parallel?
>
> To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
> PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
> for different fields), but it's the only piece of the puzzle that confuses
> me, since I trust whoever wrote this class to understand this stuff better
> than I do ...
>
> What do you think?
>
> Shai

--
Regards,
Shalin Shekhar Mangar.
