One more thing I want to double check: after I sent the email I thought about it some more, and I think PER_FIELD is passed as a fallback strategy, but otherwise the wrapper uses the wrapped analyzer's strategy. Maybe in 4.7, before Uwe fixed things, some Analyzers still returned PER_FIELD, but now they don't anymore. I will double check that too.
Shai

On Jul 24, 2015 9:39 AM, "Shai Erera" wrote:

> Thanks Shalin, but I reviewed the code in trunk, and it still passes
> PER_FIELD. I can double check but I'm pretty sure that's what I saw.
>
> Shai
>
> On Jul 24, 2015 7:59 AM, "Shalin Shekhar Mangar" wrote:
>
>> Uwe fixed this in 4.10 with LUCENE-5803. Now we use
>> GLOBAL_REUSE_STRATEGY on a per-field type basis. One of my todos is to
>> create field types per node instead of per core for more savings.
>>
>> On Fri, Jul 24, 2015 at 3:24 AM, Shai Erera wrote:
>> > Hi
>> >
>> > I am helping to debug a Solr (4.7) deployment which shows >5.5GB of heap
>> > usage by IndexSchema. This Solr in particular has one collection with 64
>> > shards (2 replicas, but 64 cores on one node). The schema has ~120 fields,
>> > ~20 of which are of the same field type (text_general), and the deployment
>> > serves around 700 concurrent users (peak), with a thread pool limit of 1000.
>> >
>> > Reducing the thread-pool size is something they've tried, but the load is
>> > high, and the server keeps up fine with that load and a thread pool of
>> > that size.
>> >
>> > What surprised me is that they report obscene numbers they see in the heap:
>> > 680K (!!) objects of TokenStreamComponents, each holding a buffer of 8KB
>> > coming from StandardTokenizerImpl.zzBuffer. That surprised me because I
>> > thought that a TokenStreamComponents can be (and is) reused for all fields
>> > in a document. And so even if we hold a ThreadLocal per
>> > TokenStreamComponents, we should see 1000 of them at the most - per
>> > Analyzer. And as I said, the analyzed fields are of type text_general, and
>> > the rest of the fields are numeric, DV, String, Bool etc. (aka
>> > not-analyzed).
>> >
>> > Reviewing IndexSchema, it holds two instances: SolrIndexAnalyzer (extends
>> > DelegatingAnalyzerWrapper) and SolrQueryAnalyzer (extends
>> > SolrIndexAnalyzer).
>> > SolrIndexAnalyzer's constructor sets ReuseStrategy ==
>> > PER_FIELD_REUSE_STRATEGY. This might explain the 680K objects in the heap:
>> >
>> > 64 (cores) x 700 (threads) x 20 (fields) = 896K (more than 680K, but it
>> > could be they served fewer than 700 users when the heap dump was taken).
>> >
>> > And if each such instance holds a zzBuffer of size 8KB, this amounts to >7GB
>> > of heap space!
>> >
>> > Per Analyzer's constructor (which takes ReuseStrategy):
>> >
>> > /**
>> >  * Expert: create a new Analyzer with a custom {@link ReuseStrategy}.
>> >  * <p>
>> >  * NOTE: if you just want to reuse on a per-field basis, it's easier to
>> >  * use a subclass of {@link AnalyzerWrapper} such as
>> >  * <a href="{@docRoot}/../analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html">
>> >  * PerFieldAnalyerWrapper</a> instead.
>> >  */
>> >
>> > However, AnalyzerWrapper's documentation somewhat contradicts it (I think):
>> >
>> > /**
>> >  * Creates a new AnalyzerWrapper with the given reuse strategy.
>> >  * <p>If you want to wrap a single delegate Analyzer you can probably
>> >  * reuse its strategy when instantiating this subclass:
>> >  * {@code super(delegate.getReuseStrategy());}.
>> >  * <p>If you choose different analyzers per field, use
>> >  * {@link #PER_FIELD_REUSE_STRATEGY}.
>> >  * @see #getReuseStrategy()
>> >  */
>> >
>> > Maybe it is correct for AW, but not for DelegatingAW?
>> >
>> > From what I understand, we should be OK setting a GLOBAL_REUSE_STRATEGY
>> > since SolrIndexAnalyzer returns different Analyzers for different fields
>> > (per their field-type). But all fields that share the same Analyzer instance
>> > should be safe reusing its TokenStreamComponents, since we never process
>> > fields in parallel?
>> >
>> > To that extent, I also feel like PerFieldAnalyzerWrapper shouldn't pass
>> > PER_FIELD_REUSE_STRATEGY (since it too returns different Analyzer instances
>> > for different fields), but it's the only piece of the puzzle that confuses
>> > me, since I trust whoever wrote this class to understand this stuff better
>> > than I do ...
>> >
>> > What do you think?
>> >
>> > Shai
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
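
For readers following the numbers in this thread, here is a rough toy model of the difference between per-field and global reuse. It is not Lucene code: the class and method names (ReuseStrategyModel, ThreadCache, simulate, heapBytes) are hypothetical, and it assumes that all 20 analyzed fields share one analyzer (text_general), that each thread keeps its own cache per core, and that each cached component holds an 8KB buffer as reported in the heap dump.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: this is NOT Lucene code; every name here is hypothetical. */
public class ReuseStrategyModel {

    /** Stand-in for a cached TokenStreamComponents that owns a tokenizer buffer. */
    static final class Components {
        final long bufferBytes;

        Components(long bufferBytes) {
            this.bufferBytes = bufferBytes;
        }
    }

    /** One thread's cache for one core: keyed per field, or by a single shared key. */
    static final class ThreadCache {
        private final boolean perField;
        private final Map<String, Components> cache = new HashMap<>();

        ThreadCache(boolean perField) {
            this.perField = perField;
        }

        Components get(String field, long bufferBytes) {
            // Per-field reuse keeps one entry per field name; global reuse keeps one entry in total.
            String key = perField ? field : "";
            return cache.computeIfAbsent(key, k -> new Components(bufferBytes));
        }

        int size() {
            return cache.size();
        }
    }

    /** Counts cached components after every thread on every core has analyzed every field once. */
    public static long simulate(int cores, int threads, int fields, boolean perField) {
        long total = 0;
        for (int c = 0; c < cores; c++) {
            for (int t = 0; t < threads; t++) {
                ThreadCache threadCache = new ThreadCache(perField);
                for (int f = 0; f < fields; f++) {
                    threadCache.get("field" + f, 8 * 1024);
                }
                total += threadCache.size();
            }
        }
        return total;
    }

    /** Heap held by the buffers alone, ignoring all other per-object overhead. */
    public static long heapBytes(long instances, long bufferBytes) {
        return instances * bufferBytes;
    }

    public static void main(String[] args) {
        long perField = simulate(64, 700, 20, true);
        long global = simulate(64, 700, 20, false);
        System.out.println("per-field: " + perField + " components, "
                + heapBytes(perField, 8 * 1024) + " bytes");
        System.out.println("global:    " + global + " components, "
                + heapBytes(global, 8 * 1024) + " bytes");
    }
}
```

Under these assumptions the per-field case reproduces the 64 x 700 x 20 upper bound discussed above, while global reuse keeps one component per thread per core (64 x 700 = 44,800), which is a factor of 20 fewer buffers.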
