John: you may benefit from more eagerly merging small segments on commit. At Salesforce we have a *ton* of indexes, and we cut the segment count to half of what the default produces. Having a large number of fields made this an especially worthwhile trade-off. You might look at this recent issue: https://issues.apache.org/jira/browse/LUCENE-8962. It isn't released yet, but in it I show (with PRs to code) how to accomplish this without hacking on Lucene itself. You may also find this conference presentation I gave with my colleagues interesting; it touches on this: https://youtu.be/hqeYAnsxPH8?t=855
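
In case it helps to see the shape of it, here is a minimal sketch of the kind of merge policy that work enables: at commit time it proposes one merge of all segments below a size threshold, so a commit doesn't leave many tiny segments behind. The findFullFlushMerges hook, the class name, and the threshold are illustrative (taken from the in-progress PRs), not released API, so treat this as a sketch rather than something to copy verbatim.

import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: wraps an existing merge policy and, on a full flush
// (e.g. commit), proposes a single merge of all "small" segments. The
// findFullFlushMerges hook mirrors the LUCENE-8962 PRs and may not match
// the released API exactly.
public class MergeSmallOnCommitPolicy extends FilterMergePolicy {
  private final long maxSmallSegmentBytes;

  public MergeSmallOnCommitPolicy(MergePolicy in, long maxSmallSegmentBytes) {
    super(in);
    this.maxSmallSegmentBytes = maxSmallSegmentBytes;
  }

  @Override
  public MergeSpecification findFullFlushMerges(MergeTrigger trigger,
                                                SegmentInfos infos,
                                                MergeContext ctx) throws IOException {
    List<SegmentCommitInfo> small = new ArrayList<>();
    for (SegmentCommitInfo sci : infos) {
      // Skip segments already being merged; collect the ones under the threshold.
      if (!ctx.getMergingSegments().contains(sci) && size(sci, ctx) <= maxSmallSegmentBytes) {
        small.add(sci);
      }
    }
    if (small.size() < 2) {
      return null; // nothing worth merging at commit time
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(small));
    return spec;
  }
}

You would wrap your usual TieredMergePolicy with something like this via IndexWriterConfig.setMergePolicy; see the issue and PRs for the real wiring.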
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, May 27, 2020 at 5:21 PM John Wang <john.w...@gmail.com> wrote:

> Thanks Adrien!
>
> It is surprising to learn that this is an invalid use case and that Lucene
> is planning to get rid of memory accounting...
>
> In our test there are indeed many fields: 1000 numeric doc values fields
> and 5 million docs in 1 segment. (We will have many segments in our
> production use case.)
>
> We found the reported memory usage when accounting for the elements in the
> maps vs. the default behavior is 363456 vs. 59216, almost a 600% difference.
>
> We have deployments with far more than 1000 fields, so I don't think that
> is extreme.
>
> Our use case:
>
> We will have many segments/readers, and we found opening them at query
> time is expensive, so we are caching them.
>
> Since we don't know the data ahead of time, we are using the reader's
> accounted memory as the cache size.
>
> We found the reader's accounting is unreliable, dug into it, and found
> this.
>
> If we should not be using this, what would be the correct way to handle
> this?
>
> Thank you
>
> -John
>
>
> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> A couple major versions ago, Lucene required tons of heap memory to keep
>> a reader open, e.g. norms were on heap and so on. To my knowledge, the only
>> thing that is now kept in memory and is a function of maxDoc is live docs;
>> all other codec components require very little memory. I'm actually
>> wondering whether we should remove memory accounting on readers. When
>> Lucene used tons of memory we could focus on the main contributors to
>> memory usage and be mostly correct. But now, given how little memory Lucene
>> uses, it's quite hard to figure out what the main contributing factors to
>> memory usage are. And it's probably not that useful either: why is it
>> important to know how much memory something is using if it's not much?
>>
>> So I'd be curious to know more about your use case for reader caching.
>> Would we break your use case if we removed memory accounting on readers?
>> Given the lines that you are pointing out, I believe you have either many
>> fields or many segments if these maps are using lots of memory?
>>
>>
>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have a reader cache that depends on the memory usage of each reader.
>>> We found the calculation of reader size for doc values to be undercounting.
>>>
>>> See this line:
>>>
>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>
>>> It looks like the memory estimate only uses the shallow size of the
>>> class and does not include the objects stored in the maps:
>>>
>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>
>>> We made a local patch and saw a significant difference in reported size.
>>>
>>> Please let us know if this is the right thing to do; we are happy to
>>> contribute our patch.
>>>
>>> Thanks
>>>
>>> -John
>>>
>>
>>
>> --
>> Adrien
>>
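
P.S. for John: if you keep weighing your reader cache by accounted size, here is a minimal sketch (purely illustrative, not a recommendation) of summing the per-segment accounting. Note it only reflects what each codec producer reports, so it will still undercount in exactly the places you found.

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.Accountable;

public final class ReaderSizing {
  private ReaderSizing() {}

  // Sum ramBytesUsed() over the segment readers backing this reader.
  // Shallow-only accounting in a producer (e.g. the maps in
  // Lucene80DocValuesProducer) will still be missing from this total.
  public static long estimateHeapBytes(DirectoryReader reader) {
    long total = 0;
    for (LeafReaderContext leaf : reader.leaves()) {
      if (leaf.reader() instanceof Accountable) {
        total += ((Accountable) leaf.reader()).ramBytesUsed();
      }
    }
    return total;
  }
}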