Thanks, Adrien! It is surprising to learn that this is an invalid use case and that Lucene is considering getting rid of memory accounting...
In our test, there are indeed many fields: 1000 numeric doc values fields and 5 million docs in 1 segment. (We will have many segments in our production use case.) Accounting for the elements in the maps reported 363456 bytes versus 59216 bytes with the default behavior, roughly a 6x difference. We have deployments with far more than 1000 fields, so I don't think that is extreme.

Our use case: we will have many segments/readers, and we found that opening them at query time is expensive, so we are caching them. Since we don't know the data ahead of time, we use the reader's accounted memory as the cache size. We found the reader's accounting to be unreliable, dug into it, and found this.

If we should not be using this, what would be the correct way to handle it?

Thank you

-John

On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:

> A couple major versions ago, Lucene required tons of heap memory to keep a
> reader open, e.g. norms were on heap and so on. To my knowledge, the only
> thing that is now kept in memory and is a function of maxDoc is live docs;
> all other codec components require very little memory. I'm actually
> wondering whether we should remove memory accounting on readers. When
> Lucene used tons of memory, we could focus on the main contributors to
> memory usage and be mostly correct. But now, given how little memory Lucene
> is using, it's quite hard to figure out what the main contributing factors
> to memory usage are. And it's probably not that useful either: why is it
> important to know how much memory something is using if it's not much?
>
> So I'd be curious to know more about your use case for reader caching.
> Would we break your use case if we removed memory accounting on readers?
> Given the lines that you are pointing out, I believe that you either have
> many fields or many segments if these maps are using lots of memory?
>
>
> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>
>> Hello,
>>
>> We have a reader cache that depends on the memory usage of each reader.
>> We found the calculation of reader size for doc values to be undercounting.
>>
>> See this line:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>
>> It looks like the memory estimate only uses the shallow size of the
>> class and does not include the objects stored in the maps:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>
>> We made a local patch and saw a significant difference in reported size.
>>
>> Please let us know if this is the right thing to do; we are happy to
>> contribute our patch.
>>
>> Thanks
>>
>> -John
>>
>
>
> --
> Adrien
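
P.S. To make the idea concrete, here is a rough, self-contained sketch of the kind of
per-map accounting we have in mind. It is simplified and illustrative rather than our
actual patch: the entry class, its fields, and the single map are stand-ins for the
per-type maps the real Lucene80DocValuesProducer keeps, and it only relies on
RamUsageEstimator helpers we know are available.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.RamUsageEstimator;

/** Simplified stand-in for a doc values producer; names and fields are illustrative. */
final class ProducerAccountingSketch implements Accountable {

  /** Per-field metadata kept on heap; real entries hold offsets, tables, etc. */
  static final class NumericEntry implements Accountable {
    long[] table;      // optional decode table, may be null
    long dataOffset;
    long dataLength;

    @Override
    public long ramBytesUsed() {
      return RamUsageEstimator.shallowSizeOf(this)
          + (table == null ? 0L : RamUsageEstimator.sizeOf(table));
    }
  }

  private static final long BASE_RAM_BYTES_USED =
      RamUsageEstimator.shallowSizeOfInstance(ProducerAccountingSketch.class);

  // The real producer keeps one such map per doc values type; one is enough here.
  private final Map<String, NumericEntry> numerics = new HashMap<>();

  @Override
  public long ramBytesUsed() {
    // Shallow size of the producer itself; this is all that is reported today.
    long bytes = BASE_RAM_BYTES_USED;
    // Plus the map object and every key/entry it holds, which the shallow estimate
    // misses and which grows with the number of doc values fields.
    bytes += RamUsageEstimator.shallowSizeOf(numerics);
    for (Map.Entry<String, NumericEntry> e : numerics.entrySet()) {
      bytes += RamUsageEstimator.sizeOf(e.getKey());  // field name
      bytes += e.getValue().ramBytesUsed();
    }
    return bytes;
  }
}

If that is preferred, I believe recent RamUsageEstimator versions also have a sizeOfMap
helper that would account for the map internals too and could replace the manual loop.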