Thank you! It sounds like it is a bad idea to rely on Accountable; the best path forward is for us to rethink how we manage our cache.

-John
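For illustration, a minimal sketch of the fixed-capacity caching Adrien suggests downthread, in place of RAM-based weighing. The class and its names are illustrative, not Lucene API, and a production cache would also need reference counting (e.g. IndexReader.incRef/decRef) so an evicted reader is not closed while a search is still using it:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;

// Illustrative only: cap the cache by reader count instead of estimated RAM.
public class FixedSizeReaderCache {
  private final int maxReaders;
  private final Map<String, DirectoryReader> cache;

  public FixedSizeReaderCache(int maxReaders) {
    this.maxReaders = maxReaders;
    // An access-ordered LinkedHashMap gives simple LRU eviction.
    this.cache = new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
        if (size() <= FixedSizeReaderCache.this.maxReaders) {
          return false;
        }
        try {
          eldest.getValue().close(); // release the evicted reader
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        return true;
      }
    };
  }

  public synchronized DirectoryReader get(String key) {
    return cache.get(key);
  }

  public synchronized void put(String key, DirectoryReader reader) {
    cache.put(key, reader);
  }
}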
On Thu, May 28, 2020 at 7:02 AM Adrien Grand <jpou...@gmail.com> wrote:

> I opened https://issues.apache.org/jira/browse/LUCENE-9387.
>
> On Thu, May 28, 2020 at 2:41 PM Michael McCandless <luc...@mikemccandless.com> wrote:
>
>> +1 to remove Accountable from Lucene's reader classes. Let's open an
>> issue and discuss there?
>>
>> In the past, when we added Accountable, Lucene's Codec/LeafReaders used
>> quite a bit of heap, and the implementation was much closer to correct
>> (as measured by percentage difference).
>>
>> But now that we've moved nearly everything off heap, the relatively
>> small amount of RAM in use is really not worth tracking.
>>
>> I think John's use case, carefully relying on the precisely correct
>> number when aggregated across many readers with many fields and maybe
>> many segments, highlights how dangerous it is for Lucene to claim it is
>> reporting heap usage when it can be substantially off (as seen by the
>> percentage difference).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, May 28, 2020 at 2:51 AM Adrien Grand <jpou...@gmail.com> wrote:
>>
>>> To be clear, there is no plan to remove RAM accounting from readers
>>> yet; this is just something I have been thinking about recently, so
>>> your use case caught my attention.
>>>
>>> Given how low the memory usage is nowadays, I believe it would be
>>> extremely hard to make sure that RAM estimates are wrong by less than
>>> 2x in most cases. This is because everything counts now, so things
>>> that we historically ignored for memory usage, such as the maps you
>>> pointed out, can no longer be ignored. Unfortunately, these maps are
>>> not the only thing we ignored that we should no longer ignore: there
>>> are also field infos, segment infos, BufferedIndexInput buffers, ...
>>> not to mention things that we cache in threadlocals, as we would now
>>> need to count the number of threads that have a cache entry.
>>>
>>> I'd suggest looking into opportunities to better contain the number
>>> of fields and segments, as David suggested, and caching at most a
>>> fixed number of readers instead of relying on RAM accounting.
>>>
>>> On Wed, May 27, 2020 at 11:21 PM John Wang <john.w...@gmail.com> wrote:
>>>
>>>> Thanks Adrien!
>>>>
>>>> It is surprising to learn that this is an invalid use case and that
>>>> Lucene is planning to get rid of memory accounting...
>>>>
>>>> There are indeed many fields: our test uses 1000 numeric doc values
>>>> fields and 5 million docs in one segment. (We will have many segments
>>>> in our production use case.)
>>>>
>>>> Accounting for the elements in the maps reports 363456 bytes, versus
>>>> 59216 bytes for the default behavior, almost a 600% difference.
>>>>
>>>> We have deployments with far more than 1000 fields, so I don't think
>>>> that is extreme.
>>>>
>>>> Our use case: we will have many segments/readers, and we found that
>>>> opening them at query time is expensive, so we are caching them.
>>>>
>>>> Since we don't know the data ahead of time, we are using the reader's
>>>> accounted memory as the cache size.
>>>>
>>>> We found the reader's accounting to be unreliable, dug into it, and
>>>> found this.
>>>>
>>>> If we should not be using this, what would be the correct way to
>>>> handle this?
>>>>
>>>> Thank you
>>>>
>>>> -John
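For context, the weighing pattern John describes above might look roughly like the sketch below, assuming Lucene 8.x, where segment readers still implement Accountable (LUCENE-9387 removes this in later versions). The class name is illustrative:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.Accountable;

final class ReaderWeigher {
  // Sum the per-segment estimates the thread shows to be unreliable.
  static long estimatedRamBytes(DirectoryReader reader) {
    long total = 0;
    for (LeafReaderContext ctx : reader.leaves()) {
      if (ctx.reader() instanceof Accountable) {
        total += ((Accountable) ctx.reader()).ramBytesUsed();
      }
    }
    return total;
  }
}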
>>>> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>>>>
>>>>> A couple of major versions ago, Lucene required tons of heap memory
>>>>> to keep a reader open, e.g. norms were on heap and so on. To my
>>>>> knowledge, the only thing that is now kept in memory and is a
>>>>> function of maxDoc is live docs; all other codec components require
>>>>> very little memory. I'm actually wondering whether we should remove
>>>>> memory accounting on readers. When Lucene used tons of memory, we
>>>>> could focus on the main contributors to memory usage and be mostly
>>>>> correct. But now, given how little memory Lucene is using, it's
>>>>> quite hard to figure out the main contributing factors to memory
>>>>> usage. And it's probably not that useful either: why is it important
>>>>> to know how much memory something is using if it's not much?
>>>>>
>>>>> So I'd be curious to know more about your use case for reader
>>>>> caching. Would we break your use case if we removed memory
>>>>> accounting on readers? Given the lines you are pointing out, I
>>>>> assume you have either many fields or many segments if these maps
>>>>> are using lots of memory?
>>>>>
>>>>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We have a reader cache that depends on the memory usage of each
>>>>>> reader. We found that the calculation of reader size for doc values
>>>>>> is undercounting.
>>>>>>
>>>>>> See this line:
>>>>>>
>>>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>>>>
>>>>>> It looks like the memory estimate only uses the shallow size of the
>>>>>> class and does not include the objects stored in the maps:
>>>>>>
>>>>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>>>>
>>>>>> We made a local patch and saw a significant difference in reported
>>>>>> size.
>>>>>>
>>>>>> Please let us know if this is the right thing to do; we are happy
>>>>>> to contribute our patch.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -John
>>>>>
>>>>> --
>>>>> Adrien
>>>
>>> --
>>> Adrien
>
> --
> Adrien
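For reference, the kind of patch John describes might look like the hedged sketch below: deep-counting the per-field entry maps in Lucene80DocValuesProducer instead of reporting only the shallow instance size. The map names follow the file linked above, but treat the method as illustrative rather than the committed change. Note that RamUsageEstimator.sizeOfMap walks entries and falls back to a default size for value types it does not recognize, so a real patch might pass an explicit per-entry estimate via the (map, defSize) overload:

import org.apache.lucene.util.RamUsageEstimator;

// Illustrative sketch, not the actual patch from this thread.
@Override
public long ramBytesUsed() {
  long bytes = RamUsageEstimator.shallowSizeOfInstance(getClass());
  bytes += RamUsageEstimator.sizeOfMap(numerics);       // per-field NumericEntry map
  bytes += RamUsageEstimator.sizeOfMap(binaries);       // per-field BinaryEntry map
  bytes += RamUsageEstimator.sizeOfMap(sorted);
  bytes += RamUsageEstimator.sizeOfMap(sortedSets);
  bytes += RamUsageEstimator.sizeOfMap(sortedNumerics);
  return bytes;
}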