John: you may benefit from more eagerly merging small segments on commit.
At Salesforce we have a *ton* of indexes, and we cut the segment count
roughly in half relative to the default.  The large number of fields made
this trade-off especially worthwhile.  You might look at this recent issue
https://issues.apache.org/jira/browse/LUCENE-8962 which isn't released yet,
but in it I show (with PRs to code) how to accomplish this without hacking
on Lucene itself.  You may also find this conference presentation I gave
with my colleagues interesting; it touches on this:
https://youtu.be/hqeYAnsxPH8?t=855
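
Roughly, the idea looks like this (a sketch only: since the issue isn't
released, the exact hook name and signature may still change, and the
segment-selection logic here is left to the wrapped policy):

  import java.io.IOException;
  import org.apache.lucene.index.FilterMergePolicy;
  import org.apache.lucene.index.MergePolicy;
  import org.apache.lucene.index.MergeTrigger;
  import org.apache.lucene.index.SegmentInfos;

  // Wraps any MergePolicy and opts in to merge-on-commit via the
  // findFullFlushMerges hook from LUCENE-8962 (unreleased; may differ).
  public class MergeOnCommitPolicy extends FilterMergePolicy {
    public MergeOnCommitPolicy(MergePolicy in) {
      super(in);
    }

    @Override
    public MergeSpecification findFullFlushMerges(
        MergeTrigger trigger, SegmentInfos infos, MergeContext ctx)
        throws IOException {
      // Delegate segment selection to the wrapped policy; a real
      // implementation would pick only the small segments just flushed.
      return in.findFullFlushMerges(trigger, infos, ctx);
    }
  }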

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, May 27, 2020 at 5:21 PM John Wang <john.w...@gmail.com> wrote:

> Thanks Adrien!
>
> It is surprising to learn that this is considered an invalid use case and
> that Lucene may get rid of memory accounting...
>
> There are indeed many fields. Our test used 1000 numeric doc values
> fields and 5 million docs in a single segment. (We will have many
> segments in our production use case.)
>
> When we account for the elements in the maps, the reported memory usage
> is 363456 bytes versus 59216 bytes with the default behavior, roughly a
> 6x difference.
>
> We have deployments with much more than 1000 fields, so I don't think that
> is extreme.
>
> Our use case:
>
> We will have many segments/readers, and we found that opening them at
> query time is expensive, so we are caching them.
>
> Since we don't know the data ahead of time, we are using the reader's
> accounted memory as the cache size.
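>
> In rough Java terms, the pattern is something like this (a simplified
> sketch; ReaderCache, the LRU policy, and maxBytes are illustrative, not
> our actual code):
>
>   import java.util.Iterator;
>   import java.util.LinkedHashMap;
>   import java.util.Map;
>   import org.apache.lucene.util.Accountable;
>
>   // Size-bounded LRU cache weighted by each reader's reported RAM usage.
>   final class ReaderCache<K> {
>     private final long maxBytes;
>     private long usedBytes;
>     private final Map<K, Accountable> cache =
>         new LinkedHashMap<>(16, 0.75f, true); // access-order for LRU
>
>     ReaderCache(long maxBytes) {
>       this.maxBytes = maxBytes;
>     }
>
>     synchronized void put(K key, Accountable reader) {
>       usedBytes += reader.ramBytesUsed(); // too low if estimate is shallow
>       cache.put(key, reader);
>       Iterator<Map.Entry<K, Accountable>> it = cache.entrySet().iterator();
>       while (usedBytes > maxBytes && it.hasNext()) {
>         usedBytes -= it.next().getValue().ramBytesUsed();
>         it.remove();
>       }
>     }
>   }
>
> If ramBytesUsed() under-reports, the cache admits far more readers than
> the byte budget intends, which is the problem we're hitting.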
>
> We found the reader's accounting to be unreliable, dug into it, and
> found this.
>
> If we should not be using this, what would be the correct way to handle
> this?
>
> Thank you
>
> -John
>
>
> On Wed, May 27, 2020 at 1:36 PM Adrien Grand <jpou...@gmail.com> wrote:
>
>> A couple major versions ago, Lucene required tons of heap memory to keep
>> a reader open, e.g. norms were on heap and so on. To my knowledge, the only
>> thing that is now kept in memory and is a function of maxDoc is live docs;
>> all other codec components require very little memory. I'm actually
>> wondering whether we should remove memory accounting on readers. When Lucene
>> used tons of memory we could focus on the main contributors to memory usage
>> and be mostly correct. But now, given how little memory Lucene uses, it's
>> quite hard to figure out what the main contributing factors to memory
>> usage are. And it's probably not that useful either: why is it important
>> to know how much memory something is using if it's not much?
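>>
>> (For concreteness, the accounting under discussion is what Accountable
>> exposes on each segment reader. A minimal sketch of how a caller reads
>> it, with an illustrative index path:)
>>
>>   import java.io.IOException;
>>   import java.nio.file.Paths;
>>   import org.apache.lucene.index.DirectoryReader;
>>   import org.apache.lucene.index.LeafReaderContext;
>>   import org.apache.lucene.store.FSDirectory;
>>   import org.apache.lucene.util.Accountable;
>>
>>   public class ReaderRamUsage {
>>     public static void main(String[] args) throws IOException {
>>       try (DirectoryReader reader =
>>           DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
>>         long total = 0;
>>         for (LeafReaderContext leaf : reader.leaves()) {
>>           if (leaf.reader() instanceof Accountable) { // SegmentReader is
>>             total += ((Accountable) leaf.reader()).ramBytesUsed();
>>           }
>>         }
>>         System.out.println("reported RAM bytes: " + total);
>>       }
>>     }
>>   }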
>>
>> So I'd be curious to know more about your use-case for reader caching.
>> Would we break your use-case if we removed memory accounting on readers?
>> Given the lines that you are pointing out, I believe you must have either
>> many fields or many segments if these maps are using lots of memory?
>>
>>
>> On Wed, May 27, 2020 at 9:52 PM John Wang <john.w...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have a reader cache whose sizing depends on the memory usage of each
>>> reader. We found that the calculation of reader size for doc values is
>>> under-counting.
>>>
>>> See line:
>>>
>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L69
>>>
>>> It looks like the memory estimate only uses the shallow size of the
>>> class and does not include the objects stored in the maps:
>>>
>>>
>>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java#L55
>>>
>>> We made a local patch and saw a significant difference in reported size.
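>>>
>>> The change is roughly along these lines (a sketch, not our exact patch;
>>> the helper and parameter names are illustrative, and we assume
>>> RamUsageEstimator.sizeOfMap is available):
>>>
>>>   import java.util.Map;
>>>   import org.apache.lucene.util.RamUsageEstimator;
>>>
>>>   final class DeepSizeSketch {
>>>     // Instead of only the shallow instance size (what the linked line 69
>>>     // computes today), also count the keys, entries, and values of the
>>>     // producer's per-field maps.
>>>     static long estimate(Object producer, Map<String, ?>... fieldMaps) {
>>>       long bytes = RamUsageEstimator.shallowSizeOf(producer);
>>>       for (Map<String, ?> map : fieldMaps) {
>>>         bytes += RamUsageEstimator.sizeOfMap(map);
>>>       }
>>>       return bytes;
>>>     }
>>>   }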
>>>
>>> Please let us know if this is the right thing to do, we are happy to
>>> contribute our patch.
>>>
>>> Thanks
>>>
>>> -John
>>>
>>
>>
>> --
>> Adrien
>>
>
