[ 
https://issues.apache.org/jira/browse/LUCENE-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5773:
---------------------------------

    Attachment: LUCENE-5773.patch

Here is a patch. It compares the output of {{SegmentReader.ramBytesUsed}} 
against {{RamUsageTester}} for various codecs. For the test to pass, the 
error needs to be either under 10% (relative) or under 500 bytes (absolute) on 
an index of 100k documents with random small fields. The absolute threshold is 
needed for things that consume very little memory, like Lucene 4.9's norms 
with constant compression, or stored fields. Otherwise the test would fail 
very easily due to the constant overhead of the objects that we maintain to 
make SegmentReader work.
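
As a rough illustration of the acceptance rule described above (this is not 
the code from the patch), the check could look like the sketch below. The 
class name, method name and the choice to take the relative error against the 
measured value are assumptions made for the example.

```java
public final class RamEstimateTolerance {
    // Thresholds taken from the description above.
    public static final double MAX_RELATIVE_ERROR = 0.10; // 10%
    public static final long MAX_ABSOLUTE_ERROR = 500;    // bytes

    /**
     * Returns true if the reported size is close enough to the measured one.
     * Assumption: the relative error is computed against the measured value.
     */
    public static boolean isAcceptable(long measuredBytes, long reportedBytes) {
        long absoluteError = Math.abs(measuredBytes - reportedBytes);
        if (absoluteError < MAX_ABSOLUTE_ERROR) {
            // Small structures: constant object overhead dominates, so only
            // the absolute difference is meaningful.
            return true;
        }
        return absoluteError < MAX_RELATIVE_ERROR * measuredBytes;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable(200, 600));             // true: 400 bytes off
        System.out.println(isAcceptable(1_000_000, 1_050_000)); // true: 5% off
        System.out.println(isAcceptable(1_000_000, 1_200_000)); // false: 20% off
    }
}
```

The first case shows why the absolute threshold matters: a 200-byte structure 
reported as 600 bytes is off by 200% in relative terms, yet the difference is 
only constant overhead.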

I had to refactor {{RamUsageTester}} a bit to make it work. In particular, I 
needed to make sure that pointers to other segments and to directory objects 
are not followed. Otherwise this would count e.g. the buffers of the NIO 
directory.
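
To make the idea of not following some pointers concrete, here is a toy 
sketch. It does not use {{RamUsageTester}} or reflection: the {{Node}} class, 
the {{sizeOf}} method and the predicate are hypothetical stand-ins that model 
objects as nodes with a shallow size and outgoing references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public final class FilteredGraphWalk {
    /** Hypothetical stand-in for a heap object: shallow size plus references. */
    public static final class Node {
        public final String name;
        public final long shallowBytes;
        public final List<Node> refs = new ArrayList<>();

        public Node(String name, long shallowBytes) {
            this.name = name;
            this.shallowBytes = shallowBytes;
        }

        public Node ref(Node other) {
            refs.add(other);
            return this;
        }
    }

    /** Sums shallow sizes reachable from root, skipping references the filter rejects. */
    public static long sizeOf(Node root, Predicate<Node> follow) {
        // Identity-based set so that each object is counted exactly once.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>();
        long total = 0;
        seen.add(root);
        stack.push(root);
        while (!stack.isEmpty()) {
            Node current = stack.pop();
            total += current.shallowBytes;
            for (Node target : current.refs) {
                if (follow.test(target) && seen.add(target)) {
                    stack.push(target);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Node directory = new Node("directory", 50_000); // e.g. directory buffers
        Node postings = new Node("postings", 1_000).ref(directory);
        Node reader = new Node("reader", 100).ref(postings).ref(directory);

        System.out.println(sizeOf(reader, n -> true));                        // 51100
        System.out.println(sizeOf(reader, n -> !n.name.equals("directory"))); // 1100
    }
}
```

Without the filter, the shared directory node dominates the total even though 
it does not belong to the reader being measured.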

It found a couple of interesting bugs although the default codec had pretty 
accurate estimations. Quick overview of things that have been fixed and/or are 
surprising:
 - PagedBytes.Reader assumed all pages had the same size. However with 
trim=true the last page is trimmed so the estimation could be quite far from 
accurate with large page sizes. It now returns the exact memory usage (as 
reported by RamUsageTester).
 - The various FSTs that we use in codecs sometimes have massive cached root 
arcs, MemoryPostingsFormat in particular but that was also the case for 
BlockTreeTermsReader (or maybe it is due to the test data?).
 - Other bugs were mostly about forgotten references, or things counted twice 
(e.g. a paged bytes and a reader to the same pages).
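
The trimmed-last-page issue from the first item can be illustrated with a 
small sketch. This is not the {{PagedBytes}} code: the class and method names 
are made up for the example, and only the payload bytes are counted (array 
headers and object overhead are ignored).

```java
public final class PagedEstimate {
    /** Builds pages for totalBytes of data, trimming the last page to its used length. */
    public static byte[][] allocateTrimmed(long totalBytes, int pageSize) {
        int fullPages = (int) (totalBytes / pageSize);
        int remainder = (int) (totalBytes % pageSize);
        byte[][] pages = new byte[fullPages + (remainder > 0 ? 1 : 0)][];
        for (int i = 0; i < fullPages; i++) {
            pages[i] = new byte[pageSize];
        }
        if (remainder > 0) {
            pages[fullPages] = new byte[remainder]; // trimmed last page
        }
        return pages;
    }

    /** Estimate that assumes every page has the full page size. */
    public static long uniformEstimate(byte[][] pages, int pageSize) {
        return (long) pages.length * pageSize;
    }

    /** Sums the actual length of each page. */
    public static long exactPayload(byte[][] pages) {
        long total = 0;
        for (byte[] page : pages) {
            total += page.length;
        }
        return total;
    }

    public static void main(String[] args) {
        int pageSize = 1 << 20; // 1 MiB pages
        byte[][] pages = allocateTrimmed(2L * pageSize + 10, pageSize);
        System.out.println(uniformEstimate(pages, pageSize)); // 3145728
        System.out.println(exactPayload(pages));              // 2097162
    }
}
```

With large pages and a nearly empty last page, the uniform estimate overshoots 
by almost a full page, which is the kind of gap described above.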

> Test SegmentReader.ramBytesUsed
> -------------------------------
>
>                 Key: LUCENE-5773
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5773
>             Project: Lucene - Core
>          Issue Type: Test
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-5773.patch
>
>
> In the past there have been cases where the memory reported by this API was 
> larger than the JVM heap size, so we should try to add some basic tests for it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
