Mike - thanks for the explanation, it makes perfect sense!

Otis

----- Original Message ----
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 8:03:44 PM
Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
  buffer added documents


Hi Otis!

"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new 3.4
> [90.1% less]).
> 
> I don't follow 100% of what you are doing in LUCENE-843, so could you
> please explain what these 2 different amounts of RAM are?
> Is the first (1-96) the RAM you use for in-memory merging of segments?
> What is the RAM used @ flush?  More precisely, why is it that that amount
> of RAM exceeds the RAM buffer?

Very good questions!

When I say "the RAM buffer size is set to 96 MB", what I mean is I
flush the writer when the in-memory segments are using 96 MB RAM.  On
trunk, I just call ramSizeInBytes().  I do the analogous thing with my
patch (sum up size of RAM buffers used by segments).  I call this part
of the RAM usage the "indexed documents RAM".  With every added
document, this grows.
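
As a rough illustration of the flush rule described above (not code from
trunk or from the LUCENE-843 patch), the sketch below flushes once the
buffered documents reach a byte budget. RamBufferSketch and all of its
members are hypothetical stand-ins; the real IndexWriter accounting is
more involved than summing byte[] lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a writer that flushes by RAM usage.
public class RamBufferSketch {
    private final long maxBufferBytes;                // the "RAM buffer size"
    private final List<byte[]> buffered = new ArrayList<>();
    private long indexedDocsRam = 0;                  // the "indexed documents RAM"
    private int flushCount = 0;

    public RamBufferSketch(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Buffer one document; this part of the RAM grows with every add.
    public void addDocument(byte[] doc) {
        buffered.add(doc);
        indexedDocsRam += doc.length;
        if (indexedDocsRam >= maxBufferBytes) {
            flush();
        }
    }

    // Assumption: a flush writes the buffer out and releases it.
    public void flush() {
        buffered.clear();
        indexedDocsRam = 0;
        flushCount++;
    }

    public long ramSizeInBytes() { return indexedDocsRam; }

    public int flushCount() { return flushCount; }
}
```

Note that this only tracks the buffered data; anything allocated
temporarily while processing a single document is not counted.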

But: this does not account for all data structures (Posting instances,
HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used,
but not saved away, during the indexing of a single document.  All the
"things" used temporarily while indexing a document take up RAM too.
I call this part of the RAM usage the "document processing RAM".  This
RAM does not grow with every added document, though its size is in
proportion to how big each document is.  This memory is always
re-used (does not grow with time).  But with the trunk, this is done
by creating garbage, whereas with my patch, I explicitly reuse it.

When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.
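
For reference, a minimal sketch of taking that measurement with the
standard java.lang.management API. The class name HeapAtFlush and the
overheadBytes helper are hypothetical and are not taken from the actual
benchmark code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapAtFlush {
    // Heap bytes currently in use as reported by the JVM. This includes
    // garbage that has not been collected yet, so it can be noisy.
    public static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    // Hypothetical helper: the measured heap minus the "indexed documents
    // RAM" gives the per-document processing overhead (plus garbage).
    public static long overheadBytes(long usedHeap, long indexedDocsRam) {
        return usedHeap - indexedDocsRam;
    }

    public static void main(String[] args) {
        long used = usedHeapBytes();
        System.out.printf("heap used: %.1f MB%n", used / (1024.0 * 1024.0));
    }
}
```

Calling usedHeapBytes() right before each flush and averaging the
results would give a figure comparable to "Avg RAM used (MB) @ flush".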

With the trunk, the actual process memory usage tends to be quite a
bit higher than the RAM buffer size and also tends to be very "noisy"
(jumps around with each flush).  I think this is because of
delays/unpredictability on when GC kicks in to reclaim the garbage
created during indexing of the doc.  Whereas with my patch, it's
usually quite a bit closer to the "indexed documents RAM" and does not
jump around nearly as much.

So the "actual process RAM used" will always exceed my "RAM buffer
size".  The amount of excess is a measure of the "overhead" required
to process the document.  The trunk has far worse overhead than with
my patch, which I think means a given application will be able to use
a *larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




