Mike - thanks for the explanation, it makes perfect sense!

Otis

----- Original Message ----
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 8:03:44 PM
Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Hi Otis!

"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush: old 34.5; new
> 3.4 [ 90.1% less]).
>
> I don't follow 100% of what you are doing in LUCENE-843, so could you
> please explain what these 2 different amounts of RAM are?
> Is the first (1-96) the RAM you use for in-memory merging of segments?
> What is the RAM used @ flush? More precisely, why is it that that amount
> of RAM exceeds the RAM buffer?

Very good questions!

When I say "the RAM buffer size is set to 96 MB", what I mean is I flush the writer when the in-memory segments are using 96 MB RAM. On trunk, I just call ramSizeInBytes(). I do the analogous thing with my patch (sum up the size of the RAM buffers used by segments). I call this part of the RAM usage the "indexed documents RAM". With every added document, this grows.

But: this does not account for all the data structures (Posting instances, HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used, but not saved away, during the indexing of a single document. All the "things" used temporarily while indexing a document take up RAM too. I call this part of the RAM usage the "document processing RAM". This RAM does not grow with every added document, though its size is in proportion to how big each document is. This memory is always re-used (does not grow with time). But with the trunk, this is done by creating garbage, whereas with my patch, I explicitly reuse it.

When I measure "amount of RAM @ flush time", I'm calling MemoryMXBean.getHeapMemoryUsage().getUsed().
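
A minimal sketch of how such a heap reading could be taken and compared with the buffered size. The class FlushHeapProbe and its method names are hypothetical (they are not part of Lucene or of the LUCENE-843 patch), and the 32 MB figure is an assumed stand-in for whatever ramSizeInBytes() would report; only the MemoryMXBean calls are real JDK API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper for illustration only; not part of Lucene.
final class FlushHeapProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap bytes currently in use, including garbage the GC has not reclaimed yet. */
    static long heapUsedBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Heap in use beyond the "indexed documents RAM" (the buffered size). */
    static long overheadBytes(long heapUsedBytes, long bufferedBytes) {
        return heapUsedBytes - bufferedBytes;
    }

    /** Converts a byte count to MB for reporting. */
    static double toMB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Assumed value; a real test would read it from the writer at flush time.
        long bufferedBytes = 32L * 1024 * 1024;
        long heapUsed = heapUsedBytes();
        System.out.printf("buffered: %.1f MB, heap used: %.1f MB, overhead: %.1f MB%n",
                toMB(bufferedBytes), toMB(heapUsed),
                toMB(overheadBytes(heapUsed, bufferedBytes)));
    }
}
```

Because getUsed() counts objects that are already unreachable but not yet collected, a reading taken this way depends on when the last GC ran. In a standalone run like this one, where nothing has actually been buffered, the printed overhead can be negative.
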
So, this measures actual process memory usage, which should be (for my tests) around the sum of the above two types of RAM usage.

With the trunk, the actual process memory usage tends to be quite a bit higher than the RAM buffer size and also tends to be very "noisy" (jumps around with each flush). I think this is because of delays/unpredictability in when GC kicks in to reclaim the garbage created during indexing of the doc. Whereas with my patch, it's usually quite a bit closer to the "indexed documents RAM" and does not jump around nearly as much.

So the "actual process RAM used" will always exceed my "RAM buffer size". The amount of excess is a measure of the "overhead" required to process the document. The trunk has far worse overhead than my patch, which I think means a given application will be able to use a *larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]