Shai Erera wrote:
Hi
I have a question on the setting of RAMBufferSizeMB on IndexWriter.
It may sound like it belongs to the user list, but I actually think
there is a problem with it, so I'm posting it to the dev list.
I'm using 2.3.1 to index a set of documents (500K Amazon books to be
exact). I don't use norms and most of the fields I index are also
stored. I'm setting IndexWriter like this:
indexwriter.setRAMBufferSizeMB(128);
indexwriter.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
AFAIU, the first line would set the RAM usage by IW to 128MB and the
second would disable flushing by doc count. Naturally, I'd expect
nothing to be written to the file system until those 128MB are
consumed.
That's not quite right. Stored fields and term vectors are written
immediately into the directory, document by document. The rest
(posting lists, norms, terms, field infos) is buffered in RAM and
flushed when RAM usage hits the limit.
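To make the two paths concrete, here is a toy model in Java. It is not
Lucene code: the class ToyBufferingWriter and all of its methods are
hypothetical names made up for illustration, and the byte counts are
arbitrary. It only sketches the behaviour described above, where stored
field bytes go to "disk" on every document while posting bytes
accumulate until a RAM limit triggers a flush.

```java
// Toy model only. This is NOT Lucene code; every name here is hypothetical.
class ToyBufferingWriter {
    private final long ramLimitBytes;        // plays the role of the RAM buffer size
    private long storedBytesOnDisk = 0;      // stored fields: written through per document
    private long bufferedPostingBytes = 0;   // postings: held in RAM until a flush
    private int flushCount = 0;

    ToyBufferingWriter(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
    }

    // Add one document: stored bytes hit "disk" at once, postings are buffered.
    void addDocument(long storedFieldBytes, long postingBytes) {
        storedBytesOnDisk += storedFieldBytes;
        bufferedPostingBytes += postingBytes;
        if (bufferedPostingBytes >= ramLimitBytes) {
            flush();
        }
    }

    // Flush the buffered postings and reset the RAM accounting.
    void flush() {
        if (bufferedPostingBytes > 0) {
            bufferedPostingBytes = 0;
            flushCount++;
        }
    }

    long storedBytesOnDisk() { return storedBytesOnDisk; }
    long bufferedPostingBytes() { return bufferedPostingBytes; }
    int flushCount() { return flushCount; }
}
```

In this model the "disk" counter grows after every addDocument call even
though no flush has happened yet, which matches seeing the .fdt file
grow long before the RAM limit is reached.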
However, that does not seem to be the case. I watch the file system
and do periodic refresh (Windows) and I notice that stuff gets written
to the disk (.fdt file) every few KB. Task Manager shows the
application is not consuming 128MB ...
Do you use term vectors? There is one silly bug, which will be fixed
in 2.3.2, whereby the storage used by term vectors was incorrectly
counted as RAM usage. That could explain flushing before you
actually hit 128 MB. Normally you'd see > 128 MB usage in Task
Manager because other things consume RAM too...
So I debug-traced the application and noticed the following:
- DocumentsWriter calls fieldsWriter.flushDocument in writeDocument(),
  passing a RAMOutputStream instance (fdtLocal).
- FieldsWriter calls RAMOutputStream.writeTo() and passes fieldsStream,
  which is of type FSIndexOutput.
- FSIndexOutput maintains an internal buffer of size 16KB (fixed) and
  eventually flushes the buffer to the RandomAccessFile it maintains.
So far, the 128MB setting was not applied anywhere, AFAIK.
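The last step of that trace can be sketched with a toy fixed-size
buffered output in Java. This is not the FSIndexOutput source: the class
ToyBufferedOutput is a hypothetical stand-in, and a ByteArrayOutputStream
plays the role of the file. Only the 16KB buffer size is taken from the
trace above. The sketch shows why bytes would reach the file in 16KB
chunks no matter how large a separate RAM budget is.

```java
import java.io.ByteArrayOutputStream;

// Toy model only. This is NOT Lucene's FSIndexOutput; all names are hypothetical.
class ToyBufferedOutput {
    static final int BUFFER_SIZE = 16 * 1024;   // fixed 16KB, as noted in the trace

    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int used = 0;                        // bytes currently held in the buffer
    private int flushCount = 0;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the file

    // Copy data into the fixed buffer, flushing to the sink each time it fills up.
    void writeBytes(byte[] data, int offset, int length) {
        while (length > 0) {
            int n = Math.min(length, BUFFER_SIZE - used);
            System.arraycopy(data, offset, buffer, used, n);
            used += n;
            offset += n;
            length -= n;
            if (used == BUFFER_SIZE) {
                flush();
            }
        }
    }

    // Push whatever is buffered to the sink.
    void flush() {
        if (used > 0) {
            sink.write(buffer, 0, used);
            used = 0;
            flushCount++;
        }
    }

    int bytesInSink() { return sink.size(); }
    int flushCount() { return flushCount; }
}
```

Writing 40,000 bytes through this toy leaves 32,768 bytes in the sink
after two automatic flushes, with the remainder still sitting in the
16KB buffer until an explicit flush().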
In DocumentsWriter, "bufferIsFull" gets set to true by the
balanceRAM() method, which will then result in a flush.
Can someone please explain to me how this works? Am I missing
something (maybe a patch post 2.3.1)?
One other thing I forgot to mention: I started this investigation
after playing with the RAM usage and maxBufferedDocs (MBD) settings.
Setting MBD to 10,000 resulted in the same performance as setting RAM
to 128MB, however it consumed much less RAM (~70MB according to
Windows' Task Manager, which is not the most accurate thing).
Well ... there is a point of diminishing returns. You can try
rolling back your RAM buffer size and see at what point it no longer
helps.
Mike