Shai Erera wrote:
Hi

I have a question on the setting of RAMBufferSizeMB on IndexWriter. It may sound like it belongs to the user list, but I actually think there is a
problem with it, so I'm posting it to the dev list.

I'm using 2.3.1 to index a set of documents (500K Amazon books to be exact).
I don't use norms and most of the fields I index are also stored. I'm
setting IndexWriter like this:
    indexwriter.setRAMBufferSizeMB(128);
    indexwriter.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

AFAIU, the first line would set the RAM usage by IW to 128MB and the second would disable flushing by doc count. Naturally, I'd expect nothing to be
written to the file system until those 128MB are consumed.

That's not quite right. Stored fields and term vectors are written immediately into the directory, document by document. The rest (posting lists, norms, terms, field infos) is buffered in RAM and flushed when RAM usage hits the limit.
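
A toy model of the two write paths described above may help. This is only a sketch: ToyBufferingWriter and all of its fields are hypothetical names, not Lucene classes, and the byte counts are made up. It assumes stored-field bytes go to the directory on every document, while posting bytes accumulate in RAM and are flushed once they reach the limit.

```java
// Toy model only: these names are hypothetical, not Lucene classes.
class ToyBufferingWriter {
    private final long ramLimitBytes;
    private long bufferedPostingBytes = 0; // postings still held in RAM
    private long storedBytesOnDisk = 0;    // stored fields, written per document
    private long postingBytesOnDisk = 0;   // postings written by flushes
    private int flushCount = 0;

    ToyBufferingWriter(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
    }

    void addDocument(int storedFieldBytes, int postingBytes) {
        // Stored fields bypass the RAM buffer and go straight to "disk".
        storedBytesOnDisk += storedFieldBytes;
        // Postings are buffered and only count against the RAM limit.
        bufferedPostingBytes += postingBytes;
        if (bufferedPostingBytes >= ramLimitBytes) {
            flush();
        }
    }

    void flush() {
        postingBytesOnDisk += bufferedPostingBytes;
        bufferedPostingBytes = 0;
        flushCount++;
    }

    long storedBytesOnDisk() { return storedBytesOnDisk; }
    long postingBytesOnDisk() { return postingBytesOnDisk; }
    long bufferedPostingBytes() { return bufferedPostingBytes; }
    int flushCount() { return flushCount; }

    public static void main(String[] args) {
        ToyBufferingWriter writer = new ToyBufferingWriter(1000);
        for (int i = 1; i <= 4; i++) {
            writer.addDocument(200, 300);
            System.out.println("doc " + i
                + ": stored on disk=" + writer.storedBytesOnDisk()
                + ", postings in RAM=" + writer.bufferedPostingBytes()
                + ", flushes=" + writer.flushCount());
        }
    }
}
```

In this model the stored-field total on disk grows with every document even though no flush has happened yet, which matches seeing the .fdt file grow long before the RAM limit is reached.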

However, that
does not seem to be the case. I watch the file system and do periodic
refresh (Windows) and I notice that stuff gets written to the disk (.fdt file) every few KB. Task Manager shows the application is not consuming
128MB ...

Do you use term vectors? There is one silly bug, which will be fixed in 2.3.2, whereby the storage used by term vectors was incorrectly counted as RAM usage. That could explain flushing before you actually hit 128 MB. Normally you'd see > 128 MB usage in Task Manager because other things consume RAM too...

So I debug-traced the application and noticed the following:
- DocumentsWriter calls fieldsWriter.flushDocument in writeDocument(),
passing a RAMOutputStream instance (fdtLocal).
- FieldsWriter calls RAMOutputStream.writeTo() and passes fieldsStream,
which is of type FSIndexOutput.
- FSIndexOutput maintains an internal buffer of size 16KB (fixed) and
eventually flushes the buffer to the RandomAccessFile it maintains.
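
The trace above can be illustrated with a small stand-in for a fixed-size output buffer. This is a sketch under assumptions: ToyBufferedOutput is a hypothetical class, not FSIndexOutput, and a ByteArrayOutputStream plays the role of the RandomAccessFile. Only the 16KB buffer size is taken from the description above.

```java
import java.io.ByteArrayOutputStream;

// Toy stand-in for a fixed-size buffered output; not a Lucene class.
class ToyBufferedOutput {
    static final int BUFFER_SIZE = 16 * 1024; // 16KB, as in the trace above

    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int used = 0;
    // Plays the role of the underlying file.
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    void writeBytes(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            // Copy as much as fits into the remaining buffer space.
            int n = Math.min(data.length - offset, BUFFER_SIZE - used);
            System.arraycopy(data, offset, buffer, used, n);
            used += n;
            offset += n;
            if (used == BUFFER_SIZE) {
                flushBuffer(); // buffer is full: push it to the "file"
            }
        }
    }

    void flushBuffer() {
        sink.write(buffer, 0, used);
        used = 0;
    }

    int bytesInSink() { return sink.size(); }
    int bytesPending() { return used; }

    public static void main(String[] args) {
        ToyBufferedOutput out = new ToyBufferedOutput();
        byte[] oneKilobyteDoc = new byte[1024];
        for (int i = 1; i <= 20; i++) {
            out.writeBytes(oneKilobyteDoc);
            System.out.println("doc " + i + ": in file=" + out.bytesInSink()
                + ", pending=" + out.bytesPending());
        }
    }
}
```

With 1KB documents the file grows in 16KB steps, independent of any RAM buffer setting on the writer that feeds it.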

So far, the 128MB setting was not applied anywhere, AFAIK.

In DocumentsWriter, "bufferIsFull" gets set to true by the balanceRAM() method, which then results in a flush.


Can someone please explain to me how this works? Am I missing something (maybe
a patch post-2.3.1)?

One other thing I forgot to mention: I started this investigation after playing with the RAM usage and maxBufferedDocs settings. Setting MBD to 10,000
resulted in the same performance as setting RAM to 128MB, however it
consumed much less RAM (~70MB according to Windows' Task Manager, which is
not the most accurate thing).

Well ... there is a point of diminishing returns. You can try rolling back your RAM buffer size and see at what point it no longer helps.

Mike
