Thanks for clearing that up. I thought I was missing something :-)

No, I don't use term vectors. I only use stored fields and indexed ones,
without norms.
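
To check my understanding of the explanation quoted below, here is a toy
model of the two paths: stored-field bytes pass through a fixed 16 KB file
buffer and reach the file as soon as it fills, while posting bytes accumulate
until the RAM limit triggers a flush. This is only a sketch. FlushModel,
simulate() and the per-document byte counts are hypothetical names and
numbers of mine, not Lucene classes or measurements.

```java
/** Toy model only; none of these names are Lucene classes. */
public class FlushModel {
    // Assumed fixed size of the file output buffer (16 KB, as observed in FSIndexOutput).
    static final int FILE_BUFFER_BYTES = 16 * 1024;

    /** Counters produced by one simulated indexing run. */
    public static final class Result {
        public int storedFieldWrites; // times the 16 KB buffer was written to the "file"
        public int segmentFlushes;    // times buffered postings reached the RAM limit
    }

    public static Result simulate(int docs, int storedBytesPerDoc,
                                  int postingBytesPerDoc, long ramLimitBytes) {
        Result r = new Result();
        long fileBuffer = 0;  // stored-field bytes waiting in the file buffer
        long postingsRam = 0; // posting bytes held in RAM
        for (int i = 0; i < docs; i++) {
            // Stored fields are handed to the file output document by document.
            fileBuffer += storedBytesPerDoc;
            while (fileBuffer >= FILE_BUFFER_BYTES) {
                fileBuffer -= FILE_BUFFER_BYTES;
                r.storedFieldWrites++;
            }
            // Postings stay in RAM until the configured limit is reached.
            postingsRam += postingBytesPerDoc;
            if (postingsRam >= ramLimitBytes) {
                postingsRam = 0;
                r.segmentFlushes++;
            }
        }
        return r;
    }
}
```

With 1000 documents of 2048 stored bytes and 1024 posting bytes each, and a
128 MB limit, the model reports 125 stored-field writes and no segment flush,
which would match seeing the .fdt file grow long before 128 MB is consumed.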

As for the efficiency of RAM usage by IndexWriter, which would perform
better: setting the RAM limit to 128MB, or creating a RAMDirectory and adding
it to an IndexWriter once it reaches 128MB?


On Wed, Mar 19, 2008 at 6:32 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:

>
> Shai Erera wrote:
> > Hi
> >
> > I have a question on the setting of RAMBufferSizeMB on IndexWriter. It may
> > sound like it belongs to the user list, but I actually think there is a
> > problem with it, so I'm posting it to the dev list.
> >
> > I'm using 2.3.1 to index a set of documents (500K Amazon books to be exact).
> > I don't use norms and most of the fields I index are also stored. I'm
> > setting IndexWriter like this:
> >             indexwriter.setRAMBufferSizeMB(128);
> >             indexwriter.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
> >
> > AFAIU, the first line would set the RAM usage by IW to 128MB and the second
> > would disable flushing by doc count. Naturally, I'd expect nothing to be
> > written to the file system until those 128MB are consumed.
>
> That's not quite right.  Stored fields and term vectors are written,
> immediately, into the directory, document by document.  The rest
> (posting lists, norms, terms, field infos) are buffered in RAM and
> flushed when RAM usage hits the limit.
>
> > However, that does not seem to be the case. I watch the file system and do
> > periodic refresh (Windows) and I notice that stuff gets written to the disk
> > (.fdt file) every few KB. Task Manager shows the application is not
> > consuming 128MB ...
>
> Do you use term vectors?  There is one silly bug, which will be fixed
> in 2.3.2, whereby the storage used by term vectors was incorrectly
> counted as RAM usage.  That could explain flushing before you
> actually hit 128 MB.  Normally you'd see > 128 MB usage in task
> manager because other things consume RAM too...
>
> > So I debug-traced the application and noticed the following:
> > - DocumentsWriter calls fieldsWriter.flushDocument in writeDocument(),
> > passing a RAMOutputStream instance (fdtLocal).
> > - FieldsWriter calls RAMOutputStream.writeTo() and passes fieldsStream,
> > which is of type FSIndexOutput.
> > - FSIndexOutput maintains an internal buffer of size 16KB (fixed) and
> > eventually flushes the buffer to the RandomAccessFile it maintains.
> >
> > So far, the 128MB setting was not applied anywhere, AFAIK.
>
> In DocumentsWriter, "bufferIsFull" gets set to true by the
> balanceRAM() method, which will then result in a flush.
>
> >
> > Can someone please explain to me how this works? Am I missing something
> > (maybe a patch post 2.3.1)?
> >
> > One other thing I forgot to mention: I started this investigation after
> > playing with the RAM usage and maxBufferedDocs usage. Setting MBD to 10,000
> > resulted in the same performance as setting RAM to 128MB, however it
> > consumed much less RAM (~70MB according to Windows' Task Manager, which is
> > not the most accurate thing).
>
> Well ... there is a point of diminishing returns.  You can try
> rolling back your RAM buffer size and see at what point it no longer
> helps.
>
> Mike
>


-- 
Regards,

Shai Erera
