Thanks for clearing that up. I thought I was missing something :-) No, I don't use term vectors: only stored and indexed fields, with no norms or term vectors.
As for the efficiency of RAM usage by IndexWriter - which would perform better: setting the RAM limit to 128 MB, or creating a RAMDirectory and adding it to an IndexWriter once it reaches 128 MB?

On Wed, Mar 19, 2008 at 6:32 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:

>
> Shai Erera wrote:
> > Hi
> >
> > I have a question on the setting of RAMBufferSizeMB on IndexWriter. It may
> > sound like it belongs to the user list, but I actually think there is a
> > problem with it, so I'm posting it to the dev list.
> >
> > I'm using 2.3.1 to index a set of documents (500K Amazon books to be exact).
> > I don't use norms and most of the fields I index are also stored. I'm
> > setting IndexWriter like this:
> >   indexwriter.setRAMBufferSizeMB(128);
> >   indexwriter.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);
> >
> > AFAIU, the first line would set the RAM usage by IW to 128MB and the second
> > would disable flushing by doc count. Naturally, I'd expect nothing to be
> > written to the file system until those 128MB are consumed.
>
> That's not quite right. Stored fields and term vectors are written,
> immediately, into the directory, document by document. The rest
> (posting lists, norms, terms, field infos) are buffered in RAM and
> flushed when RAM usage hits the limit.
>
> > However, that
> > does not seem to be the case. I watch the file system and do periodic
> > refresh (Windows) and I notice that stuff gets written to the disk (.fdt
> > file) every few KB. Task Manager shows the application is not consuming
> > 128MB ...
>
> Do you use term vectors? There is one silly bug, which will be fixed
> in 2.3.2, whereby the storage used by term vectors was incorrectly
> counted as RAM usage. That could explain flushing before you
> actually hit 128 MB. Normally you'd see > 128 MB usage in task
> manager because other things consume RAM too...
>
> > So I debug-traced the application and noticed the following:
> > - DocumentsWriter calls fieldsWriter.flushDocument in writeDocument(),
> >   passing a RAMOutputStream instance (fdtLocal).
> > - FieldsWriter calls RAMOutputStream.writeTo() and passes fieldsStream,
> >   which is of type FSIndexOutput.
> > - FSIndexOutput maintains an internal buffer of size 16KB (fixed) and
> >   eventually flushes the buffer to the RandomAccessFile it maintains.
> >
> > So far, the 128MB setting was not applied anywhere, AFAIK.
>
> In DocumentsWriter, "bufferIsFull" gets set to true by the
> balanceRAM() method, which will then result in a flush.
>
> > Can someone please explain to me how this works? Am I missing something
> > (maybe a patch post 2.3.1)?
> >
> > One other thing I forgot to mention: I started this investigation after
> > playing with the RAM usage and maxBufferedDocs settings. Setting MBD to
> > 10,000 resulted in the same performance as setting RAM to 128MB, however it
> > consumed much less RAM (~70MB according to Windows' Task Manager, which is
> > not the most accurate thing).
>
> Well ... there is a point of diminishing returns. You can try
> rolling back your RAM buffer size and see at what point it no longer
> helps.
>
> Mike

--
Regards,
Shai Erera
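
P.S. To picture the behaviour Mike describes (stored fields going to the directory through a small fixed output buffer, while postings wait in RAM for the limit), here is a rough toy model. It is only a sketch under stated assumptions: FlushModel and all of its members are made-up names, not Lucene classes, and the per-document byte sizes are invented for illustration. It does not call Lucene at all.

```java
// Toy model only: none of these names are Lucene classes or fields.
final class FlushModel {
    // Assumed fixed output buffer, mirroring the 16KB FSIndexOutput buffer
    // mentioned in the thread.
    static final int OUTPUT_BUFFER_BYTES = 16 * 1024;

    private final long ramLimitBytes;   // stands in for setRAMBufferSizeMB
    private long storedBuffered = 0;    // stored-field bytes not yet written out
    private long postingsBuffered = 0;  // postings bytes currently held in RAM
    int storedFieldWrites = 0;          // writes to the .fdt-like file
    int ramFlushes = 0;                 // flushes triggered by the RAM limit

    FlushModel(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
    }

    // Adds one document with the given (hypothetical) byte footprints.
    void addDocument(int storedBytes, int postingsBytes) {
        // Stored fields bypass the RAM limit: every full 16KB goes to disk.
        storedBuffered += storedBytes;
        while (storedBuffered >= OUTPUT_BUFFER_BYTES) {
            storedBuffered -= OUTPUT_BUFFER_BYTES;
            storedFieldWrites++;
        }
        // Postings accumulate until the RAM limit is reached, then flush.
        postingsBuffered += postingsBytes;
        if (postingsBuffered >= ramLimitBytes) {
            postingsBuffered = 0;
            ramFlushes++;
        }
    }
}
```

With a 128 MB limit and 1,000 documents of 4 KB stored data and 2 KB postings each, this model reports 250 stored-field writes and no RAM-triggered flush, which matches the pattern of the .fdt file growing every few KB long before the RAM limit is reached.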