I think you misunderstood me - ultimately, the process did reach 128 MB. However, it was flushing the .fdt file before it reached that point. Your explanation about stored fields accounts for that behavior, but it did consume 128 MB.
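(For context, a minimal sketch of the two flush triggers I'm comparing, assuming the Lucene 2.3 API; the path, class name and analyzer are just placeholders, not my actual test code:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class FlushTriggerSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/test-index"), // placeholder path
            new StandardAnalyzer(), true);

        // Option A: flush whenever the buffered postings reach ~128 MB of RAM.
        writer.setRAMBufferSizeMB(128);
        writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

        // Option B (instead of A): flush every 10,000 buffered documents.
        // writer.setMaxBufferedDocs(10000);
        // writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);

        // ... addDocument() loop would go here ...
        writer.close();
      }
    }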
Also, the CFS files that were written were larger than 200 MB (but less than 256 MB), which does not align with the 128 MB setting. But I'm sure there's a good explanation for that as well :-)

As for the RAMDirectory usage, I would think that if Lucene stored a true directory in memory, with segments information and all, writing it to the file system would be as efficient as flushing big chunks of byte[], without having to process the postings and flush them (god forbid) one posting element at a time.

The reason I'm worried about the performance of RAM vs. maxBufferedDocs (MBD) is that I was hoping that with Lucene 2.3, if I have a machine with 4 GB of RAM available for indexing, I'll be able to utilize it. But according to my small test, setting RAM to 128 or MBD to 10,000 (which consumed around 70 MB) gave the same performance. So I find myself asking whether flushing by RAM usage really is more useful than flushing by MBD (as the documentation states).

I will certainly post back any results I get, if I find something.

Shai.

On Wed, Mar 19, 2008 at 9:32 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:

>
> Shai Erera wrote:
> > Thanks for clarifying that up. I thought I miss something :-)
> >
> > No .. I don't use term vectors, only stored fields and indexed ones,
> > no norms or term vectors.
>
> Hmm, then it's hard to explain why when you set buffer to 128 MB you
> never saw the process get up to that usage.
>
> > As for the efficiency of RAM usage by IndexWriter - what would perform
> > better: setting the RAM limit to 128MB, or create a RAMDirectory and
> > add it to an IndexWriter once it reaches 128 MB?
>
> That is a good question. Early versions of LUCENE-843 actually
> flushed segments into a RAMDirectory and then once that RAMDir is
> full, merged the segments to the real directory, using only a
> fraction of the allowed RAM to hold the postings data.
>
> Whereas the final one (for simplicity) just uses the entire buffer to
> hold the postings data.
>
> You can directly see the inefficiency by looking at the size of the
> segments that are flushed: they are never the full size of the RAM
> buffer, due to the overhead of maintaining a malleable data structure
> that allows efficiently appending to the end of any term's posting list.
>
> But, I suspect this may actually give a decent performance gain if
> you do use a RAMDirectory as an intermediary, except for the stored
> fields / term vectors which just use up RAM unnecessarily. Really
> you need a RAMDirectory that can somehow pass-through those files.
>
> If you do some testing here please post back the results! I think
> this is a potential core change that could still give a sizable
> further performance gain to IndexWriter's throughput.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

--
Regards,
Shai Erera
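P.S. In case it helps the discussion, here is a rough sketch of the RAMDirectory-as-intermediary idea (assuming the Lucene 2.3 API; the class name, variable names and the 128 MB threshold are just illustrative, and this is not how LUCENE-843 implements its internal buffering):

    import java.util.Iterator;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBufferedIndexer {
      // Illustrative threshold: merge the in-memory index into the real one
      // once the RAMDirectory holds roughly 128 MB.
      private static final long RAM_LIMIT = 128L * 1024 * 1024;

      public static void index(List docs, Directory destDir) throws Exception {
        IndexWriter destWriter = new IndexWriter(destDir, new StandardAnalyzer(), true);
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);

        for (Iterator it = docs.iterator(); it.hasNext();) {
          ramWriter.addDocument((Document) it.next());
          if (ramDir.sizeInBytes() >= RAM_LIMIT) {
            // The in-memory index is "full": close it and merge its segments
            // into the on-disk index, then start a fresh RAMDirectory.
            ramWriter.close();
            destWriter.addIndexesNoOptimize(new Directory[] { ramDir });
            ramDir = new RAMDirectory();
            ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
          }
        }

        // Merge whatever is left in memory and close everything.
        ramWriter.close();
        destWriter.addIndexesNoOptimize(new Directory[] { ramDir });
        destWriter.close();
      }
    }

As Mike points out above, stored fields (and term vectors, if any) would sit in the RAMDirectory unnecessarily with this approach, so a fair test would have to account for that.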