About flush by RAM

I was playing around with something similar on the 2.1 codebase (roll-my-own) and had the quirk of possibly *very* large incoming documents. As in 250M. So I had to put in some logic to say, in effect, "if the incoming doc is completely ridiculous, flush now". I should say that I was impressed that we could even index the bloody thing at all!
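For the curious, the guard boiled down to something like the sketch below. This is reconstructed from memory, so the names and the 48 MB budget are illustrative; it leans on IndexWriter.ramSizeInBytes() and the public flush() call (both in the 2.3 API; on 2.1 I had rough equivalents), plus a caller-supplied size estimate for the incoming doc, e.g. its raw text length:

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class GuardedAdder {
    // Flush once buffered docs are estimated past this many bytes (tune to taste).
    private static final long THRESHOLD = 48L * 1024 * 1024;

    /**
     * Add a document, flushing first if this one doc alone would blow the budget.
     * docBytes is a caller-supplied rough size estimate (e.g. raw text length).
     */
    static void addWithGuard(IndexWriter writer, Document doc, long docBytes)
            throws IOException {
        // "If the incoming doc is completely ridiculous, flush now" so the
        // monster starts against an empty buffer rather than a nearly full one.
        if (docBytes > THRESHOLD || writer.ramSizeInBytes() + docBytes > THRESHOLD) {
            writer.flush();
        }
        writer.addDocument(doc);
        // The normal roll-your-own flush-by-RAM check.
        if (writer.ramSizeInBytes() > THRESHOLD) {
            writer.flush();
        }
    }
}
```

The only point of the up-front check is that the silly doc gets the whole buffer to itself instead of landing on top of everything already accumulated.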
Is this something that still needs to be guarded against in 2.3? In other words, should the flush size be chosen so that (current RAM size + the increment caused by the biggest doc possible in your data set) is < the threshold? You see the problem here. In the silliest case, where I have one HUGE document that barely fits in memory, I'd have to set the threshold very low, flushing early and often, unless there was a "flush now" bit of logic for silly docs.

If you must know, the huge doc was the 23-volume "Encyclopedia of Michigan Civil War Volunteers". Yeah, yeah, sure, I could have done other things than index it as a single doc. But indexing speed wasn't really an issue, the PM wanted it that way, and all it meant was that indexing took 6 hours rather than, perhaps, 4 on a static data set, so I didn't care enough to do more work.

Don't get me wrong, having a flush by RAM size is sweet. And for any reasonable corpus, especially one with relatively constant input docs, it should be very nice indeed. I'm just wondering about the outlier cases, since I seem to run into them, siiiggghh. But "that's why they pay me the big bucks" <G>.

Thanks,
Erick
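P.S. For concreteness, the 2.3 knob my question is about is, as I read the javadocs, IndexWriter.setRAMBufferSizeMB(). Roughly (path and budget are placeholders):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class FlushByRam {
    public static void main(String[] args) throws IOException {
        // 'true' creates a fresh index at this (placeholder) path.
        IndexWriter writer =
                new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
        // Flush whenever buffered docs use roughly this much RAM
        // (the 2.3 default is 16 MB).
        writer.setRAMBufferSizeMB(32.0);
        // ... addDocument() calls here ...
        writer.close();
    }
}
```

If a single doc's increment can be bigger than whatever headroom is left under that budget, it seems like I'm back to needing the "flush now" guard above, hence the question.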