Michael: Thanks, that's what I figured, but it's nice to have confirmed.
Erick

On Jan 20, 2008 11:59 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> Hi Erick,
>
> Yes, you do still need to guard against this case in 2.3. IndexWriter
> checks the RAM usage after each doc is processed and flushes when
> that's over the limit.
>
> However, the memory consumed by a very large doc should be quite a bit
> less than before, because in 2.3 IndexWriter makes more efficient
> use of RAM.
>
> Mike
>
> Erick Erickson wrote:
>
> > About flush by RAM
> >
> > I was playing around with something similar on the 2.1 codebase
> > (roll-my-own) and had the quirk of a possible *very* large incoming
> > document. As in 250M. So I had to put some logic in to try to say,
> > in effect, "if the incoming doc is completely ridiculous, flush now".
> > I should say that I was impressed that we could even index the
> > bloody thing at all!
> >
> > Is this something that still needs to be guarded against in 2.3? In
> > other words, should the flush size be chosen so that (current RAM
> > size + the increment caused by the biggest doc possible in your data
> > set) is < the threshold?
> >
> > You see the problem here. In the silliest case, where I have one HUGE
> > document that barely fits in memory, I'd have to set the threshold
> > very low, flushing early and often, unless there was a "flush now"
> > bit of logic for silly docs.
> >
> > If you must know, the huge doc was the 23 volume "encyclopedia of
> > Michigan Civil War Volunteers". Yeah, yeah, sure. I could have done
> > other things than index it as a single doc, but since indexing speed
> > wasn't really an issue and the PM wanted it that way and all it meant
> > was that indexing took 6 hours rather than, perhaps, 4 on a static
> > data set, I didn't care enough to do more work.
> >
> > Don't get me wrong, having a flush by RAM size is sweet. And for any
> > reasonable corpus, especially one with relatively constant input
> > docs, it should be very nice indeed. I'm wondering about the outlier
> > cases since I seem to run into them, siiiggghh.
> > But "that's why they pay me the big bucks" <G>.
> >
> > Thanks
> > Erick
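
For anyone reading this thread later, here is a minimal sketch of the arithmetic being discussed. It is not Lucene code: the class and method names are made up for illustration, sizes are in MB, and it assumes (as a simplification) that a doc's buffered footprint equals its size, which per Mike's note is pessimistic for 2.3. It compares checking the buffer only after each doc is added against adding a "flush now" guard in front of an oversized doc.

```java
import java.util.Arrays;

/** Hypothetical simulation of flush-by-RAM bookkeeping; not part of Lucene. */
final class FlushByRamSketch {

    /** Peak buffered size when the limit is checked only after each doc is added. */
    static long peakAfterDocCheck(long[] docSizes, long threshold) {
        long buffered = 0;
        long peak = 0;
        for (long doc : docSizes) {
            buffered += doc;                  // the whole doc is buffered first
            peak = Math.max(peak, buffered);
            if (buffered >= threshold) {      // the check happens after the doc
                buffered = 0;                 // flush
            }
        }
        return peak;
    }

    /** Same policy, plus a "flush now" guard before a doc that would overshoot. */
    static long peakWithPreFlushGuard(long[] docSizes, long threshold) {
        long buffered = 0;
        long peak = 0;
        for (long doc : docSizes) {
            if (buffered > 0 && buffered + doc > threshold) {
                buffered = 0;                 // flush now, before taking the big doc
            }
            buffered += doc;
            peak = Math.max(peak, buffered);
            if (buffered >= threshold) {
                buffered = 0;                 // regular post-doc flush
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Fifteen 1 MB docs followed by one 250 MB doc, with a 16 MB threshold.
        long[] docSizes = new long[16];
        Arrays.fill(docSizes, 1L);
        docSizes[15] = 250L;

        System.out.println("check after doc only: peak "
                + peakAfterDocCheck(docSizes, 16L) + " MB");
        System.out.println("with flush-now guard: peak "
                + peakWithPreFlushGuard(docSizes, 16L) + " MB");
    }
}
```

With these made-up numbers the peak is 265 MB without the guard and 250 MB with it. The guard only removes whatever was already buffered from the peak; the huge doc itself still has to fit, so lowering the threshold alone does not help in the one-HUGE-document case. In a real 2.3 setup the threshold would presumably be the value passed to `IndexWriter.setRAMBufferSizeMB`, and the guard would live in the application code that feeds the writer.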