About flush by RAM

I was playing around with something similar on the 2.1 codebase (roll-my-own)
and had the quirk of a possible *very* large incoming document. As in 250M.
So I had to put some logic in to say, in effect, "if the incoming doc is
completely ridiculous, flush now". I should say that I was impressed that
we could even index the bloody thing at all!

Is this something that still needs to be guarded against in 2.3? In other
words, should the flush size be chosen so that (current RAM size + the
increment caused by the biggest doc possible in your data set) stays below
the threshold?

You see the problem here. In the silliest case, where I have one HUGE
document that barely fits in memory, I'd have to set the threshold very low,
flushing early and often unless there was a "flush now" bit of logic for
silly docs.
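For illustration, here's roughly the kind of guard I mean. This is only a
sketch, and it assumes the 2.3-era IndexWriter methods setRAMBufferSizeMB(),
ramSizeInBytes() and flush() behave the way I understand them; the per-doc
size estimate is entirely hand-rolled and the numbers are made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class GuardedIndexer {
        // Normal flush-by-RAM threshold plus a hand-rolled "silly doc" cutoff.
        // Both numbers are illustrative only.
        private static final double RAM_BUFFER_MB = 48.0;
        private static final long HUGE_DOC_BYTES = 100L * 1024 * 1024;

        private final IndexWriter writer;

        public GuardedIndexer(IndexWriter writer) {
            this.writer = writer;
            writer.setRAMBufferSizeMB(RAM_BUFFER_MB); // usual flush-by-RAM behavior
        }

        public void addDocument(Document doc, long approxDocBytes) throws Exception {
            // If this doc alone is ridiculous, or buffered RAM plus this doc
            // would blow past the threshold, flush *before* adding it so the
            // big one starts from an empty buffer.
            if (approxDocBytes > HUGE_DOC_BYTES
                    || writer.ramSizeInBytes() + approxDocBytes
                       > RAM_BUFFER_MB * 1024 * 1024) {
                writer.flush();
            }
            writer.addDocument(doc);
        }
    }

That's more or less the "flush now for silly docs" logic I bolted onto 2.1;
the question is whether 2.3 still needs it.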

If you must know, the huge doc was the 23-volume "encyclopedia of Michigan
Civil War Volunteers". Yeah, yeah, sure, I could have done something other
than index it as a single doc. But indexing speed wasn't really an issue,
the PM wanted it that way, and all it meant was that indexing a static data
set took 6 hours rather than, perhaps, 4, so I didn't care enough to do
more work.

Don't get me wrong, having a flush by RAM size is sweet. And for any
reasonable corpus, especially one with relatively constant input docs, it
should be very nice indeed. I'm wondering about the outlier cases since I
seem to run into them, siiiggghh. But "that's why they pay me the big
bucks" <G>.

Thanks
Erick
