On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
Yes, the code re-computes the level of a given segment from the current values of maxBufferedDocs & mergeFactor. But when these values have changed (or segments were flushed by RAM usage rather than by maxBufferedDocs), the way it computes the level no longer produces the logarithmic policy it's trying to implement, I think.
The algorithm gradually re-adjusts toward the latest maxBufferedDocs & mergeFactor - see case 3 of the "Overview of merge policy" comment in the code. With the modification that uses RAM or file size as the segment size, the algorithm would work from maxBufferedSize & mergeFactor instead. Let's say maxBufferedDocs or maxBufferedSize is the base size. LUCENE-845 complains that the merge behaviour for segments <= base size is in some cases not logarithmic. It's a tradeoff: we always keep small segments in check. The algorithm reflects the tradeoff made for segments <= base size.
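To make the idea concrete, here is a minimal sketch of how a logarithmic merge policy might derive a segment's level from its size in bytes rather than its doc count. The names (`level`, `baseSize`, `mergeFactor`) are illustrative, not Lucene's actual fields; the point is just that level 0 covers segments up to the base size, and each higher level covers segments mergeFactor times larger:

```java
// Hypothetical sketch: derive a segment's merge "level" from size in bytes.
// baseSize plays the role of maxBufferedSize; names are illustrative only.
public class LevelSketch {
    static int level(long sizeInBytes, long baseSize, int mergeFactor) {
        // Level 0 holds segments up to baseSize; each subsequent level
        // holds segments up to mergeFactor times the previous bound.
        int level = 0;
        long upperBound = baseSize;
        while (sizeInBytes > upperBound) {
            upperBound *= mergeFactor;
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        // e.g. baseSize = 1 MB, mergeFactor = 10
        System.out.println(level(500_000L, 1_000_000L, 10));    // 0
        System.out.println(level(5_000_000L, 1_000_000L, 10));  // 1
        System.out.println(level(50_000_000L, 1_000_000L, 10)); // 2
    }
}
```

Under this scheme, any mergeFactor segments on the same level merge into a segment roughly one level higher, which is what keeps the total segment count logarithmic in the index size - for segments above the base size, at least.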
Exactly: when the logarithmic policy works "correctly" (you don't change mergeFactor/maxBufferedDocs and your docs are all uniform in size), it does achieve this "merge roughly equal sizes in bytes" behaviour (yes, those two numbers are roughly equal). Though now I have to go ponder KS's Fibonacci series approach!
It doesn't have to be a Fibonacci series; logarithmic would work well too. The main difference is that KS can choose any segments to merge, not just adjacent ones, so it may find better candidates for a merge.
Basically, this would keep the same logarithmic approach now, but derive levels somehow from the net size in bytes.
Exactly! Levels defined by size in bytes.

Cheers,
Ning