Hi

Today, LogMP allows you to set different thresholds for segment sizes, thereby letting you control the largest segment that will be considered for a merge, and hence the largest segment your index will hold (=~ threshold * mergeFactor).
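For reference, here's roughly what that configuration looks like today -- just a sketch, assuming LogByteSizeMergePolicy and its setters (setMaxMergeMB / setMaxMergeMBForOptimize / setMergeFactor):

import org.apache.lucene.index.LogByteSizeMergePolicy;

public class ConfigureLogMP {
  public static void main(String[] args) {
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMergeFactor(10);               // merge up to 10 segments at a time
    mp.setMaxMergeMB(2048.0);            // segments > 2GB are not candidates for normal merges
    mp.setMaxMergeMBForOptimize(2048.0); // same cap for optimize-triggered merges
    // Largest segment you can end up with =~ 2048MB * 10 = ~20GB,
    // but only if ten near-2GB segments happen to line up.
  }
}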
So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve the goal -- if the index contains 5GB and 7GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20GB segments, whether I'm merging 10 segments together or only 2. Once a segment reaches 20GB, it can rest peacefully, at least until I increase the threshold.

So I wonder, first, whether this threshold (i.e., the largest segment size you would like to end up with) is more natural to set from the application level than the current thresholds? I.e., wouldn't it be a simpler knob to set than doing weird calculations that depend on both maxMergeMB(ForOptimize) and mergeFactor?

Second, should this be an addition to LogMP, or a different type of MP -- one that adheres to only those two factors (perhaps the segment size threshold should be settable separately for optimize and regular merges)? It could pick segments for merge such that it maximizes the resulting segment's size (i.e., not necessarily merging in sequential order), but never merge more than mergeFactor segments at once; a rough sketch of that selection logic is in the P.S. below.

I guess, if we think that maxResultSegmentSizeMB is more intuitive, application-wise, than the current thresholds, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things.

What do you think of this? Am I trying to optimize too much? :)

Shai
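P.S. To make the selection idea concrete, here's a minimal, hypothetical sketch. All the names (maxResultSegmentSizeMB, pickOneMerge) are made up, and it ignores the real MergePolicy API -- it only shows the packing heuristic. Note it's greedy largest-first, so it approximates rather than truly maximizes the result size (doing that exactly would be a small knapsack problem):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MaxResultSizeSketch {

  // Hypothetical greedy selection: walk candidate segment sizes (MB) from
  // largest to smallest and pack them into a single merge, as long as the
  // resulting segment stays under maxResultSegmentSizeMB and we use at most
  // mergeFactor inputs. Segments already at/over the threshold rest peacefully.
  static List<Double> pickOneMerge(List<Double> segmentSizesMB,
                                   double maxResultSegmentSizeMB,
                                   int mergeFactor) {
    List<Double> sorted = new ArrayList<Double>(segmentSizesMB);
    Collections.sort(sorted, Collections.reverseOrder());

    List<Double> picked = new ArrayList<Double>();
    double resultSize = 0;
    for (double size : sorted) {
      if (size >= maxResultSegmentSizeMB) continue; // already "done", leave it alone
      if (picked.size() == mergeFactor) break;      // respect the input cap
      if (resultSize + size > maxResultSegmentSizeMB) continue; // try a smaller segment
      picked.add(size);
      resultSize += size;
    }
    // A merge needs at least two inputs.
    return picked.size() >= 2 ? picked : Collections.<Double>emptyList();
  }

  public static void main(String[] args) {
    // The 5GB + 7GB case from above, plus a 1GB segment, with a 20GB target:
    // all three get packed into one ~13GB merge instead of being skipped.
    System.out.println(pickOneMerge(Arrays.asList(7168.0, 5120.0, 1024.0), 20480.0, 10));
  }
}

In a real MP this selection would of course run repeatedly over the live segment infos rather than once over a list of sizes, but the two knobs (maxResultSegmentSizeMB + mergeFactor) are the whole configuration surface.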