On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
> Actually is #2 a hard requirement?
A lot of Lucene users depend on having document number correspond to
age, I think. ISTR Hatcher at least recommending techniques that
require it.
> Do the loose ports of Lucene
> (KinoSearch, Ferret, etc.) also follow this restriction?
KS: Nope. So you can't use those tricks.
> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?
Here's the Fibonacci-series-based algorithm used in KinoSearch, taken
from MultiReader:

    sub segreaders_to_merge {
        my ( $self, $all ) = @_;
        return unless @{ $self->{sub_readers} };
        return @{ $self->{sub_readers} } if $all;

        # sort by ascending size in docs
        my @sorted_sub_readers
            = sort { $a->num_docs <=> $b->num_docs } @{ $self->{sub_readers} };

        # find sparsely populated segments
        my $total_docs = 0;
        my $threshold  = -1;
        for my $i ( 0 .. $#sorted_sub_readers ) {
            $total_docs += $sorted_sub_readers[$i]->num_docs;
            if ( $total_docs < fibonacci( $i + 5 ) ) {
                $threshold = $i;
            }
        }

        # if any of the segments are sparse, return their readers
        if ( $threshold > -1 ) {
            return @sorted_sub_readers[ 0 .. $threshold ];
        }
        else {
            return;
        }
    }

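To see which segments that rule picks, here is a rough standalone sketch of the same selection logic. It makes two assumptions that are not in the snippet above: fibonacci() is a hypothetical helper (the real one is not shown here) taken to follow the series 1, 1, 2, 3, 5, 8, ..., so that fibonacci(5) is 5, and sizes_to_merge() is an illustrative name that works on plain doc counts rather than segment readers.

```perl
use strict;
use warnings;

# Hypothetical helper, assumed to satisfy fibonacci(1) == fibonacci(2) == 1.
sub fibonacci {
    my $n = shift;
    my ( $prev, $curr ) = ( 0, 1 );
    ( $prev, $curr ) = ( $curr, $prev + $curr ) for 2 .. $n;
    return $curr;
}

# Illustrative stand-in for segreaders_to_merge: takes doc counts,
# returns the counts of the segments that would be merged.
sub sizes_to_merge {
    my @sorted = sort { $a <=> $b } @_;

    my $total_docs = 0;
    my $threshold  = -1;
    for my $i ( 0 .. $#sorted ) {
        # running total of docs in the $i + 1 smallest segments
        $total_docs += $sorted[$i];
        $threshold = $i if $total_docs < fibonacci( $i + 5 );
    }

    # everything up to the last index that passed the test gets merged
    return $threshold > -1 ? @sorted[ 0 .. $threshold ] : ();
}

# Segments of 1000, 10, 3 and 1 docs: only the two smallest qualify.
print join( ", ", sizes_to_merge( 1000, 10, 3, 1 ) ), "\n";    # prints "1, 3"
```

With those sizes the running totals are 1, 4, 14 and 1014 against limits of 5, 8, 13 and 21, so the last index that passes is 1 and only the 1-doc and 3-doc segments are merged. Two large segments on their own, say 500 and 1000 docs, never fall under the limit and are left alone.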
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/