On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:

> Actually is #2 a hard requirement?

A lot of Lucene users depend on document numbers corresponding to age, I think. ISTR Hatcher at least recommending techniques that require it.

> Do the loose ports of Lucene
> (KinoSearch, Ferret, etc.) also follow this restriction?

KS: Nope.  So you can't use those tricks.

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

Here's the Fibonacci-series-based algorithm used in KinoSearch, taken from MultiReader:

sub segreaders_to_merge {
    my ( $self, $all ) = @_;
    return unless @{ $self->{sub_readers} };
    return @{ $self->{sub_readers} } if $all;

    # sort by ascending size in docs
    my @sorted_sub_readers
        = sort { $a->num_docs <=> $b->num_docs } @{ $self->{sub_readers} };

    # find sparsely populated segments
    my $total_docs = 0;
    my $threshold  = -1;
    for my $i ( 0 .. $#sorted_sub_readers ) {
        $total_docs += $sorted_sub_readers[$i]->num_docs;
        if ( $total_docs < fibonacci( $i + 5 ) ) {
            $threshold = $i;
        }
    }

    # if any of the segments are sparse, return their readers
    if ( $threshold > -1 ) {
        return @sorted_sub_readers[ 0 .. $threshold ];
    }
    else {
        return;
    }
}
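
To make the selection rule easier to try out, here is a minimal standalone sketch that applies the same loop to plain document counts instead of SegReader objects. The fibonacci() helper is not shown in the excerpt above, so the version here is a guess that assumes the convention F(1) = F(2) = 1; sizes_to_merge() is likewise a hypothetical stand-in name, not part of KinoSearch.

```perl
use strict;
use warnings;

# Hypothetical helper: the excerpt above calls fibonacci() without showing it.
# Assumes F(1) = F(2) = 1, so F(5) = 5, F(6) = 8, F(7) = 13, F(8) = 21.
sub fibonacci {
    my $n = shift;
    return 0 if $n <= 0;
    my ( $prev, $curr ) = ( 0, 1 );
    ( $prev, $curr ) = ( $curr, $prev + $curr ) for 2 .. $n;
    return $curr;
}

# Hypothetical stand-in for segreaders_to_merge: takes a list of per-segment
# doc counts and returns the counts of the segments that would be merged.
sub sizes_to_merge {
    my @sorted = sort { $a <=> $b } @_;    # ascending size in docs

    my $total_docs = 0;
    my $threshold  = -1;
    for my $i ( 0 .. $#sorted ) {
        $total_docs += $sorted[$i];
        # remember the last position whose running total is still "sparse"
        $threshold = $i if $total_docs < fibonacci( $i + 5 );
    }

    return $threshold > -1 ? @sorted[ 0 .. $threshold ] : ();
}

# Three tiny segments and one big one: only the tiny ones are selected.
my @merge = sizes_to_merge( 1000, 3, 2, 4 );
print "merge: @merge\n";    # prints "merge: 2 3 4"
```

With those inputs the running totals are 2, 5, 9 and 1009, against limits of 5, 8, 13 and 21, so the first three segments qualify and the 1000-doc segment is left alone. Because the loop keeps the last index that passes, everything smaller than that segment is merged along with it, even if an earlier running total did not pass on its own.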

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


