On Jun 6, 2006, at 6:08 PM, David Balmain wrote:
> What happens when there are deletes? Which files should I look in to
> see how this works? I really need to get my head around the KinoSearch
> merge model.
Let's say we're indexing a book. It has three pages.
page 1 => "peas porridge hot"
page 2 => "peas porridge cold"
page 3 => "peas porridge in the pot, nine days old"
Here's what Lucene does:
First, create a mini-inverted index for each page...
hot => 1
peas => 1
porridge => 1
cold => 2
peas => 2
porridge => 2
days => 3
in => 3
nine => 3
old => 3
peas => 3
porridge => 3
pot => 3
the => 3
Then combine the indexes...
cold => 2
days => 3
hot => 1
in => 3
nine => 3
old => 3
peas => 1, 2, 3
porridge => 1, 2, 3
pot => 3
the => 3
... and here's what KinoSearch does:
First, dump everything into one giant pool...
peas => 1
porridge => 1
hot => 1
peas => 2
porridge => 2
cold => 2
peas => 3
porridge => 3
in => 3
pot => 3
the => 3
nine => 3
days => 3
old => 3
...then sort the whole thing in one go.
Make sense?
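If it helps, here's a little Python sketch of the two strategies (this is just illustration, not actual KinoSearch or Lucene code -- all names are made up). Both roads lead to the same inverted index:

```python
pages = {
    1: "peas porridge hot",
    2: "peas porridge cold",
    3: "peas porridge in the pot nine days old",
}

# Lucene-style: build one mini inverted index per page, then merge them.
def lucene_style(pages):
    minis = []
    for page_num, text in pages.items():
        mini = {}
        for term in text.split():
            mini.setdefault(term, []).append(page_num)
        minis.append(mini)
    merged = {}
    for mini in minis:
        for term, postings in mini.items():
            merged.setdefault(term, []).extend(postings)
    return dict(sorted(merged.items()))

# KinoSearch-style: dump every (term, page) pair into one pool, sort once.
def kinosearch_style(pages):
    pool = [(term, page_num)
            for page_num, text in pages.items()
            for term in text.split()]
    pool.sort()
    index = {}
    for term, page_num in pool:
        index.setdefault(term, []).append(page_num)
    return index

assert lucene_style(pages) == kinosearch_style(pages)
```

Same output, different path: Lucene pays for many small merges, KinoSearch pays for one big sort.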
The big problem with the KinoSearch method is that you can't just
keep dumping stuff into an array indefinitely -- you'll run out of
memory, duh! So what you need is an object that looks like an array
that you can keep dumping stuff into forever. Then you "sort" that
"array".
That's where the external sort algorithm comes in. The sortex object
is basically a PriorityQueue of unlimited size, but which never
occupies more than 20 or 30 megs of RAM because it periodically sorts
and flushes its payload to disk. It recovers that stuff from disk
later -- in sorted order -- when it's in fetching mode.
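Here's a toy Python version of that idea (my own sketch -- the real sortex lives in C and the names here are invented): buffer entries in RAM, flush sorted runs to disk when the buffer fills, then stream everything back in sorted order during fetching mode.

```python
import heapq
import os
import tempfile

class SortPool:
    def __init__(self, max_buffer=4):    # tiny cap so the sketch flushes
        self.buffer = []
        self.runs = []                    # paths of sorted runs on disk
        self.max_buffer = max_buffer

    def feed(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_buffer:
            self._flush()

    def _flush(self):
        # Sort the in-memory payload and spill it to a temp file.
        self.buffer.sort()
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            for item in self.buffer:
                f.write(item + "\n")
        self.runs.append(path)
        self.buffer = []

    def fetch_all(self):
        # Fetching mode: merge the sorted on-disk runs with any
        # still-buffered leftovers, yielding items in sorted order.
        self.buffer.sort()
        streams = [(line.rstrip("\n") for line in open(path))
                   for path in self.runs]
        yield from heapq.merge(self.buffer, *streams)

pool = SortPool()
for term in "peas porridge hot peas porridge cold peas porridge".split():
    pool.feed(term)
result = list(pool.fetch_all())
```

Memory use is bounded by the buffer cap no matter how much you feed in -- that's the whole trick.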
If you want to spelunk KinoSearch to see how this happens, start with
InvIndexer::add_doc(). After some minor fiddling, it feeds
SegWriter::add_doc(). SegWriter goes through each field, having
TokenBatch invert the field's contents, feeding the inverted and
serialized but unordered postings into PostingsWriter (which is where
the external sort object lives), and writing the norms byte. Last,
SegWriter hands the Doc object to FieldsWriter so that it can write
the stored fields.
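In pseudo-Python, that chain looks something like this (class names mirror the prose, but the bodies are my invention -- the real thing is Perl and C):

```python
def invert(field_value):
    """TokenBatch stand-in: tokenize and produce (term, position) pairs."""
    return [(term, pos) for pos, term in enumerate(field_value.split())]

class SegWriter:
    def __init__(self):
        self.sort_pool = []    # PostingsWriter's unsorted pool
        self.norms = {}        # field -> list of norm bytes
        self.stored = []       # FieldsWriter's stored documents
        self.doc_num = 0

    def add_doc(self, doc):
        for field, value in doc.items():
            postings = invert(value)
            # Feed the inverted but unordered postings into the pool.
            for term, pos in postings:
                self.sort_pool.append((field, term, self.doc_num, pos))
            # Write the norms byte (here: just the token count).
            self.norms.setdefault(field, []).append(len(postings))
        # Hand the doc off for stored-field output.
        self.stored.append(doc)
        self.doc_num += 1

writer = SegWriter()
writer.add_doc({"content": "peas porridge hot"})
writer.add_doc({"content": "peas porridge cold"})
```

Note that nothing here sorts or merges anything -- the pool just keeps growing.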
The most important part of the previous chain is the step that never
happened: nobody ever invoked SegmentMerger by calling the equivalent
of Lucene's maybeMergeSegments(). There IS no SegmentMerger in
KinoSearch.
The rest of the process takes place when InvIndexer::finish() gets
called. This time, InvIndexer has a lot to do.
First, InvIndexer has to decide which segments need to be merged, if
any, which it does using an algorithm based on the Fibonacci series.
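One plausible sketch of the flavor of such a policy (this is my guess at the shape of the rule, not KinoSearch's exact algorithm): bucket each segment by the largest Fibonacci number not exceeding its doc count, and merge any tier holding more than one segment.

```python
import bisect

FIBS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

def tier(doc_count):
    # Index of the largest Fibonacci number <= doc_count.
    return bisect.bisect_right(FIBS, doc_count) - 1

def segments_to_merge(doc_counts):
    tiers = {}
    for seg, count in enumerate(doc_counts):
        tiers.setdefault(tier(count), []).append(seg)
    # Any tier with two or more segments is a merge candidate.
    return [segs for segs in tiers.values() if len(segs) > 1]
```

The effect is that small segments get merged often and cheaply, while big segments get touched rarely.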
If there are segments that need mergin', InvIndexer feeds each one of
them to SegWriter::add_segment(). SegWriter has DelDocs generate a
doc map which maps around deleted documents (just like Lucene). Next
it has FieldInfos reconcile the field defs and create a field number
map, which maps field numbers from the segment that's about to get
merged away to field numbers for the current segment. SegWriter
merges the norms itself. Then it calls FieldsWriter::add_segment(),
which reads fields off disk (without decompressing compressed fields,
or creating document objects, or doing anything important except
mapping to new field numbers) and writes them into their new home in
the current segment. Last, SegWriter arranges for
PostingsWriter::add_segment to dump all the postings from the old
segment into the current sort pool -- which *still* hasn't been
sorted -- mapping to new field and document numbers as it goes.
(Think of add_segment as add_doc on steroids.)
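Here's a hypothetical sketch of the two remapping steps (my own minimal code, not KinoSearch's): a doc map that skips over deleted documents, and a field number map that reconciles field numbering between segments.

```python
def build_doc_map(max_doc, deleted):
    """Map old doc numbers to new ones; deleted docs map to None."""
    doc_map, new_num = {}, 0
    for old in range(max_doc):
        if old in deleted:
            doc_map[old] = None
        else:
            doc_map[old] = new_num
            new_num += 1
    return doc_map

def remap_postings(postings, doc_map, field_map, doc_base=0):
    """Carry old-segment postings into the current segment's numbering."""
    out = []
    for field_num, term, doc_num in postings:
        new_doc = doc_map[doc_num]
        if new_doc is None:           # posting for a deleted doc: drop it
            continue
        out.append((field_map[field_num], term, doc_base + new_doc))
    return out

doc_map = build_doc_map(3, deleted={1})
postings = [(0, "peas", 0), (0, "peas", 1), (0, "peas", 2)]
remapped = remap_postings(postings, doc_map, field_map={0: 2}, doc_base=10)
```

This is how deletions actually take effect: the deleted doc's postings simply never make it into the new segment.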
Now that all documents and all merge-worthy segments have been
processed, it's finally time to deal with the sort pool. InvIndexer
calls SegWriter::finish(), which calls PostingsWriter::finish().
PostingsWriter::finish() does a little bit in Perl, then hands off to
a heavy-duty C routine that goes through the sort pool one posting at
a time, writing the .frq and .prx files itself, and feeding
TermInfosWriter so that it can write the .tis and .tii files.
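In outline, that final pass looks like this (a rough sketch -- file formats elided, plain lists standing in for the output files): walk the sorted pool one posting at a time, accumulating each term's postings (the .frq/.prx payload) and recording one dictionary entry per term (the .tis/.tii payload).

```python
def write_postings(sorted_pool):
    term_dict = []        # stand-in for TermInfosWriter's .tis entries
    postings_file = []    # stand-in for the .frq/.prx data
    current_term, current_docs = None, []
    for term, doc_num in sorted_pool:
        if term != current_term:
            # Term boundary: flush the finished term's postings and
            # record where they landed.
            if current_term is not None:
                term_dict.append((current_term, len(postings_file)))
                postings_file.append(current_docs)
            current_term, current_docs = term, []
        current_docs.append(doc_num)
    if current_term is not None:
        term_dict.append((current_term, len(postings_file)))
        postings_file.append(current_docs)
    return term_dict, postings_file

pool = sorted([("peas", 1), ("hot", 1), ("peas", 2), ("cold", 2)])
term_dict, postings_file = write_postings(pool)
```

Because the pool is already sorted, one sequential pass suffices -- no random access, no per-term lookups.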
SegWriter::finish() also invokes closing routines for the
FieldsWriter, the norms filehandles, and so on. Last, it writes the
compound file. (For simplicity's sake, and because there isn't much
benefit to using the non-compound format under the KinoSearch merge
model, KinoSearch always uses the compound format).
Now that all the writing is complete, InvIndexer has to commit the
changes by rewriting the 'segments' file. One interesting aspect of
the KinoSearch merge model is that no matter how many documents you
add or segments you merge, if the process gets interrupted at any
time up till that single commit, the index remains unchanged. In
KinoSearch, InvIndexer handles deletions too (IndexReader isn't even
a public class), and deletions -- at least those deletions which
affect segments that haven't been optimized away -- are committed
at the same moment.
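The all-or-nothing property comes from the fact that nothing counts until the 'segments' file is swapped in. A minimal sketch of that pattern (my own code, using JSON for legibility -- the real segments file is a binary format): write everything to the side, then atomically replace the one file readers look at.

```python
import json
import os
import tempfile

def commit(index_dir, segment_names):
    # Write the new segments metadata to a temp file first...
    fd, tmp_path = tempfile.mkstemp(dir=index_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"segments": segment_names}, f)
    # ...then atomically replace 'segments'. Readers see either the
    # old file or the new one, never a partial write.
    os.replace(tmp_path, os.path.join(index_dir, "segments"))

index_dir = tempfile.mkdtemp()
commit(index_dir, ["_1", "_2"])
with open(os.path.join(index_dir, "segments")) as f:
    committed = json.load(f)
```

Kill the process anywhere before the replace, and the old 'segments' file -- and therefore the old index -- is still intact.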
Deletable files are deleted if possible, the write lock is
released... TADA! We're done.
... and since I spent so much time writing this up, I don't have time
to respond to the other points. Check y'all later...
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk