On Jun 6, 2006, at 6:08 PM, David Balmain wrote:
> What happens when there are deletes? Which files should I look in to
> see how this works? I really need to get my head around the KinoSearch
> merge model.
Let's say we're indexing a book. It has three pages.
page 1 => "peas porridge hot"
page 2 => "peas porridge cold"
page 3 => "peas porridge in the pot, nine days old"
Here's what Lucene does:
First, create a mini-inverted index for each page...
hot => 1
peas => 1
porridge => 1
cold => 2
peas => 2
porridge => 2
days => 3
in => 3
nine => 3
old => 3
peas => 3
porridge => 3
pot => 3
the => 3
Then combine the indexes...
cold => 2
days => 3
hot => 1
in => 3
nine => 3
old => 3
peas => 1, 2, 3
porridge => 1, 2, 3
pot => 3
the => 3
... and here's what KinoSearch does:
First, dump everything into one giant pool...
peas => 1
porridge => 1
hot => 1
peas => 2
porridge => 2
cold => 2
peas => 3
porridge => 3
in => 3
pot => 3
the => 3
nine => 3
days => 3
old => 3
...then sort the whole thing in one go.
Make sense?
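If it helps, here's a little Python sketch of the two strategies (this is just illustration, not actual KinoSearch or Lucene code -- all names are made up). Both roads lead to the same inverted index:

```python
pages = {
    1: "peas porridge hot",
    2: "peas porridge cold",
    3: "peas porridge in the pot nine days old",
}

# Lucene-style: build one mini inverted index per page, then merge them.
def lucene_style(pages):
    minis = []
    for page_num, text in pages.items():
        mini = {}
        for term in text.split():
            mini.setdefault(term, []).append(page_num)
        minis.append(mini)
    merged = {}
    for mini in minis:
        for term, postings in mini.items():
            merged.setdefault(term, []).extend(postings)
    return dict(sorted(merged.items()))

# KinoSearch-style: dump every (term, page) pair into one pool, sort once.
def kinosearch_style(pages):
    pool = [(term, page_num)
            for page_num, text in pages.items()
            for term in text.split()]
    pool.sort()
    index = {}
    for term, page_num in pool:
        index.setdefault(term, []).append(page_num)
    return index

assert lucene_style(pages) == kinosearch_style(pages)
```

Same output, different path: Lucene pays for many small merges, KinoSearch pays for one big sort.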
The big problem with the KinoSearch method is that you can't just
keep dumping stuff into an array indefinitely -- you'll run out of
memory, duh! So what you need is an object that looks like an array
that you can keep dumping stuff into forever. Then you "sort" that
"array".
That's where the external sort algorithm comes in. The sortex object
is basically a PriorityQueue of unlimited size, but which never
occupies more than 20 or 30 megs of RAM because it periodically sorts
and flushes its payload to disk. It recovers that stuff from disk
later -- in sorted order -- when it's in fetching mode.
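Here's a toy Python version of that idea (my own sketch -- the real sortex lives in C and the names here are invented): buffer entries in RAM, flush sorted runs to disk when the buffer fills, then stream everything back in sorted order during fetching mode.

```python
import heapq
import os
import tempfile

class SortPool:
    def __init__(self, max_buffer=4):    # tiny cap so the sketch flushes
        self.buffer = []
        self.runs = []                    # paths of sorted runs on disk
        self.max_buffer = max_buffer

    def feed(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_buffer:
            self._flush()

    def _flush(self):
        # Sort the in-memory payload and spill it to a temp file.
        self.buffer.sort()
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            for item in self.buffer:
                f.write(item + "\n")
        self.runs.append(path)
        self.buffer = []

    def fetch_all(self):
        # Fetching mode: merge the sorted on-disk runs with any
        # still-buffered leftovers, yielding items in sorted order.
        self.buffer.sort()
        streams = [(line.rstrip("\n") for line in open(path))
                   for path in self.runs]
        yield from heapq.merge(self.buffer, *streams)

pool = SortPool()
for term in "peas porridge hot peas porridge cold peas porridge".split():
    pool.feed(term)
result = list(pool.fetch_all())
```

Memory use is bounded by the buffer cap no matter how much you feed in -- that's the whole trick.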
If you want to spelunk KinoSearch to see how this happens, start with
InvIndexer::add_doc(). After some minor fiddling, it feeds
SegWriter::add_doc(). SegWriter goes through each field, having
TokenBatch invert the field's contents, feeding the inverted and
serialized but unordered postings into PostingsWriter (which is where
the external sort object lives), and writing the norms byte. Last,
SegWriter hands the Doc object to FieldsWriter so that it can write
the stored fields.
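In pseudo-Python, that chain looks something like this (class names mirror the prose, but the bodies are my invention -- the real thing is Perl and C):

```python
def invert(field_value):
    """TokenBatch stand-in: tokenize and produce (term, position) pairs."""
    return [(term, pos) for pos, term in enumerate(field_value.split())]

class SegWriter:
    def __init__(self):
        self.sort_pool = []    # PostingsWriter's unsorted pool
        self.norms = {}        # field -> list of norm bytes
        self.stored = []       # FieldsWriter's stored documents
        self.doc_num = 0

    def add_doc(self, doc):
        for field, value in doc.items():
            postings = invert(value)
            # Feed the inverted but unordered postings into the pool.
            for term, pos in postings:
                self.sort_pool.append((field, term, self.doc_num, pos))
            # Write the norms byte (here: just the token count).
            self.norms.setdefault(field, []).append(len(postings))
        # Hand the doc off for stored-field output.
        self.stored.append(doc)
        self.doc_num += 1

writer = SegWriter()
writer.add_doc({"content": "peas porridge hot"})
writer.add_doc({"content": "peas porridge cold"})
```

Note that nothing here sorts or merges anything -- the pool just keeps growing.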
The most important part of the previous chain is the step that never
happened: nobody ever invoked SegmentMerger by calling the equivalent
of Lucene's maybeMergeSegments(). There IS no SegmentMerger in
KinoSearch.
The rest of the process takes place when InvIndexer::finish() gets
called. This time, InvIndexer has a lot to do.
First, InvIndexer has to decide which segments need to be merged, if
any, which it does using an algorithm based on the Fibonacci series.
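One plausible sketch of the flavor of such a policy (this is my guess at the shape of the rule, not KinoSearch's exact algorithm): bucket each segment by the largest Fibonacci number not exceeding its doc count, and merge any tier holding more than one segment.

```python
import bisect

FIBS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

def tier(doc_count):
    # Index of the largest Fibonacci number <= doc_count.
    return bisect.bisect_right(FIBS, doc_count) - 1

def segments_to_merge(doc_counts):
    tiers = {}
    for seg, count in enumerate(doc_counts):
        tiers.setdefault(tier(count), []).append(seg)
    # Any tier with two or more segments is a merge candidate.
    return [segs for segs in tiers.values() if len(segs) > 1]
```

The effect is that small segments get merged often and cheaply, while big segments get touched rarely.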
If there are segments that need mergin', InvIndexer feeds each one of
them to SegWriter::add_segment(). SegWriter has DelDocs generate a
doc map which maps around deleted documents (just like Lucene). Next
it has FieldInfos reconcile the field defs and create a field number
map, which maps field numbers from the segment that's about to get
merged away to field numbers for the current segment. SegWriter
merges the norms itself. Then it calls FieldsWriter::add_segment(),
which reads fields off disk (without decompressing compressed fields,
or creating document objects, or doing anything important except
mapping to new field numbers) and writes them into their new home in
the current segment. Last, SegWriter arranges for
PostingsWriter::add_segment to dump all the postings from the old
segment into the current sort pool -- which *still* hasn't been
sorted -- mapping to new field and document numbers as it goes.
(Think of add_segment as add_doc on steroids.)
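Here's a hypothetical sketch of the two remapping steps (my own minimal code, not KinoSearch's): a doc map that skips over deleted documents, and a field number map that reconciles field numbering between segments.

```python
def build_doc_map(max_doc, deleted):
    """Map old doc numbers to new ones; deleted docs map to None."""
    doc_map, new_num = {}, 0
    for old in range(max_doc):
        if old in deleted:
            doc_map[old] = None
        else:
            doc_map[old] = new_num
            new_num += 1
    return doc_map

def remap_postings(postings, doc_map, field_map, doc_base=0):
    """Carry old-segment postings into the current segment's numbering."""
    out = []
    for field_num, term, doc_num in postings:
        new_doc = doc_map[doc_num]
        if new_doc is None:           # posting for a deleted doc: drop it
            continue
        out.append((field_map[field_num], term, doc_base + new_doc))
    return out

doc_map = build_doc_map(3, deleted={1})
postings = [(0, "peas", 0), (0, "peas", 1), (0, "peas", 2)]
remapped = remap_postings(postings, doc_map, field_map={0: 2}, doc_base=10)
```

This is how deletions actually take effect: the deleted doc's postings simply never make it into the new segment.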
Now that all documents and all merge-worthy segments have been
processed, it's finally time to deal with the sort pool. InvIndexer
calls SegWriter::finish(), which calls PostingsWriter::finish().
PostingsWriter::finish() does a little bit in Perl, then hands off to
a heavy-duty C routine that goes through the sort pool one posting at
a time, writing the .frq and .prx files itself, and feeding
TermInfosWriter so that it can write the .tis and .tii files.
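In outline, that final pass looks like this (a rough sketch -- file formats elided, plain lists standing in for the output files): walk the sorted pool one posting at a time, accumulating each term's postings (the .frq/.prx payload) and recording one dictionary entry per term (the .tis/.tii payload).

```python
def write_postings(sorted_pool):
    term_dict = []        # stand-in for TermInfosWriter's .tis entries
    postings_file = []    # stand-in for the .frq/.prx data
    current_term, current_docs = None, []
    for term, doc_num in sorted_pool:
        if term != current_term:
            # Term boundary: flush the finished term's postings and
            # record where they landed.
            if current_term is not None:
                term_dict.append((current_term, len(postings_file)))
                postings_file.append(current_docs)
            current_term, current_docs = term, []
        current_docs.append(doc_num)
    if current_term is not None:
        term_dict.append((current_term, len(postings_file)))
        postings_file.append(current_docs)
    return term_dict, postings_file

pool = sorted([("peas", 1), ("hot", 1), ("peas", 2), ("cold", 2)])
term_dict, postings_file = write_postings(pool)
```

Because the pool is already sorted, one sequential pass suffices -- no random access, no per-term lookups.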
SegWriter::finish() also invokes closing routines for the
FieldsWriter, the norms filehandles, and so on. Last, it writes the
compound file. (For simplicity's sake, and because there isn't much
benefit to using the non-compound format under the KinoSearch merge
model, KinoSearch always uses the compound format).
Now that all the writing is complete, InvIndexer has to commit the
changes by rewriting the 'segments' file. One interesting aspect of
the KinoSearch merge model is that no matter how many documents you
add or segments you merge, if the process gets interrupted at any
time up till that single commit, the index remains unchanged. In
KinoSearch, InvIndexer handles deletions too (IndexReader isn't even
a public class), and deletions -- at least those deletions which
affect segments that haven't been optimized away -- are committed
at the same moment.
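The all-or-nothing property comes from the fact that nothing counts until the 'segments' file is swapped in. A minimal sketch of that pattern (my own code, using JSON for legibility -- the real segments file is a binary format): write everything to the side, then atomically replace the one file readers look at.

```python
import json
import os
import tempfile

def commit(index_dir, segment_names):
    # Write the new segments metadata to a temp file first...
    fd, tmp_path = tempfile.mkstemp(dir=index_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"segments": segment_names}, f)
    # ...then atomically replace 'segments'. Readers see either the
    # old file or the new one, never a partial write.
    os.replace(tmp_path, os.path.join(index_dir, "segments"))

index_dir = tempfile.mkdtemp()
commit(index_dir, ["_1", "_2"])
with open(os.path.join(index_dir, "segments")) as f:
    committed = json.load(f)
```

Kill the process anywhere before the replace, and the old 'segments' file -- and therefore the old index -- is still intact.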
Deletable files are deleted if possible, the write lock is
released... TADA! We're done.
... and since I spent so much time writing this up, I don't have time
to respond to the other points. Check y'all later...
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk