On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
> I've actually been working on factoring DocumentsWriter, as a first
> step towards flexible indexing.

The way I handled this in KS was to turn Posting into a class akin to TermBuffer: the individual Posting object persists, but its values change.

Meanwhile, each Posting subclass has a Read_Raw method which generates a "RawPosting". RawPosting objects are a serialized, sortable, lowest common denominator form of Posting which every subclass must be able to export. They're allocated from a specialized MemoryPool, making them cheap to manufacture and to release.
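
In Java-ish pseudocode, the contract looks something like this (KS is C underneath, so the signature is illustrative):

   abstract class Posting {
      // Values are overwritten in place as the stream advances,
      // rather than allocating a fresh Posting per document.

      // Serialize the current values into a sortable,
      // lowest-common-denominator RawPosting, allocated from
      // the supplied pool.
      abstract RawPosting readRaw(MemoryPool pool);
   }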

RawPosting is the only form PostingsWriter is actually required to know about:

   // PostingsWriter loop:
   RawPosting rawPosting;
   while ((rawPosting = rawPostingQueue.pop()) != null) {
      writeRawPosting(rawPosting);
   }

> I agree we would have an abstract base Posting class that just tracks
> the term text.

IMO, the abstract base Posting class should not track text. It should include only one datum: a document number. This keeps it in line with the simplest IR definition for a "posting": one document matching one term.

   Posting:        doc num (abstract)
   MatchPosting:   doc num
   ScorePosting:   doc num, freq, per-doc boost, positions
   RichPosting:    doc num, freq, positions with per-position boost
   PayloadPosting: doc num, payload
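
Roughly, in Java-ish pseudocode (field and accessor names illustrative):

   abstract class Posting {
      protected int docNum;      // the one universal datum
      public int getDocNum() { return docNum; }
   }

   class MatchPosting extends Posting {
      // doc num only: pure boolean matching
   }

   class ScorePosting extends Posting {
      protected int freq;
      protected float docBoost;  // per-doc boost
      protected int[] positions;
      public int getFreq() { return freq; }
   }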

Then, at search time, you have a PostingList class which takes the place of TermDocs/TermPositions and uses an underlying Posting object to read the file. (PostingList and its subclasses don't know anything about file formats.)
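
Something like this, with skipTo() modeled on TermDocs (details illustrative):

   abstract class PostingList {
      // Advance the underlying Posting to the next doc;
      // returns false when the list is exhausted.
      abstract boolean next() throws IOException;

      // Skip ahead to the first doc >= target.
      abstract boolean skipTo(int target) throws IOException;

      // The reusable Posting object whose values next() refreshed.
      abstract Posting getPosting();
   }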

Each Posting subclass is associated with a subclass of TermScorer which implements its own Posting-subclass-specific scoring algorithm.

   // MatchPostingScorer scoring algo ...
   while (postingList.next()) {
      MatchPosting posting = (MatchPosting)postingList.getPosting();
      collector.collect(posting.getDocNum(), 1.0f);
   }

   // ScorePostingScorer scoring algo...
   while (postingList.next()) {
      ScorePosting posting = (ScorePosting)postingList.getPosting();
      int freq = posting.getFreq();
      float score = freq < TERMSCORER_SCORE_CACHE_SIZE
                  ? scoreCache[freq]            // cache hit
                  : sim.tf(freq) * weightValue;
      collector.collect(posting.getDocNum(), score);
   }

> And then the code that writes the current index format would plug into
> this and should be fairly small and easy to understand.

I'm pessimistic that anything that writes the current index format could be "easy to understand", because the spec is so dreadfully convoluted.

As I have argued before, the key is to have each Posting subclass wholly define a file format. That makes them pluggable, breaking the tight binding between the Lucene codebase and the Lucene file format spec.
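
Concretely, each Posting subclass would implement both directions of its codec, so that nothing else in the system needs to understand its disk layout. Roughly, with the stream types borrowed from Lucene's IndexInput/IndexOutput (method names illustrative):

   abstract class Posting {
      // ...
      // A subclass wholly defines its file format by implementing
      // both the decoder and the encoder.
      abstract void read(IndexInput in) throws IOException;
      abstract void write(IndexOutput out) throws IOException;
   }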

> Then there would also be plugins that just tap into the entire
> document (don't need inversion), like FieldsWriter.


Yes.  Here's how things are set up in KS:

   InvIndexer
      SegWriter
         DocWriter
         PostingsWriter
            LexWriter
         TermVectorsWriter
         // plug in more writers here?

Ideally, all of the writers under SegWriter would be subclasses of an abstract SegDataWriter class, and would implement addInversion() and addSegment(). SegWriter.addDoc() would look something like this:

   void addDoc(Document doc) {
      Inversion inversion = invert(doc);
      for (int i = 0; i < writers.length; i++) {
         writers[i].addInversion(inversion);
      }
   }

In practice, three of the writers are required (one for term dictionary/lexicon, one for postings, and one for some form of document storage), but the design allows for plugging in additional SegDataWriter subclasses.
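
The abstract parent would look something like this (only addInversion() and addSegment() are pinned down above; the SegReader parameter and finish() are illustrative):

   abstract class SegDataWriter {
      // Consume one inverted document.
      abstract void addInversion(Inversion inversion) throws IOException;

      // Bulk-add the data of an incoming segment, e.g. during a merge.
      abstract void addSegment(SegReader reader) throws IOException;

      // Flush buffered state and close this writer's files.
      abstract void finish() throws IOException;
   }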

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

