On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
> I've actually been working on factoring DocumentsWriter, as a first
> step towards flexible indexing.
The way I handled this in KS was to turn Posting into a class akin to
TermBuffer: the individual Posting object persists, but its values
change.
Meanwhile, each Posting subclass has a Read_Raw method which generates
a "RawPosting". RawPosting objects are a serialized, sortable, lowest
common denominator form of Posting which every subclass must be able
to export. They're allocated from a specialized MemoryPool, making
them cheap to manufacture and to release.
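
In Java terms, the shape is roughly this. (A hedged sketch: readRaw(),
MemoryPool, and the RawPosting internals are my illustrative names, not
real KS or Lucene APIs.)

// Sketch only. The persistent Posting object is reused as values change;
// readRaw() snapshots its current values into a pooled, sortable image.
abstract class Posting {
    protected int docNum;
    public int getDocNum() { return docNum; }

    // Serialize current values into a RawPosting allocated from the pool.
    abstract RawPosting readRaw(MemoryPool pool);
}

final class RawPosting {
    final byte[] bytes;     // serialized, lowest-common-denominator form
    RawPosting(byte[] bytes) { this.bytes = bytes; }
}

class MemoryPool { /* specialized allocator; placeholder */ }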
RawPosting is the only form PostingsWriter is actually required to
know about:

// PostingsWriter loop: drain the queue of serialized postings.
RawPosting rawPosting;
while ((rawPosting = rawPostingQueue.pop()) != null) {
    writeRawPosting(rawPosting);
}
> I agree we would have an abstract base Posting class that just tracks
> the term text.
IMO, the abstract base Posting class should not track text. It should
include only one datum: a document number. This keeps it in line with
the simplest IR definition for a "posting": one document matching one
term.
Posting: doc num (abstract)
  MatchPosting: doc num
  ScorePosting: doc num, freq, per-doc boost, positions
  RichPosting: doc num, freq, positions with per-position boost
  PayloadPosting: doc num, payload
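
In Java, that hierarchy might look like the following. (A sketch: the
field names and accessors, beyond the data listed above, are assumptions.)

abstract class Posting {
    protected int docNum;                  // the one shared datum
    public int getDocNum() { return docNum; }
}

class MatchPosting extends Posting {
    // doc num only: records that the doc matches the term
}

class ScorePosting extends Posting {
    protected int freq;
    protected float boost;                 // per-doc boost
    protected int[] positions;
    public int getFreq() { return freq; }
}

class RichPosting extends Posting {
    protected int freq;
    protected int[] positions;
    protected float[] positionBoosts;      // per-position boost
}

class PayloadPosting extends Posting {
    protected byte[] payload;
}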
Then, at search time, you have a PostingList class which takes the
place of TermDocs/TermPositions and uses an underlying Posting object
to read the file. (PostingList and its subclasses don't know anything
about file formats.)
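
Something like this, presumably. (Method names inferred from the scorer
loops below; not a real API.)

import java.io.IOException;

abstract class PostingList {
    // Advance to the next posting; return false when exhausted.
    public abstract boolean next() throws IOException;

    // The reusable Posting whose values were refreshed by next().
    public abstract Posting getPosting();
}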
Each Posting subclass is associated with a subclass of TermScorer
which implements its own Posting-subclass-specific scoring algorithm:

// MatchPostingScorer scoring algo: every hit gets a constant score.
while (postingList.next()) {
    MatchPosting posting = (MatchPosting) postingList.getPosting();
    collector.collect(posting.getDocNum(), 1.0f);
}

// ScorePostingScorer scoring algo: tf(freq) * weight, with a cache
// for small freq values.
while (postingList.next()) {
    ScorePosting posting = (ScorePosting) postingList.getPosting();
    int freq = posting.getFreq();
    float score = freq < TERMSCORER_SCORE_CACHE_SIZE
        ? scoreCache[freq]                // cache hit
        : sim.tf(freq) * weightValue;
    collector.collect(posting.getDocNum(), score);
}
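
For what it's worth, the scoreCache there would presumably be filled
once per scorer, along these lines (sim, weightValue, and
TERMSCORER_SCORE_CACHE_SIZE as in the loop above):

// Precompute tf(freq) * weight for small freqs so the hot loop
// can skip the Similarity call.
float[] scoreCache = new float[TERMSCORER_SCORE_CACHE_SIZE];
for (int f = 0; f < TERMSCORER_SCORE_CACHE_SIZE; f++) {
    scoreCache[f] = sim.tf(f) * weightValue;
}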
> And then the code that writes the current index format would plug into
> this and should be fairly small and easy to understand.
I'm pessimistic that anything that writes the current index format
could be "easy to understand", because the spec is so dreadfully
convoluted.
As I have argued before, the key is to have each Posting subclass
wholly define a file format. That makes them pluggable, breaking the
tight binding between the Lucene codebase and the Lucene file format
spec.
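
Concretely, picture the base class declaring codec hooks and each
subclass implementing them. (A sketch: the read/write signatures and
the delta-coding state are hypothetical, though IndexInput/IndexOutput
and VInt coding are real Lucene machinery.)

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

abstract class Posting {
    protected int docNum;
    abstract void read(IndexInput in) throws IOException;
    abstract void write(IndexOutput out) throws IOException;
}

// MatchPosting alone decides how its data hits disk -- here, a
// delta-coded doc num. Swap the subclass and you swap the file format.
class MatchPosting extends Posting {
    private int lastDocNum = 0;   // illustrative delta baseline

    void read(IndexInput in) throws IOException {
        docNum = lastDocNum + in.readVInt();
        lastDocNum = docNum;
    }

    void write(IndexOutput out) throws IOException {
        out.writeVInt(docNum - lastDocNum);
        lastDocNum = docNum;
    }
}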
> Then there would also be plugins that just tap into the entire
> document (don't need inversion), like FieldsWriter.
Yes. Here's how things are set up in KS:

InvIndexer
  SegWriter
    DocWriter
    PostingsWriter
    LexWriter
    TermVectorsWriter
    // plug in more writers here?
Ideally, all of the writers under SegWriter would be subclasses of an
abstract SegDataWriter class, and each would implement addInversion()
and addSegment(). SegWriter.addDoc() would look something like this:

void addDoc(Document doc) {
    Inversion inversion = invert(doc);
    for (int i = 0; i < writers.length; i++) {
        writers[i].addInversion(inversion);
    }
}
In practice, three of the writers are required (one for term
dictionary/lexicon, one for postings, and one for some form of
document storage), but the design allows for plugging in additional
SegDataWriter subclasses.
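
A sketch of that abstract base, for concreteness. (addInversion() and
addSegment() come from the description above; finish() is my addition
for the flush step, and the argument types are guesses.)

import java.io.IOException;

abstract class SegDataWriter {
    // Consume one inverted document and buffer/write its data.
    abstract void addInversion(Inversion inversion);

    // Bulk-transfer data from an existing segment, e.g. during merges.
    abstract void addSegment(SegReader reader);

    // Hypothetical: flush buffered state and close this writer's files.
    abstract void finish() throws IOException;
}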
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/