Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
>
> > I've actually been working on factoring DocumentsWriter, as a first
> > step towards flexible indexing.
> >
>
> The way I handled this in KS was to turn Posting into a class akin to
> TermBuffer: the individual Posting object persists, but its values change.
>
> Meanwhile, each Posting subclass has a Read_Raw method which generates a
> "RawPosting". RawPosting objects are a serialized, sortable, lowest common
> denominator form of Posting which every subclass must be able to export.
> They're allocated from a specialized MemoryPool, making them cheap to
> manufacture and to release.
>
> RawPosting is the only form PostingsWriter is actually required to know
> about:
>
> // PostingsWriter loop:
> RawPosting rawPosting;
> while ((rawPosting = rawPostingQueue.pop()) != null) {
>     writeRawPosting(rawPosting);
> }
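To make sure I follow the "serialized, sortable" idea, here's a rough Java sketch of what a RawPosting might carry; the field names and constructor are my guesses, not KS's actual API:

```java
// Rough sketch (not KS's real API): a RawPosting pairs its term text and
// doc number -- the sort keys -- with an opaque serialized body.
class RawPosting implements Comparable<RawPosting> {
    final String termText;  // primary sort key
    final int docNum;       // secondary sort key
    final byte[] encoded;   // lowest-common-denominator serialized payload

    RawPosting(String termText, int docNum, byte[] encoded) {
        this.termText = termText;
        this.docNum = docNum;
        this.encoded = encoded;
    }

    // Sort by term text first, then by doc number.
    public int compareTo(RawPosting other) {
        int cmp = termText.compareTo(other.termText);
        return cmp != 0 ? cmp : Integer.compare(docNum, other.docNum);
    }
}
```

Sorting a batch of these gives you term-packed, doc-ordered output without the writer knowing anything about the subclass that produced them.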
>
> > I agree we would have an abstract base Posting class that just tracks
> > the term text.
> >
>
> IMO, the abstract base Posting class should not track text. It should
> include only one datum: a document number. This keeps it in line with the
> simplest IR definition for a "posting": one document matching one term.
But how do you then write out a segment with the terms packed, in
sorted order? Your "generic" layer needs to know how to sort these
Posting lists by term text, right?
> Posting: doc num (abstract)
> MatchPosting: doc num
> ScorePosting: doc num, freq, per-doc boost, positions
> RichPosting: doc num, freq, positions with per-position boost
> PayloadPosting: doc num, payload
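For concreteness, the hierarchy you list might sketch out like this in Java; only the class names and their listed fields come from your description, everything else is my own guesswork:

```java
// Sketch of the posting hierarchy from the list above; accessor names
// and the exact field types are assumptions.
abstract class Posting {
    int docNum;  // the single datum the abstract base carries

    int getDocNum() { return docNum; }
}

// MatchPosting: doc num only -- enough for boolean matching.
class MatchPosting extends Posting { }

// ScorePosting: doc num, freq, per-doc boost, positions.
class ScorePosting extends Posting {
    int freq;
    float boost;
    int[] positions;

    int getFreq() { return freq; }
}
```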
OK I now see that what we call Posting really should be called
PostingList: each instance of this class, in DW, tracks all documents
that contained that term. Whereas for KS, Posting is a single
occurrence of term in a single doc, right? Does a Posting contain all
occurrences of the term in the doc (multiple positions) or only one?
How do you do buffering/flushing? After each document do you re-sweep
your Posting instances and write them into a single segment? Or do you
accumulate many of these Posting instances (so many docs are held in
this form) and flush to disk when RAM is full?
> Then, for search-time you have a PostingList class which takes the place of
> TermDocs/TermPositions, and uses an underlying Posting object to read the
> file. (PostingList and its subclasses don't know anything about file
> formats.)
Wouldn't PostingList need to know something of the file format? EG
maybe it's a sparse format (docID or gap encoded each time), or it's
non-sparse (like norms, or column-stride fields).
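For example, with a sparse gap-encoded stream the reader has to keep state, since each stored value is a delta from the previously decoded docID; a minimal sketch (all names hypothetical):

```java
// Minimal sketch of stateful gap decoding: each stored value is the
// delta from the previously decoded docID. Names are hypothetical.
class GapDecoder {
    private int docID = 0;

    // Consume the next stored gap and return the absolute docID.
    int next(int gap) {
        docID += gap;
        return docID;
    }
}
```

A non-sparse format (one value per document, like norms) needs no such state, which is why the reading layer can't be fully agnostic about the encoding.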
> Each Posting subclass is associated with a subclass of TermScorer which
> implements its own Posting-subclass-specific scoring algorithm.
>
> // MatchPostingScorer scoring algo ...
> while (postingList.next()) {
>     MatchPosting posting = (MatchPosting)postingList.getPosting();
>     collector.collect(posting.getDocNum(), 1.0f);
> }
>
> // ScorePostingScorer scoring algo...
> while (postingList.next()) {
>     ScorePosting posting = (ScorePosting)postingList.getPosting();
>     int freq = posting.getFreq();
>     float score = freq < TERMSCORER_SCORE_CACHE_SIZE
>         ? scoreCache[freq]              // cache hit
>         : sim.tf(freq) * weightValue;   // compute on cache miss
>     collector.collect(posting.getDocNum(), score);
> }
>
>
> > And then the code that writes the current index format would plug into
> > this and should be fairly small and easy to understand.
> >
>
> I'm pessimistic that anything that writes the current index format
> could be "easy to understand", because the spec is so dreadfully convoluted.
I'm quite a bit more optimistic here.
> As I have argued before, the key is to have each Posting subclass wholly
> define a file format. That makes them pluggable, breaking the tight binding
> between the Lucene codebase and the Lucene file format spec.
It's not just Posting that defines the file format. Things like
stored fields, norms, and column-stride fields have nothing to do with
inversion. So these writers/readers should "plug in" at a layer above
the inversion? OK, I see these below:
> > Then there would also be plugins that just tap into the entire
> > document (don't need inversion), like FieldsWriter.
> >
>
>
> Yes. Here's how things are set up in KS:
>
> InvIndexer
> SegWriter
> DocWriter
> PostingsWriter
> LexWriter
> TermVectorsWriter
> // plug in more writers here?
>
> Ideally, all of the writers under SegWriter would be subclasses of an
> abstract SegDataWriter class, and would implement addInversion() and
> addSegment(). SegWriter.addDoc() would look something like this:
>
> void addDoc(Document doc) {
>     Inversion inversion = invert(doc);
>     for (int i = 0; i < writers.length; i++) {
>         writers[i].addInversion(inversion);
>     }
> }
I think TermVectorsWriter should be seen as a consumer of the
"inversion" plugin API. It's just that, unlike the frq/prx writer,
which flushes when RAM is full, the TermVectorsWriter flushes after
each doc. Ie, the generic code does the inversion, feeding "you"
Posting occurrences, and "you" write this to a file however you want.
> In practice, three of the writers are required (one for term
> dictionary/lexicon, one for postings, and one for some form of document
> storage), but the design allows for plugging in additional SegDataWriter
> subclasses.
OK.
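To check my understanding of the SegWriter design above, here's a hedged sketch of the abstract contract plus the fan-out loop; Inversion and SegReader are empty placeholders and every signature is a guess:

```java
import java.util.ArrayList;
import java.util.List;

class Inversion { }  // placeholder: the inverted form of one document
class SegReader { }  // placeholder: reader over an existing segment

// Every per-segment writer consumes inversions and can bulk-add a
// prior segment's data (for merging).
abstract class SegDataWriter {
    abstract void addInversion(Inversion inversion);
    abstract void addSegment(SegReader reader);
}

// SegWriter fans each document's inversion out to all registered
// writers -- postings, lexicon, stored fields, term vectors, etc.
class SegWriter {
    private final List<SegDataWriter> writers = new ArrayList<SegDataWriter>();

    void register(SegDataWriter writer) { writers.add(writer); }

    void addDoc(Inversion inversion) {
        for (SegDataWriter writer : writers) {
            writer.addInversion(inversion);
        }
    }
}
```

The point of the abstract class is exactly the pluggability Marvin describes: adding a new kind of per-segment data means registering one more SegDataWriter subclass, with no changes to SegWriter itself.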
Mike