On Apr 8, 2008, at 10:25 AM, Michael McCandless wrote:
> I've actually been working on factoring DocumentsWriter, as a first
> step towards flexible indexing.
The way I handled this in KS was to turn Posting into a class akin to
TermBuffer: the individual Posting object persists, but its values
change.
Meanwhile, each Posting subclass has a Read_Raw method which generates
a "RawPosting". RawPosting objects are a serialized, sortable, lowest
common denominator form of Posting which every subclass must be able
to export. They're allocated from a specialized MemoryPool, making
them cheap to manufacture and to release.
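
In Java terms, the shape is roughly this. (A hedged sketch: readRaw(),
MemoryPool, and the RawPosting internals are my illustrative names, not
real KS or Lucene APIs.)

// Sketch only. The persistent Posting object is reused as values change;
// readRaw() snapshots its current values into a pooled, sortable image.
abstract class Posting {
    protected int docNum;
    public int getDocNum() { return docNum; }

    // Serialize current values into a RawPosting allocated from the pool.
    abstract RawPosting readRaw(MemoryPool pool);
}

final class RawPosting {
    final byte[] bytes;     // serialized, lowest-common-denominator form
    RawPosting(byte[] bytes) { this.bytes = bytes; }
}

class MemoryPool { /* specialized allocator; placeholder */ }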
RawPosting is the only form PostingsWriter is actually required to
know about:

// PostingsWriter loop: drain the queue of serialized postings.
RawPosting rawPosting;
while ((rawPosting = rawPostingQueue.pop()) != null) {
    writeRawPosting(rawPosting);
}
> I agree we would have an abstract base Posting class that just tracks
> the term text.
IMO, the abstract base Posting class should not track text. It should
include only one datum: a document number. This keeps it in line with
the simplest IR definition for a "posting": one document matching one
term.
Posting: doc num (abstract)
  MatchPosting: doc num
  ScorePosting: doc num, freq, per-doc boost, positions
  RichPosting: doc num, freq, positions with per-position boost
  PayloadPosting: doc num, payload
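
In Java, that hierarchy might look like the following. (A sketch: the
field names and accessors, beyond the data listed above, are assumptions.)

abstract class Posting {
    protected int docNum;                  // the one shared datum
    public int getDocNum() { return docNum; }
}

class MatchPosting extends Posting {
    // doc num only: records that the doc matches the term
}

class ScorePosting extends Posting {
    protected int freq;
    protected float boost;                 // per-doc boost
    protected int[] positions;
    public int getFreq() { return freq; }
}

class RichPosting extends Posting {
    protected int freq;
    protected int[] positions;
    protected float[] positionBoosts;      // per-position boost
}

class PayloadPosting extends Posting {
    protected byte[] payload;
}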
Then, at search time, you have a PostingList class which takes the
place of TermDocs/TermPositions and uses an underlying Posting object
to read the file. (PostingList and its subclasses don't know anything
about file formats.)
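
Something like this, presumably. (Method names inferred from the scorer
loops below; not a real API.)

import java.io.IOException;

abstract class PostingList {
    // Advance to the next posting; return false when exhausted.
    public abstract boolean next() throws IOException;

    // The reusable Posting whose values were refreshed by next().
    public abstract Posting getPosting();
}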
Each Posting subclass is associated with a subclass of TermScorer
which implements its own Posting-subclass-specific scoring algorithm:

// MatchPostingScorer scoring algo: every hit gets a constant score.
while (postingList.next()) {
    MatchPosting posting = (MatchPosting) postingList.getPosting();
    collector.collect(posting.getDocNum(), 1.0f);
}

// ScorePostingScorer scoring algo: tf(freq) * weight, with a cache
// for small freq values.
while (postingList.next()) {
    ScorePosting posting = (ScorePosting) postingList.getPosting();
    int freq = posting.getFreq();
    float score = freq < TERMSCORER_SCORE_CACHE_SIZE
        ? scoreCache[freq]                // cache hit
        : sim.tf(freq) * weightValue;
    collector.collect(posting.getDocNum(), score);
}
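
For what it's worth, the scoreCache there would presumably be filled
once per scorer, along these lines (sim, weightValue, and
TERMSCORER_SCORE_CACHE_SIZE as in the loop above):

// Precompute tf(freq) * weight for small freqs so the hot loop
// can skip the Similarity call.
float[] scoreCache = new float[TERMSCORER_SCORE_CACHE_SIZE];
for (int f = 0; f < TERMSCORER_SCORE_CACHE_SIZE; f++) {
    scoreCache[f] = sim.tf(f) * weightValue;
}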
> And then the code that writes the current index format would plug into
> this and should be fairly small and easy to understand.
I'm pessimistic that anything that writes the current index format
could be "easy to understand", because the spec is so dreadfully
convoluted.
As I have argued before, the key is to have each Posting subclass
wholly define a file format. That makes them pluggable, breaking the
tight binding between the Lucene codebase and the Lucene file format
spec.
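
Concretely, picture the base class declaring codec hooks and each
subclass implementing them. (A sketch: the read/write signatures and
the delta-coding state are hypothetical, though IndexInput/IndexOutput
and VInt coding are real Lucene machinery.)

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

abstract class Posting {
    protected int docNum;
    abstract void read(IndexInput in) throws IOException;
    abstract void write(IndexOutput out) throws IOException;
}

// MatchPosting alone decides how its data hits disk -- here, a
// delta-coded doc num. Swap the subclass and you swap the file format.
class MatchPosting extends Posting {
    private int lastDocNum = 0;   // illustrative delta baseline

    void read(IndexInput in) throws IOException {
        docNum = lastDocNum + in.readVInt();
        lastDocNum = docNum;
    }

    void write(IndexOutput out) throws IOException {
        out.writeVInt(docNum - lastDocNum);
        lastDocNum = docNum;
    }
}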
> Then there would also be plugins that just tap into the entire
> document (don't need inversion), like FieldsWriter.
Yes. Here's how things are set up in KS:

InvIndexer
  SegWriter
    DocWriter
    PostingsWriter
    LexWriter
    TermVectorsWriter
    // plug in more writers here?
Ideally, all of the writers under SegWriter would be subclasses of an
abstract SegDataWriter class, and each would implement addInversion()
and addSegment(). SegWriter.addDoc() would look something like this:

void addDoc(Document doc) {
    Inversion inversion = invert(doc);
    for (int i = 0; i < writers.length; i++) {
        writers[i].addInversion(inversion);
    }
}
In practice, three of the writers are required (one for term
dictionary/lexicon, one for postings, and one for some form of
document storage), but the design allows for plugging in additional
SegDataWriter subclasses.
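
A sketch of that abstract base, for concreteness. (addInversion() and
addSegment() come from the description above; finish() is my addition
for the flush step, and the argument types are guesses.)

import java.io.IOException;

abstract class SegDataWriter {
    // Consume one inverted document and buffer/write its data.
    abstract void addInversion(Inversion inversion);

    // Bulk-transfer data from an existing segment, e.g. during merges.
    abstract void addSegment(SegReader reader);

    // Hypothetical: flush buffered state and close this writer's files.
    abstract void finish() throws IOException;
}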
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/