On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote:

Can't you compartmentalize while still serializing skip data into the
single frq/prx file?

Yes, that's possible.

The way KS is set up right now, PostingList objects maintain i/o state, and Posting's Read_Record() method just deals with whatever instream gets passed to it. If the PostingList were to sneak in the reading of a skip packet, the Posting would be none the wiser.
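
To make that concrete, here's a rough sketch of one way it could look -- hypothetical Java, with names loosely modeled on KS (Posting, PostingList, readRecord) rather than actual KinoSearch or Lucene classes, and an invented skip-packet layout:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical per-field decoder: reads one record from wherever the
    // stream happens to be positioned.
    interface Posting {
        void readRecord(IndexInput in) throws IOException;
    }

    // Hypothetical container: owns the stream and all i/o state.  If it
    // quietly consumes a skip packet whenever the file pointer reaches one,
    // the Posting never sees anything but its own records.
    class PostingList {
        private final IndexInput in;    // single interleaved frq/prx/skip stream
        private final Posting posting;  // pluggable record decoder
        private long nextSkipFP;        // file pointer of the next skip packet

        PostingList(IndexInput in, Posting posting, long firstSkipFP) {
            this.in = in;
            this.posting = posting;
            this.nextSkipFP = firstSkipFP;
        }

        void next() throws IOException {
            if (in.getFilePointer() == nextSkipFP) {
                nextSkipFP = readSkipPacket();  // container-private concern
            }
            posting.readRecord(in);             // the Posting is none the wiser
        }

        // Invented layout: a vInt byte length, opaque skip data, then the
        // file pointer of the following skip packet.
        private long readSkipPacket() throws IOException {
            int len = in.readVInt();
            in.seek(in.getFilePointer() + len);
            return in.readVLong();
        }
    }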

This is analogous to how videos are encoded.  E.g. the AVI file format
is a "container" format, and it contains packets of video and packets
of audio, interleaved at the right rate so a player can play both in
sync.  The "container" has no idea how to decode the audio and video
packets.  Separate codecs do that.

Taking this back to Lucene, there's a container format that, using
TermInfo, knows where the frq/prx data (packet) is and where the skip
data (packet) is.  And it calls on separate decoders to decode each.

This is an intriguing proposal.  :)
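
If I'm reading the proposal right, its shape might be something like this -- again hypothetical Java, where PacketDecoder and TermPackets are invented names and the pointer bookkeeping is just for illustration:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical decoder: knows how to decode one kind of packet, but
    // nothing about where packets live in the file.
    interface PacketDecoder {
        void decode(IndexInput in, long offset, long length) throws IOException;
    }

    // Hypothetical container: uses TermInfo-style pointers to locate each
    // packet and hands it to the matching decoder -- the demuxer, in the
    // video analogy.
    class TermPackets {
        private final IndexInput in;
        private final PacketDecoder postingsDecoder;  // frq/prx packets
        private final PacketDecoder skipDecoder;      // skip packets

        TermPackets(IndexInput in, PacketDecoder postings, PacketDecoder skip) {
            this.in = in;
            this.postingsDecoder = postings;
            this.skipDecoder = skip;
        }

        // The pointers and lengths would come from the term dictionary.
        void readTerm(long frqPrxPtr, long frqPrxLen,
                      long skipPtr, long skipLen) throws IOException {
            postingsDecoder.decode(in, frqPrxPtr, frqPrxLen);
            skipDecoder.decode(in, skipPtr, skipLen);
        }
    }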

The dev branch of KS currently uses oodles of per-segment files for the lexicon and the postings:

  * One postings file per field per segment.       [SEGNAME-FIELDNUM.p]
  * One lexicon file per field per segment.        [SEGNAME-FIELDNUM.lex]
  * One lexicon index file per field per segment.  [SEGNAME-FIELDNUM.lexx]
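
So for a segment named "seg_4" with field numbers 1 and 2 (hypothetical names -- the exact naming scheme doesn't matter), you'd get:

    seg_4-1.p    seg_4-1.lex    seg_4-1.lexx
    seg_4-2.p    seg_4-2.lex    seg_4-2.lexx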

Having so many files is something of a drawback, but it means that each individual file can be very specialized, and that yields numerous benefits:

  * Each file has a simple format.
  * File format spec is easier to write and understand.
  * Formats are pluggable.
      o Easy to deprecate.
      o Easy to isolate within a single class.
  * PostingList objects are always single-field.
      o Simplified internals.
          * No field numbers to track.
          * Repeat one read operation to scan the whole file (sketched after this list).
      o Pluggable using subclasses of Posting.
      o Fewer subclasses (e.g. SegmentTermPositions is not needed).
  * Lexicon objects are always single-field.
      o Simplified internals.
          * No field numbers to track.
          * Repeat one read operation to scan the whole file.
      o Possible to extend with custom per-field sorting at index-time.
      o Easier to extend to non-text terms.
          * Comparison ops guaranteed to see like objects.
  * Stream-related errors are comparatively easy to track down.
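
The "repeat one read operation" point above, as a sketch -- hypothetical names, with the Posting decoder from the earlier sketch re-declared here so the snippet stands alone:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical per-field decoder, as in the earlier sketch.
    interface Posting { void readRecord(IndexInput in) throws IOException; }

    // With one field per file, scanning is a single loop with no field-number
    // checks: the same read operation repeats from the start of the file to EOF.
    class SingleFieldScanner {
        void scan(IndexInput in, Posting posting) throws IOException {
            final long end = in.length();
            while (in.getFilePointer() < end) {
                posting.readRecord(in);
            }
        }
    }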

Some of these benefits are preserved when reading from a single stream. However, there are some downsides:

  * Container classes like PostingList more complex.
      o No longer single-field.
      o Harder to detect overruns that would have been EOF errors.
      o Easier to lose stream sync (see the sketch after this list).
      o Periodic sampling for index records more complex.
  * Tricky to prevent inappropriate compareTo ops at boundaries.
  * Harder to troubleshoot.
      o Glitch in one plugin can manifest as an error somewhere else.
      o Hexdump nearly impossible to interpret.
      o Mentally taxing to follow like packets in an interleaved stream.
  * File corruption harder to recover from.
      o Only as reliable as the weakest plugin.
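
To illustrate the overrun/sync point above (a hypothetical helper, reusing the invented Posting type): in a dedicated file, walking past your data is an EOF error right where the bug is; in a shared stream, the same bug silently reads into the next packet.

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical per-field decoder, as in the earlier sketches.
    interface Posting { void readRecord(IndexInput in) throws IOException; }

    class BoundedPacketReader {
        // With its own file, packetEnd is simply the file length and any
        // overrun hits EOF.  In a shared stream, the best we can do is check
        // the boundary after the fact and hope the drift gets caught here
        // rather than surfacing as a weird error in the next packet's decoder.
        void readPacket(IndexInput in, Posting posting, long packetEnd)
                throws IOException {
            while (in.getFilePointer() < packetEnd) {
                posting.readRecord(in);
            }
            if (in.getFilePointer() != packetEnd) {
                throw new IOException("packet overrun: decoder out of sync");
            }
        }
    }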

Benefits of the single stream include:

  * Fewer hard disk seeks.
  * Far fewer files.

If you're using Lucene's non-compound file format, having far fewer files could be a significant benefit depending on the OS. But here's the thing:

If you're using a custom virtual file system a la Lucene's compound files, what's the difference between divvying up data using filenames within the CompoundFileReader object, and divvying up data downstream in some other object using some ad hoc mechanism?
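
Spelled out (a hypothetical structure, not Lucene's actual CompoundFileReader internals): a virtual file system is just a catalog mapping logical names onto byte ranges within one physical file, so either way you're carving up one stream with offsets -- the catalog merely standardizes the carving.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical compound-file catalog: a logical "file" is nothing more
    // than a named (offset, length) slice of one physical file.
    class VirtualFileCatalog {
        static class Slice {
            final long offset;
            final long length;
            Slice(long offset, long length) {
                this.offset = offset;
                this.length = length;
            }
        }

        private final Map<String, Slice> entries = new HashMap<String, Slice>();

        void put(String name, long offset, long length) {
            entries.put(name, new Slice(offset, length));
        }

        Slice lookup(String name) {
            return entries.get(name);
        }
    }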

My conclusion was that it was better to exploit the benefits of bounded, single-purpose streams and simple file formats whenever possible.

There's also a middle way, where each *format* gets its own file. Then you wind up with fewer files, but you have to track field number state.
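
Sketched out (hypothetical names, same invented Posting type as earlier), the extra state looks like this:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical per-field decoder, as in the earlier sketches.
    interface Posting { void readRecord(IndexInput in) throws IOException; }

    // One postings file shared by all fields: the reader has to track which
    // field it is currently in and pick the matching per-field decoder --
    // state that the one-file-per-field design never needs.
    class MultiFieldPostings {
        private final IndexInput in;
        private final Map<Integer, Posting> decoders;  // field number -> decoder
        private int currentField = -1;

        MultiFieldPostings(IndexInput in, Map<Integer, Posting> decoders) {
            this.in = in;
            this.decoders = decoders;
        }

        // The field number would come from the term dictionary as the scan
        // crosses a field boundary.
        void nextRecord(int fieldNum) throws IOException {
            currentField = fieldNum;
            decoders.get(currentField).readRecord(in);
        }
    }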

The nice thing is that packet-scoped plugins can be compatible with ALL of these configurations:

This way we can decouple the question of "how many files do I store my
things in" from "how is each thing encoded/decoded".  Maybe I want
frq/prx/skip all in one file, or maybe I want them in 3 different files.


Well said.

The second problem is how to share a term dictionary over a cluster. It would be nice to be able to plug modules into IndexReader that represent clusters of machines but that are dedicated to specific tasks: one cluster could be dedicated to fetching full documents and applying highlighting; another cluster could be dedicated to scanning through postings and finding/scoring hits; a third cluster could store the entire term dictionary in RAM.

A centralized term dictionary held in RAM would be particularly handy for sorting purposes. The problem is that the file pointers of a term dictionary are specific to indexes on individual machines. A shared dictionary in RAM would have to contain pointers for *all* clients, which isn't really workable.

So, just how do you go about assembling task-specific clusters? The stored documents cluster is easy, but the term dictionary and the postings are hard.

Phew!  This is way beyond what I'm trying to solve now :)

Hmm. It doesn't look that difficult from my perspective. The problem seems reasonably well isolated and contained. But I've worked hard to make KS modular, so perhaps there's less distance left to travel.
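
Concretely (hypothetical types, just to show the shape of the problem): the term itself is shareable, but the payload of a term dictionary entry is a file pointer into one machine's own segment files, so a centralized dictionary ends up needing a pointer per client node.

    import java.util.Map;

    // Hypothetical entry for a cluster-wide term dictionary.
    class SharedTermEntry {
        final String field;
        final String termText;
        // nodeId -> file pointer, valid only on that node's local index
        final Map<String, Long> postingsPointerByNode;

        SharedTermEntry(String field, String termText,
                        Map<String, Long> postingsPointerByNode) {
            this.field = field;
            this.termText = termText;
            this.postingsPointerByNode = postingsPointerByNode;
        }
    }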

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

