Re: Flexible index format / Payloads Cont'd

Marvin Humphrey Thu, 29 Jun 2006 16:47:56 -0700


On Jun 29, 2006, at 2:22 PM, Michael Busch wrote:

  - Is there a concrete design?


Not that I am aware of.

I have the feeling, that many people are interested in having a
flexible index format. There are already various use cases:
  - Efficient parametric search

This comes at the expense of a significant file size increase and performance hit. Think of a book index that lists not only the page number but also a category.


  axle => 3, 67, 89, 244

vs...

  axle => 3 cars, 67 cars, 89 trucks, 244 cars

Scanning through the latter is going to be more expensive. It might be worth it in specific cases, but it's not the long-hoped-for panacea that would give Lucene all the features of an RDBMS without incurring any costs. :)
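
To put rough numbers on the book-index comparison, here is a minimal sketch. It assumes delta-encoded page numbers written as variable-length integers (7 bits per byte, low-order bits first) and a one-byte category id per entry; the function names and the category table are hypothetical, not Lucene code.

```python
def vint(n):
    """Encode a non-negative int as a variable-length byte string."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_plain(pages):
    """Delta-encode page numbers only."""
    out, prev = bytearray(), 0
    for page in pages:
        out += vint(page - prev)
        prev = page
    return bytes(out)


def encode_with_category(entries, categories):
    """Delta-encode page numbers, each followed by a one-byte category id."""
    out, prev = bytearray(), 0
    for page, category in entries:
        out += vint(page - prev)
        out.append(categories.index(category))
        prev = page
    return bytes(out)


CATEGORIES = ["cars", "trucks"]  # hypothetical category table

plain = encode_plain([3, 67, 89, 244])
tagged = encode_with_category(
    [(3, "cars"), (67, "cars"), (89, "trucks"), (244, "cars")], CATEGORIES
)
print(len(plain), len(tagged))
```

Even with the cheapest possible category encoding, every entry grows by a byte, and a reader that only wants page numbers still has to step over the extra data.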

  - Part Of Speech (POS) annotations with each position

This is an example of where it might be worth it... to Grant, and Grant only.

Personally, I'm less interested in adding new features than I am in solidifying and improving the core.


The benefits I care about are:

  * Decouple Lucene from its file format.
    o Make back-compatibility easier.
    o Make refactoring easier.
    o All the other goodness that comes with loose coupling.
  * Improve IR precision, by writing a Boolean Scorer that
    takes position into account, a la Brin/Page '98.
  * Decrease time to launch a Searcher from rest.
  * Simplify Lucene, conceptually.
    o Indexes would have three parts: Term dictionary,
      Postings, and Storage.
    o Each part could be pluggable, following this format:
      <header><object>+
      * The de-serialization for each object is determined by
        a plugin spec'd in the header.
      * It's probably better to have separate header and data
        files.
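
A minimal sketch of the <header><object>+ idea, assuming the header simply names a codec and a registry maps that name to a decoding function. The registry, the codec names, and the fixed-width object layouts are all hypothetical, chosen only to keep the example short.

```python
import io
import struct


def read_u32(stream):
    """Decode one object: a single big-endian 32-bit integer."""
    data = stream.read(4)
    return struct.unpack(">I", data)[0] if len(data) == 4 else None


def read_u32_pair(stream):
    """Decode one object: a pair of big-endian 32-bit integers."""
    data = stream.read(8)
    return struct.unpack(">II", data) if len(data) == 8 else None


# Hypothetical plugin registry: codec name in the header -> decoder.
DECODERS = {"u32": read_u32, "u32pair": read_u32_pair}


def read_objects(header, data):
    """Decode every object in `data` with the plugin named in `header`.

    `header` and `data` stand in for the separate header and data files.
    """
    decode = DECODERS[header.decode("ascii").strip()]
    stream = io.BytesIO(data)
    objects = []
    while True:
        obj = decode(stream)
        if obj is None:  # end of data
            break
        objects.append(obj)
    return objects


raw = struct.pack(">II", 7, 9)
print(read_objects(b"u32\n", raw))
print(read_objects(b"u32pair\n", raw))
```

The same bytes decode differently depending on the header, which is the point: the reading code never hard-codes the layout of the objects.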

I would suggest to split up the whole work to have smaller work items
and to have clearly defined milestones. Thus I suggest the
following steps:
1. Introduce postings file with the following format:
  <DocDelta, Payload>*
    DocDelta --> VInt
    DocDelta/2 is the difference between this document number and
    the previous document number.
    Payload --> Byte, if DocDelta is even
    Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
      Payload_Length --> VInt
      Payload_Data   --> Byte^Payload_Length

Good stuff! Now, if you put that whole thing in a plugin, you'll have the chance to refine it even after deployment if you think of a way to improve it -- by adding another plugin. And, if it becomes too unwieldy and inflexible, you're not stuck with it.
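
For concreteness, the proposed <DocDelta, Payload>* format could be sketched roughly as follows. This is one reading of the spec above, not an implementation from the thread: it assumes the first document's delta is taken relative to 0, that VInts carry 7 bits per byte with the low-order bits first, and that an empty payload is written in the length-prefixed form. All function names are hypothetical.

```python
def write_vint(out, n):
    """Append a non-negative int to `out` as a variable-length integer."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_vint(buf, pos):
    """Read a variable-length integer at `pos`; return (value, next_pos)."""
    shift = value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_postings(postings):
    """postings: list of (doc_number, payload_bytes), doc numbers ascending."""
    out, prev = bytearray(), 0
    for doc, payload in postings:
        delta = doc - prev
        if len(payload) == 1:
            write_vint(out, delta << 1)        # even: single-byte payload
        else:
            write_vint(out, (delta << 1) | 1)  # odd: length-prefixed payload
            write_vint(out, len(payload))
        out += payload
        prev = doc
    return bytes(out)


def decode_postings(buf):
    """Inverse of encode_postings."""
    pos, doc, result = 0, 0, []
    while pos < len(buf):
        code, pos = read_vint(buf, pos)
        doc += code >> 1                       # DocDelta/2 is the doc gap
        if code & 1:
            length, pos = read_vint(buf, pos)
        else:
            length = 1
        result.append((doc, buf[pos:pos + length]))
        pos += length
    return result


sample = [(3, b"\x01"), (67, b"cars"), (89, b""), (244, b"\x02")]
encoded = encode_postings(sample)
print(decode_postings(encoded) == sample)
```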

  Furthermore, it should be possible to enable/disable payloads
  on field level.

Maybe each field should get its own file, and its own encoding/decoding object. Then you don't have to check each object/record to see which codec to use.

Or maybe there should be an array of codec objects, indexed by field number.


  fieldNum = input->readVint();
  decoders[fieldNum]->read(input);
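
Fleshed out a little, the dispatch above might look like the sketch below. The two decoder classes and their record layouts are made up purely for illustration; only the "read a field number, then hand the stream to that field's decoder" pattern comes from the two lines above.

```python
import io


def read_vint(stream):
    """Read a variable-length integer (7 bits per byte, low bits first)."""
    shift = value = 0
    while True:
        byte = stream.read(1)[0]
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value
        shift += 7


class FreqDecoder:
    """Hypothetical codec: the record is a single VInt frequency."""

    def read(self, stream):
        return {"freq": read_vint(stream)}


class BoostDecoder:
    """Hypothetical codec: the record is a one-byte boost scaled to 0..1."""

    def read(self, stream):
        return {"boost": stream.read(1)[0] / 255.0}


# One decoder per field, indexed by field number.
decoders = [FreqDecoder(), BoostDecoder()]


def read_record(stream):
    """Read the field number, then let that field's decoder do the rest."""
    field_num = read_vint(stream)
    return field_num, decoders[field_num].read(stream)


stream = io.BytesIO(bytes([0, 5, 1, 255]))
first = read_record(stream)
second = read_record(stream)
print(first, second)
```

The cost of this layout is one extra VInt per record, which is what giving each field its own file would avoid.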

2. Add multilevel skipping (tree structure) for the postings-file.
  One-level skipping, as being used now in Lucene, is probably
  not efficient enough for the new postings file, because it can
  be very big. Question: Should we include skipping information
  directly in the postings file, or should we introduce a new file
  containing the skipping infos? I think it should improve cache
  performance to have the skip tree in a different file.

Interesting. I think I'd punt and leave it up to the plugin. Maybe you'd have an extra large header if there was a lot of stuff to be cached.
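
As a rough picture of what multilevel skipping means, here is an in-memory sketch, assuming level k simply samples every interval**(k+1)-th posting. It says nothing about where the skip data should live on disk, which is the open question; the function names and parameters are hypothetical.

```python
def build_skip_levels(docs, interval=4, max_levels=3):
    """Level k holds the doc number of every interval**(k+1)-th posting."""
    levels = []
    step = interval
    while len(levels) < max_levels and step < len(docs):
        levels.append(docs[::step])
        step *= interval
    return levels


def skip_to(docs, levels, interval, target):
    """Return the index of the first doc >= target, or len(docs)."""
    idx = 0
    # Start at the coarsest level and refine the position level by level.
    for k in range(len(levels) - 1, -1, -1):
        step = interval ** (k + 1)
        level = levels[k]
        j = idx // step  # idx is always a multiple of this level's step
        while j + 1 < len(level) and level[j + 1] < target:
            j += 1
        idx = j * step
    # Finish with a short linear scan over the postings themselves.
    while idx < len(docs) and docs[idx] < target:
        idx += 1
    return idx


docs = list(range(0, 200, 3))  # 67 ascending doc numbers
levels = build_skip_levels(docs)
print(skip_to(docs, levels, 4, 100))
```

With one level, a skip costs a scan over roughly n/interval entries; each additional level cuts that scan down by another factor of interval, at the price of more skip data to store and cache.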

3. Optional: Add a type-system for the payloads to make it
  easier to develop PostingsWriter/Reader plugins.

IMO, this should wait. It's going to be freakishly difficult to get this stuff to work and maintain the commitments that Doug has laid out for backwards compatibility. There are also going to be trade-offs, and so I'd anticipate contentious, interminable debate along the lines of the recent Java 1.4/1.5 thread once there's real code and it becomes clear who's lost a clock tick or two.

Actually, I think pushing this forward is going to be so difficult that I'll be focusing my attention on implementing it elsewhere.

4. Make the PostingsWriter/Reader pluggable and develop default
  PostingsWriter/Reader plugins, that store frequencies, positions,
  and norms as payloads in the postings file. Should be configurable,
  to enable the different options Doug suggested:
  a. <doc>+
  b. <doc, boost>+
  c. <doc, freq, <position>+ >+
  d. <doc, freq, <position, boost>+ >+


Got any ideas as to how the Field constructors should look?

5. Develop new or extend existing PostingsWriter/Reader plugins for
  desired features like XML search, POS, multi-faceted search, ...

People will definitely want to scratch their own itches, but I'd argue that this stuff should start out private. And maybe stay that way!


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

