Hi everyone,

I work for IBM and recently started looking into Lucene.
I am very interested in the topic "flexible indexing / payloads",
which was discussed a couple of times in the last two months. I
searched the mailing lists and found several threads about this
topic, but those threads didn't really lead to a conclusion.
That's my reason for starting this new thread. I hope to find out:
  - Who is working on this feature?
  - Is there a concrete design?
  - Which functions/changes will the implementation include?
Furthermore, I would like to describe the work I did so far on
this feature.


To sum up the recent discussions, I'm going to list the different
threads about this topic:

--> There is a page in the Lucene Wiki to plan / track this topic:
   http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

--> May 08, 2006 - May 10, 2006 http://wiki.apache.org/jakarta-lucene/ConversationsBetweenDougMarvinAndGrant

   - Grant Ingersoll mentions that he is interested in working
     on this topic.
   - Doug suggests keeping docs, frequencies, positions, and
     norms in one postings file (freqs, pos, and norms optional).
     A suggested file format for such a postings file can be found
     on the Wiki page mentioned above.

--> May 28, 2006 - May 31, 2006 http://www.gossamer-threads.com/lists/lucene/java-dev/36039?search_string=lucene%20planning;#36039

   - Nadav Har'El suggests associating arbitrary data with each
     posting, i.e. storing a variable-length payload with each
     position, an idea Nadav and I discussed earlier. Doug
     voted +1 for this idea.

--> May 31, 2006 - Jun 2, 2006 http://www.gossamer-threads.com/lists/lucene/java-dev/36210?search_string=flexible%20indexing;#36210

   - Marvin Humphrey talks about a pluggable PostingsWriter/Reader
     to make the postings file customizable. Marvin goes a step
     further and suggests using plugins for other index files as well.

I have the feeling that many people are interested in having a
flexible index format. There are already various use cases:
  - Efficient parametric search
  - XML search
  - Part Of Speech (POS) annotations with each position
  - Multi-faceted search
  - ...

But I also have the feeling that no clear course of action has
been defined yet. The issue is quite complex: it is not easy to
generalize the index data structures to satisfy all use cases
while keeping Lucene as straightforward as it is today.


In the following I would like to describe the work I did so far
on this issue and propose a strategy on how to work on it in the
future to get the complexity under control.

I have made a prototype implementation of payloads. In my approach
I leave the frequency file as is and only change the positions file.
I can store a variable length payload (byte[]) with each position.
The payloads can be enabled/disabled on field level. The API changes
include:
 - a new Field constructor that takes a Payload as additional data
 - a Token stores a Payload, so an analyzer can produce tokens with
   arbitrary payloads
 - TermPositions got a getPayload() method

This prototype works very well, and we use it to play around with
multi-faceted search. But I think I should go a bit further and
merge the frequency and position files into a single postings file,
which seemed to be the consensus in the mailing list threads.


I suggest splitting the whole work into smaller work items with
clearly defined milestones. I propose the following steps:
1. Introduce postings file with the following format:
  <DocDelta, Payload>*
    DocDelta --> VInt
    DocDelta/2 is the difference between this document number and
    the previous document number.

    Payload --> Byte, if DocDelta is even
    Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
      Payload_Length --> VInt
      Payload_Data   --> Byte^Payload_Length

  Furthermore, it should be possible to enable/disable payloads
  on field level.
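
To make the format in step 1 concrete, here is a minimal Java sketch of
an encoder and decoder for it. It is only an illustration, not the code
of my prototype. The class PostingsFormatSketch and the Posting holder
are hypothetical names, and I assume that the first DocDelta is relative
to document number 0 and that the input is well formed. The VInt helpers
use the usual 7-bits-per-byte scheme with the high bit as continuation
flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the <DocDelta, Payload>* format; not Lucene code.
final class PostingsFormatSketch {

    /** One decoded posting: absolute document number plus its payload. */
    static final class Posting {
        final int doc;
        final byte[] payload;
        Posting(int doc, byte[] payload) { this.doc = doc; this.payload = payload; }
    }

    /** Writes a non-negative int, 7 bits per byte, low-order bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readByte(ByteArrayInputStream in) {
        int b = in.read();
        if (b < 0) throw new IllegalStateException("unexpected end of postings data");
        return b;
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = readByte(in);
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte(in);
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Postings must be sorted by document number. */
    static byte[] encode(List<Posting> postings) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previousDoc = 0; // assumption: first delta is relative to doc 0
        for (Posting p : postings) {
            int delta = p.doc - previousDoc;
            previousDoc = p.doc;
            if (p.payload.length == 1) {
                // Even DocDelta: exactly one payload byte follows.
                writeVInt(out, delta << 1);
                out.write(p.payload[0]);
            } else {
                // Odd DocDelta: Payload_Length and Payload_Data follow.
                writeVInt(out, (delta << 1) | 1);
                writeVInt(out, p.payload.length);
                out.write(p.payload, 0, p.payload.length);
            }
        }
        return out.toByteArray();
    }

    static List<Posting> decode(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<Posting> postings = new ArrayList<>();
        int doc = 0;
        while (in.available() > 0) {
            int code = readVInt(in);
            doc += code >>> 1; // DocDelta/2 is the document gap
            int length = (code & 1) == 0 ? 1 : readVInt(in);
            byte[] payload = new byte[length];
            for (int i = 0; i < length; i++) payload[i] = (byte) readByte(in);
            postings.add(new Posting(doc, payload));
        }
        return postings;
    }
}
```

The low bit of DocDelta saves the length VInt in the common case of a
one-byte payload, the same trick the current frequency file uses for
freq == 1.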

2. Add multilevel skipping (tree structure) for the postings file.
  One-level skipping, as used in Lucene today, is probably
  not efficient enough for the new postings file, because that
  file can become very big. Question: Should we include the skip
  information directly in the postings file, or should we introduce
  a new file for it? I think keeping the skip tree in a separate
  file should improve cache performance.
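
The idea behind multilevel skipping in step 2 can be sketched in memory
as follows. This is only an illustration under my own assumptions, not a
proposed file layout: the class name MultiLevelSkipSketch is
hypothetical, level k simply samples every skipInterval^(k+1)-th
posting, and the sketch says nothing about whether the skip data lives
in the postings file or in a separate file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory sketch of multilevel skipping; not Lucene code.
final class MultiLevelSkipSketch {
    private final int[] docs;                        // sorted document numbers
    private final List<int[]> levels = new ArrayList<>();   // sampled docs per level
    private final List<Integer> steps = new ArrayList<>();  // sampling step per level

    MultiLevelSkipSketch(int[] sortedDocs, int skipInterval) {
        if (skipInterval < 2) {
            throw new IllegalArgumentException("skipInterval must be >= 2");
        }
        this.docs = sortedDocs;
        // Level 0 samples every skipInterval-th posting, level 1 every
        // skipInterval^2-th posting, and so on.
        for (long step = skipInterval; step < docs.length; step *= skipInterval) {
            int count = (int) ((docs.length + step - 1) / step);
            int[] level = new int[count];
            for (int j = 0; j < count; j++) {
                level[j] = docs[(int) (j * step)];
            }
            levels.add(level);
            steps.add((int) step);
        }
    }

    /** Returns the index of the first posting with doc >= target, or docs.length. */
    int skipTo(int target) {
        int pos = 0;
        // Start at the coarsest level and refine on the way down.
        for (int l = levels.size() - 1; l >= 0; l--) {
            int[] level = levels.get(l);
            int step = steps.get(l);
            int j = pos / step;
            while (j + 1 < level.length && level[j + 1] < target) {
                j++;
            }
            pos = j * step;
        }
        // Finish with a short linear scan of at most skipInterval postings.
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos;
    }
}
```

With a skip interval of s, each level visits at most s entries, so a
skip costs roughly s * log_s(n) comparisons instead of n / s with a
single level.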

3. Optional: Add a type-system for the payloads to make it
  easier to develop PostingsWriter/Reader plugins.

4. Make the PostingsWriter/Reader pluggable and develop default
  PostingsWriter/Reader plugins that store frequencies, positions,
  and norms as payloads in the postings file. This should be
  configurable to enable the different options Doug suggested:

  a. <doc>+
  b. <doc, boost>+
  c. <doc, freq, <position>+ >+
  d. <doc, freq, <position, boost>+ >+
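
  One way a default writer plugin could map options a-d onto a payload
  is sketched below. Everything here is an assumption for illustration:
  the class PayloadLayoutSketch, the Layout enum, the one-byte boost,
  and the choice to store positions as VInt deltas are mine and are not
  part of any agreed design.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of packing options a-d into a payload; not Lucene code.
final class PayloadLayoutSketch {

    enum Layout {
        DOC,                // a. <doc>+
        DOC_BOOST,          // b. <doc, boost>+
        DOC_FREQ_POS,       // c. <doc, freq, <position>+ >+
        DOC_FREQ_POS_BOOST  // d. <doc, freq, <position, boost>+ >+
    }

    /** Writes a non-negative int, 7 bits per byte, low-order bits first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /**
     * Builds the payload for one document. Positions must be sorted.
     * Arguments that the chosen layout does not need are ignored.
     */
    static byte[] encode(Layout layout, byte docBoost,
                         int[] positions, byte[] positionBoosts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        switch (layout) {
            case DOC:
                break; // nothing besides the document number itself
            case DOC_BOOST:
                out.write(docBoost);
                break;
            case DOC_FREQ_POS:
            case DOC_FREQ_POS_BOOST:
                writeVInt(out, positions.length); // freq
                int previous = 0;
                for (int i = 0; i < positions.length; i++) {
                    writeVInt(out, positions[i] - previous); // position delta
                    previous = positions[i];
                    if (layout == Layout.DOC_FREQ_POS_BOOST) {
                        out.write(positionBoosts[i]);
                    }
                }
                break;
        }
        return out.toByteArray();
    }
}
```

  The resulting byte[] would then be stored as the Payload of the
  posting in the format from step 1.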

5. Develop new or extend existing PostingsWriter/Reader plugins for
  desired features like XML search, POS, multi-faceted search, ...


Please let me know what you think about my suggestions. If people
like this approach, then I can add the information to the Wiki
planning page and start working on it.


Best Regards,
 Michael Busch

