Marvin Humphrey wrote:

Personally, I'm less interested in adding new features than I am in solidifying and improving the core.

The benefits I care about are:

  * Decouple Lucene from its file format.
    o Make back-compatibility easier.
    o Make refactoring easier.
    o All the other goodness that comes with loose coupling.
  * Improve IR precision, by writing a Boolean Scorer that
    takes position into account, a la Brin/Page '98.
  * Decrease time to launch a Searcher from rest.
  * Simplify Lucene, conceptually.
    o Indexes would have three parts: Term dictionary,
      Postings, and Storage.
    o Each part could be pluggable, following this format:
      <header><object>+
      * The de-serialization for each object is determined by
        a plugin spec'd in the header.
      * It's probably better to have separate header and data
        files.

3. Optional: Add a type-system for the payloads to make it
  easier to develop PostingsWriter/Reader plugins.

IMO, this should wait. It's going to be freakishly difficult to get this stuff to work and maintain the commitments that Doug has laid out for backwards compatibility. There's also going to be trade-offs, and so I'd anticipate contentious, interminable debate along the lines of the recent Java 1.4/1.5 thread once there's real code and it becomes clear who's lost a clock tick or two.

Actually, I think pushing this forward is going to be so difficult, that I'll be focusing my attentions on implementing it elsewhere.

I understand that backward compatibility is a big concern. Doug pointed
out that Y.X+1 versions should be backward compatible with Y.X. The
things we are talking about (fundamental changes to the index data
structures, plugins) would break compatibility, so they should be
targeted for Lucene 3.

To have payloads in an earlier 2.X release, we could take a simpler route
and use the implementation I've written so far, which I'll finish soon.
Below I describe this implementation in detail.

* File changes
  - Field Infos
    I'm using the 6th lowest order Bit of FieldBits, which is currently
    unused, to store whether payloads are enabled for a certain field.
  - Positions file
    For fields with payloads disabled, the format of the positions file
    does not change at all. If payloads are enabled, then a
    variable-length payload is stored with each position:

    ProxFile (.prx) --> <TermPositions>^TermCount
    TermPositions   --> <Positions>^DocFreq
    Positions       --> <PositionDelta, Payload>^Freq
    PositionDelta   --> VInt
    Payload         --> Byte+

    Encoding of the Payload:
    If the payload is only one byte long then
       - if the value of the byte is <128, then this byte is stored as is
       - if the value of the byte is >=128, then a byte 10000001 (0x81)
         is stored, followed by the payload byte itself
    If the payload is longer than one byte but shorter than 127 bytes then
       - a byte (0x80 | length) is stored, followed by the payload bytes
    If the payload length is >=127 then
       - the payload_length-127 is stored as a VInt, followed by the payload
         bytes
    If the payload length is 0, then
       - one byte 0x80 is stored. This is being done to distinguish a
         payload with length=0 from a payload with length=1 and value=0
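
For illustration, the writer side of the rules above could be sketched
roughly as follows. This is not the patch itself: the class, constant, and
method names (PayloadEncodingSketch, STORE_PAYLOADS_BIT, payloadsEnabled,
encode, writeVInt) are made up here, and the sketch only follows the
listed rules literally for writing. The VInt layout is the usual Lucene
one (seven bits per byte, low-order group first, high bit set on every
byte except the last). How a reader tells the cases apart is not covered.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of the payload encoding rules listed above; names are hypothetical. */
final class PayloadEncodingSketch {

    /** Assumed mask for the "6th lowest order bit" of FieldBits. */
    static final int STORE_PAYLOADS_BIT = 1 << 5; // 0x20

    /** True if the payload bit is set in a field's FieldBits byte. */
    static boolean payloadsEnabled(byte fieldBits) {
        return (fieldBits & STORE_PAYLOADS_BIT) != 0;
    }

    /** Writes a non-negative int as a VInt: 7 bits per byte, low-order first. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, high bit clear
    }

    /** Encodes one payload following the four cases in the description. */
    static byte[] encode(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int len = payload.length;
        if (len == 0) {
            out.write(0x80);                 // empty payload marker
        } else if (len == 1) {
            int b = payload[0] & 0xFF;
            if (b >= 128) {
                out.write(0x81);             // escape for a single byte >= 128
            }
            out.write(b);                    // the payload byte itself
        } else if (len < 127) {
            out.write(0x80 | len);           // length folded into one byte
            out.write(payload, 0, len);
        } else {
            writeVInt(out, len - 127);       // long payload: VInt(length - 127)
            out.write(payload, 0, len);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new byte[] {1, 2, 3});
        System.out.println(encoded.length + " bytes written");
    }
}
```

The one-byte cases are what keep the overhead at zero for small values
such as a single flag or boost byte per position.
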
* API changes
  - org.apache.lucene.index.Payload
    Added this class with the following constructor and getter method:
    * public Payload(byte[] value);
    * public byte[] getValue();

  - org.apache.lucene.analysis.Token
    Added two new constructors and getter/setter:
    * public Token(String text, int start, int end, Payload payload);
    * public Token(String text, int start, int end, String typ,
                   Payload payload);
    * public Payload getPayload();
    * public void setPayload(Payload payload);


  - org.apache.lucene.document.Field
    Added PayloadParameter.YES/.NO to indicate whether Field stores payloads
    and added new constructors to create a field with payloads enabled:
    * public Field(String name, String value, Store store, Index index,
                   TermVector termVector, PayloadParameter payloadParam);
    * public Field(String name, String value, Store store, Index index,
                   TermVector termVector, Payload payload);
    * public Field(String name, Reader reader, TermVector termVector,
                   PayloadParameter payloadParam);

    Furthermore:
    * public Payload getPayload();
    * public boolean isPayloadStored();

  - org.apache.lucene.index.TermPositions
    Added the new method:
    * public Payload getPayload() throws IOException;
    Remark: In contrast to nextPosition(), this method does not move the
            pointer in the prox file. Therefore it should always be called
            after nextPosition().
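
The call order in the remark can be illustrated with a small in-memory
stand-in. This is not Lucene code: PositionsCursorSketch is a hypothetical
class, it returns a raw byte[] instead of a Payload object to stay
self-contained, and throwing an exception when getPayload() is called
before nextPosition() is an assumption made here, since the description
does not say what happens in that case.

```java
/** In-memory stand-in for the nextPosition()/getPayload() call order. */
final class PositionsCursorSketch {
    private final int[] positions;
    private final byte[][] payloads;
    private int current = -1; // nothing has been read yet

    PositionsCursorSketch(int[] positions, byte[][] payloads) {
        this.positions = positions;
        this.payloads = payloads;
    }

    /** Advances to the next position and returns it. */
    int nextPosition() {
        current++;
        return positions[current];
    }

    /** Returns the payload of the current position without advancing. */
    byte[] getPayload() {
        if (current < 0) {
            // Assumed behavior; the description only says "call it afterwards".
            throw new IllegalStateException("call nextPosition() first");
        }
        return payloads[current];
    }

    public static void main(String[] args) {
        PositionsCursorSketch tp = new PositionsCursorSketch(
            new int[] {3, 7}, new byte[][] {{1}, {2}});
        for (int i = 0; i < 2; i++) {
            int pos = tp.nextPosition();     // moves the cursor
            byte[] payload = tp.getPayload(); // reads at the cursor
            System.out.println(pos + " -> " + payload[0]);
        }
    }
}
```
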


So adding this payload feature to the Lucene core for a release 2.X
is not a big risk in my opinion for the following reasons:
  - API only extended
  - Lucene 2.X will be able to read an index created with an earlier
    version, because the Payload bit in FieldInfos will always be 0 then.
  - Payloads are disabled by default. They will only be enabled by using the
    new API.
  - If Payloads are disabled, then Lucene 2.0 is able to read an index
    created with Lucene 2.X, because the file formats don't change at all in
    that case.

So we could go ahead and add this to 2.X and keep working on the more
fundamental changes for Lucene 3. Sounds like a plan?



5. Develop new or extend existing PostingsWriter/Reader plugins for
  desired features like XML search, POS, multi-faceted search, ...

People will definitely want to scratch their own itches, but I'd argue that this stuff should start out private. And maybe stay that way!

I agree with that. We should focus on improving the Lucene core and start
offering a flexible payload mechanism, so that people can start developing
their own stuff. Later, if people submit good solutions, those might be
good candidates for contrib.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Regards,
 Michael Busch
