Hi Grant,

LUCENE-662 contains different ideas:
1) introduction of an index format concept
2) extensibility of the store reader/writer
3) New: extensibility of the posting reader/writer

IMO we should split this up; that way it will be easier to develop smaller patches that each focus on adding one particular feature. However, it is important to plan the API so that the different patches (like payloads) fit in. On the other hand, it will be nearly impossible to plan an API that is perfect and won't change anymore without having the actual implementations. Therefore I suggest the following steps:
a) define the different work items of flexible indexing
b) roughly plan an API that fits all items
c) develop the different items and commit them, but with APIs that are either protected or marked as experimental
d) after all items are completed and committed (and hopefully tested by some brave community members ;)) finalize the API and remove the experimental markers (or make the APIs public)

Let's start with a):

The following items come to my mind (please feel free to add/remove/complain):

- Introduce index-level metadata, preferably in XML format so that it is human readable. Later on we can store information about the index format in this file, like the codecs that are used to store the data. We should also make this public, so that users can store their own index metadata. (Remark: LUCENE-783 is also a neat idea; we could write one XML parser for both items.)

- Introduce an index format concept. Nicolas has already written a lot of code in this regard! It will include different interfaces for the different extension points (FieldsFormat, PostingFormat, DictionaryFormat); see the rough sketch after this list. We can use the XML file to store which actual formats are used in the corresponding index.

- Implement the different extensions. LUCENE-662 includes an extensible FieldsWriter, LUCENE-755 the payloads feature. Doug and Ning have already suggested nice interfaces for PostingFormat and DictionaryFormat in the payloads thread on java-dev.

- Write standard implementations of the different formats. The wiki already contains a list of desired posting formats.
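
Just to make the discussion more concrete, here is a very rough sketch of how the extension-point interfaces could look. All names and signatures below are pure guesses on my part (they don't match Nicolas' actual code or Doug's and Ning's proposals); they are only meant to illustrate the shape of the API:

  // hypothetical sketch only -- all names and signatures are just guesses
  public interface PostingFormat {
    /** Creates a writer that serializes the postings of one segment.
     *  (PostingsWriter would be a new, equally hypothetical interface.) */
    PostingsWriter postingsWriter(Directory dir, String segment) throws IOException;

    /** Creates a reader for postings written by the writer above. */
    PostingsReader postingsReader(Directory dir, String segment) throws IOException;
  }

  // FieldsFormat and DictionaryFormat could follow the same writer/reader
  // factory pattern for stored fields and the term dictionary respectively.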


I suggest we finalize this list first. Then I will add it to the wiki under Flexible indexing and gather information from the different discussions on java-dev that I already mentioned. Then we should discuss the items of this list in greater depth and plan the APIs (step b)). And then we're ready for step c) and the fun starts :-).

Michael


Grant Ingersoll wrote:
I think it makes the most sense to get flexible indexing in first, and then make payloads work with it. On the other hand, payloads looked pretty straightforward to me, whereas FI is much more involved (or at least it feels that way).

As it is right now, I would like to at least review the two patches and start thinking about them in greater depth. The payloads patch needs a little more work in that I want to integrate it with the Similarity class so people can customize their scoring.

-Grant

On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 ]

Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

Grant>
The patch I have proposed here has no dependency on LUCENE-662; I just "imported" some ideas from it and put them there. Since LUCENE-662 has evolved, the patches will probably conflict. The best one to use here is Michael's; I think it won't conflict with LUCENE-662. And if both are intended to be committed, then it is best to commit them separately and redo the work I did in the provided patch (I remember it was quite easy).


Payloads
--------

                Key: LUCENE-755
                URL: https://issues.apache.org/jira/browse/LUCENE-755
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Index
           Reporter: Michael Busch
        Assigned To: Michael Busch
        Attachments: payload.patch, payloads.patch


This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make this new feature easier to use and more efficient. A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
API and Usage
------------------------------
The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side. In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);

  /** Returns this Token's payload. */
  public Payload getPayload();
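
For illustration, a TokenFilter that attaches payloads could look roughly like this; the Payload constructor shown is an assumption based on the description above (a byte[] plus offset and length):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  /** Sketch: attaches the same one-byte payload to every token. */
  public class ConstantPayloadFilter extends TokenFilter {
    private final byte[] data = new byte[] { 1 };

    public ConstantPayloadFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token token = input.next();
      if (token != null) {
        // assumes a Payload(byte[] data, int offset, int length) constructor
        token.setPayload(new Payload(data, 0, data.length));
      }
      return token;
    }
  }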
In order to retrieve the data from the index the interface TermPositions now offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   *
   *  @return length of the current payload in number of bytes
   */
  int getPayloadLength();

  /** Returns the payload data of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   *  This method must not be called more than once after each call
   *  of {@link #nextPosition()}. However, payloads are loaded lazily,
   *  so if the payload data for the current position is not needed,
   *  this method may not be called at all for performance reasons.
   *
   *  @param data the array into which the data of this payload is to be
   *              stored, if it is big enough; otherwise, a new byte[] array
   *              is allocated for this purpose.
   *  @param offset the offset in the array into which the data of this payload
   *                is to be stored.
   *  @return a byte[] array containing the data of this payload
   *  @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;
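
For illustration, reading the payloads back could look roughly like this (field and term names are placeholders, error handling omitted):

  IndexReader reader = IndexReader.open(directory);
  TermPositions tp = reader.termPositions(new Term("body", "token"));
  byte[] buffer = new byte[0];
  while (tp.next()) {
    for (int i = 0; i < tp.freq(); i++) {
      tp.nextPosition();
      int length = tp.getPayloadLength();
      if (buffer.length < length) {
        buffer = new byte[length];          // grow the reusable buffer if needed
      }
      byte[] data = tp.getPayload(buffer, 0);
      // the payload of this position is now in data[0..length-1]
    }
  }
  tp.close();
  reader.close();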
Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
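This allows the indexer to write a payload slice straight from the user-provided array without first copying it into a temporary buffer; roughly like this (the Payload accessor names below are assumptions):

  Payload payload = token.getPayload();
  // getData()/getOffset()/getLength() are assumed accessors of the wrapper class
  proxOutput.writeVInt(payload.getLength());
  proxOutput.writeBytes(payload.getData(), payload.getOffset(), payload.getLength());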
Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
  * The DocumentWriter enables payloads for a field if one or more Tokens carry payloads.
  * The SegmentMerger enables payloads for a field during a merge if payloads are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change.
- Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted by one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is false, then the payload has the same length as the payload of the previous term occurrence. (See the sketch below.)
- In order to support skipping on the ProxFile, the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload(), then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
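
To make the same-length compression more tangible, here is a simplified sketch of the writer-side logic; this is not the actual DocumentWriter code, and the Payload accessor names are assumptions:

  /** Sketch: writes one position plus optional payload using same-length
   *  compression. Returns the payload length so the caller can pass it back
   *  in as lastPayloadLength for the next position of this term. */
  private int writePosition(IndexOutput proxOutput, int positionDelta,
                            Payload payload, int lastPayloadLength)
      throws IOException {
    int payloadLength = (payload == null) ? 0 : payload.getLength();
    if (payloadLength == lastPayloadLength) {
      proxOutput.writeVInt(positionDelta << 1);        // lowest bit 0: same length as before
    } else {
      proxOutput.writeVInt((positionDelta << 1) | 1);  // lowest bit 1: length follows
      proxOutput.writeVInt(payloadLength);
    }
    if (payloadLength > 0) {
      proxOutput.writeBytes(payload.getData(), payload.getOffset(), payloadLength);
    }
    return payloadLength;
  }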

Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
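For example, a FieldBits value of 0x21 would mean that the field is indexed (0x01, as before) and has payloads enabled (0x20).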
- ProxFile (.prx)
ProxFile (.prx) -->  <TermPositions>^TermCount
TermPositions   --> <Positions>^DocFreq
Positions       --> <PositionDelta, Payload?>^Freq
Payload         --> <PayloadLength?, PayloadData>
PositionDelta   --> VInt
PayloadLength   --> VInt
PayloadData     --> byte^PayloadLength
For payloads disabled (unchanged):
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).

For payloads enabled:
PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
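For example: if the gap to the previous position is 4 and the payload length differs from the previous payload's length, PositionDelta is written as 4*2+1 = 9, followed by PayloadLength and PayloadData; if the length is unchanged, PositionDelta is written as 4*2 = 8 and only PayloadData follows.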
- FreqFile (.frq)
SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt
For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
For payloads enabled:
DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
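For example: if the document number gap to the previous skip point is 7 and the payload length at this skip point differs from the one at the previous skip point, DocSkip is written as 7*2+1 = 15 followed by PayloadLength; if the length is unchanged, DocSkip is written as 7*2 = 14 and PayloadLength is omitted.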
This encoding is space efficient for different use cases:
* If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
* If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
* If only a few terms of a field have payloads, then we don't waste much space, because we benefit again from the same-length compression: we only have to store the length zero for the empty payloads once per term.
All unit tests pass.





------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






