Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying that for a while, but I really am.) At any rate, another set of eyes would be good and I would be interested in hearing how your version compares/works with this patch from Nicolas.

-Grant

On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:

Hi all,

Currently it is not possible to add generic payloads to a posting list. However, this feature would be useful for various use cases. Some examples:
- XML search
to index XML documents and allow structured search (e.g. XPath) it is necessary to store the depths of the terms
- part-of-speech
payloads can be used to store the part of speech of a term occurrence
- term boost
for terms that occur e.g. in bold font a payload containing a boost value can be stored
- ...

The payloads feature has been requested and discussed a couple of times, e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In the latter thread I proposed a design a couple of months ago that lets Lucene store variable-length payloads inline in the posting list of a term. However, this design had some drawbacks: the already complex Field API was extended and the payload encoding was not optimal in terms of disk space. Furthermore, the overall Lucene runtime performance suffered due to the growth of the .prx file.

In the meantime the patch LUCENE-687 (Lazy skipping on proximity file) was committed, which reduces the number of reads and seeks on the .prx file. This minimizes the performance degradation caused by a bigger .prx file. Also, LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was committed, which speeds up reading mid-size chunks of bytes. That is beneficial for payloads that are bigger than just a few bytes.

Some weeks ago I started working on an improved design which I would like to propose now. The new design simplifies the API extensions (the Field API remains unchanged) and uses less disk space in most use cases. Now there are only two classes that get new methods:
- Token.setPayload()
Use this method to add arbitrary metadata to a Token in the form of a byte[].
- TermPositions.getPayload()
Use this method to retrieve the payload of a term occurrence.
The implementation is very flexible: the user does not have to enable payloads explicitly for a field and can add payloads to all, some or no Tokens. Due to the improved encoding those use cases are handled efficiently in terms of disk space.
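
To make the term boost use case concrete, here is a minimal, self-contained sketch of how a boost value could be packed into a byte[] payload and read back. It does not use Lucene: SketchToken is a hypothetical stand-in that only mirrors the idea of Token.setPayload(), and the 4-byte float encoding and the default boost of 1.0 for occurrences without a payload are assumptions made for illustration, not part of the proposed patch.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a token that can carry a payload.
// This is NOT Lucene's Token class; it only illustrates the idea.
final class SketchToken {
    private final String term;
    private byte[] payload; // null means "no payload for this occurrence"

    SketchToken(String term) { this.term = term; }

    String term() { return term; }
    void setPayload(byte[] payload) { this.payload = payload; }
    byte[] getPayload() { return payload; }
}

class PayloadSketch {
    // Assumption: a boost is stored as a 4-byte big-endian IEEE 754 float.
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    // Assumption: an occurrence without a payload gets the neutral boost 1.0.
    static float decodeBoost(byte[] payload) {
        return payload == null ? 1.0f : ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        SketchToken bold = new SketchToken("lucene");
        bold.setPayload(encodeBoost(2.5f));             // term occurred in bold font

        SketchToken plain = new SketchToken("search");  // no payload attached

        System.out.println(bold.term() + " boost=" + decodeBoost(bold.getPayload()));
        System.out.println(plain.term() + " boost=" + decodeBoost(plain.getPayload()));
    }
}
```

Because a payload is just a byte[], the same pattern would apply to the other use cases, for example a single byte for an XML depth or a part-of-speech tag.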

Another thing I would like to point out is that this feature is backwards compatible, meaning that the file format only changes if the user explicitly adds payloads to the index. If no payloads are used, all data structures remain unchanged.

I'm going to open a new JIRA issue soon containing the patch and details about implementation and file format changes.

One more comment: it is a rather big patch and this is the initial version, so I'm sure there will be a lot of discussion. I would like to encourage people who consider this feature useful to try it out and give me some feedback about possible improvements.

Best regards,
- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/


