Re: Payloads

Grant Ingersoll Wed, 20 Dec 2006 06:32:04 -0800

Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying that for a while, but I really am.) At any rate, another set of eyes would be good and I would be interested in hearing how your version compares/works with this patch from Nicolas.


-Grant

On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:

Hi all,
currently it is not possible to add generic payloads to a posting list. However, this feature would be useful for various use cases. Some examples:
- XML search
to index XML documents and allow structured search (e.g. XPath) it is necessary to store the depths of the terms
- part-of-speech
 payloads can be used to store the part of speech of a term occurrence
- term boost
for terms that occur e.g. in bold font a payload containing a boost value can be stored
- ...
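
To make the use cases above concrete, here is a minimal sketch of how a term boost or an XML depth could be packed into a byte array and read back. The class and method names are hypothetical and the encodings (a 4-byte float, a single unsigned byte for the depth) are assumptions chosen for illustration; the proposal itself does not prescribe any particular encoding.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Lucene or of the patch.
class PayloadEncodingSketch {

    // Term boost: a float stored as 4 bytes in ByteBuffer's default (big-endian) order.
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(boost).array();
    }

    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    // XML depth: a single byte, assuming the nesting depth fits in 0..255.
    static byte[] encodeDepth(int depth) {
        if (depth < 0 || depth > 255) {
            throw new IllegalArgumentException("depth out of range: " + depth);
        }
        return new byte[] { (byte) depth };
    }

    static int decodeDepth(byte[] payload) {
        // Mask to undo sign extension so 200 does not come back as -56.
        return payload[0] & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(decodeBoost(encodeBoost(2.5f)));
        System.out.println(decodeDepth(encodeDepth(200)));
    }
}
```

A one-byte depth keeps the per-occurrence overhead small, which matters because the payload is stored for every position of a term.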
The payloads feature has been requested and discussed a couple of times, e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409
In the latter thread I proposed a design a couple of months ago that lets Lucene store variable-length payloads inline in the posting list of a term. However, this design had some drawbacks: the already complex field API was extended and the payload encoding was not optimal in terms of disk space. Furthermore, the overall Lucene runtime performance suffered due to the growth of the .prx file. In the meantime the patch LUCENE-687 (Lazy skipping on proximity file) was committed, which reduces the number of reads and seeks on the .prx file. This minimizes the performance degradation caused by a bigger .prx file. Also, LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was committed, which speeds up reading mid-size chunks of bytes; this is beneficial for payloads that are bigger than just a few bytes.
Some weeks ago I started working on an improved design which I would like to propose now. The new design simplifies the API extensions (the Field API remains unchanged) and uses less disk space in most use cases. Now there are only two classes that get new methods:
- Token.setPayload()
Use this method to add arbitrary metadata to a Token in the form of a byte[] array.
- TermPositions.getPayload()
 Use this method to retrieve the payload of a term occurrence.
The implementation is very flexible: the user does not have to enable payloads explicitly for a field and can add payloads to all, some or no Tokens. Due to the improved encoding those use cases are handled efficiently in terms of disk space.
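
As a rough illustration of the two methods described above, the following self-contained sketch models a token stream in which some tokens carry a payload and some do not. The nested Token and TermPositions classes are simplified stand-ins written for this example, not Lucene's classes; only the method names setPayload() and getPayload() come from the proposal, and their signatures here are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PayloadApiSketch {

    // Stand-in for an analyzer token (hypothetical, not Lucene's Token).
    static class Token {
        final String term;
        private byte[] payload; // null means this occurrence has no payload

        Token(String term) { this.term = term; }

        // Assumed signature; the proposal only names Token.setPayload().
        void setPayload(byte[] payload) { this.payload = payload; }
    }

    // Stand-in for iterating the occurrences of one term (hypothetical).
    static class TermPositions {
        private final List<byte[]> payloads = new ArrayList<>();
        private int current = -1;

        // "Indexing": remember the payload (possibly null) of each matching token.
        TermPositions(String term, List<Token> tokens) {
            for (Token t : tokens) {
                if (t.term.equals(term)) payloads.add(t.payload);
            }
        }

        // Advance to the next occurrence; false once all have been visited.
        boolean next() { return ++current < payloads.size(); }

        // Assumed signature; the proposal only names TermPositions.getPayload().
        byte[] getPayload() { return payloads.get(current); }
    }

    // Tokens where some occurrences carry a part-of-speech payload and some do not.
    static List<Token> sampleTokens() {
        List<Token> tokens = new ArrayList<>();
        Token verb = new Token("run");
        verb.setPayload("verb".getBytes(StandardCharsets.UTF_8));
        Token noun = new Token("run");
        noun.setPayload("noun".getBytes(StandardCharsets.UTF_8));
        tokens.add(verb);
        tokens.add(new Token("fast")); // no payload
        tokens.add(noun);
        tokens.add(new Token("run"));  // no payload
        return tokens;
    }

    // Collect a printable form of the payload of every occurrence of `term`.
    static List<String> payloadsFor(String term, List<Token> tokens) {
        List<String> out = new ArrayList<>();
        TermPositions tp = new TermPositions(term, tokens);
        while (tp.next()) {
            byte[] p = tp.getPayload();
            out.add(p == null ? "<none>" : new String(p, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(payloadsFor("run", sampleTokens()));
    }
}
```

The sketch mixes occurrences with and without payloads on purpose, mirroring the point that payloads can be added to all, some or no Tokens. How the real patch represents a missing payload is not stated here, so the use of null is an assumption.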
Another thing I would like to point out is that this feature is backwards compatible, meaning that the file format only changes if the user explicitly adds payloads to the index. If no payloads are used, all data structures remain unchanged.
I'm going to open a new JIRA issue soon containing the patch and details about implementation and file format changes.
One more comment: It is a rather big patch and this is the initial version, so I'm sure there will be a lot of discussion. I would like to encourage people who consider this feature useful to try it out and give me some feedback about possible improvements.
Best regards,
- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/


