Hi Michael,
Have a look at https://issues.apache.org/jira/browse/LUCENE-662
I am planning on starting on this soon (I know, I have been saying
that for a while, but I really am.) At any rate, another set of eyes
would be good and I would be interested in hearing how your version
compares/works with this patch from Nicolas.
-Grant
On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:
Hi all,
currently it is not possible to add generic payloads to a posting
list. However, this feature would be useful for various use cases.
Some examples:
- XML search
  to index XML documents and allow structured search (e.g. XPath), it
  is necessary to store the depths of the terms
- part-of-speech
  payloads can be used to store the part of speech of a term occurrence
- term boost
  for terms that occur, e.g., in bold font, a payload containing a
  boost value can be stored
- ...
The payloads feature has been requested and discussed a couple of
times, e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409
In the latter thread I proposed a design a couple of months ago
that lets Lucene store variable-length payloads inline in the
posting list of a term. However, this design had some drawbacks:
the already complex Field API was extended and the payload
encoding was not optimal in terms of disk space.
Furthermore, the overall Lucene runtime performance suffered due to
the growth of the .prx file. In the meantime the patch LUCENE-687
(Lazy skipping on proximity file) was committed, which reduces the
number of reads and seeks on the .prx file. This minimizes the
performance degradation of a bigger .prx file. Also, LUCENE-695
(Improve BufferedIndexInput.readBytes() performance) was committed,
which speeds up reading mid-size chunks of bytes. This is
beneficial for payloads that are bigger than just a few bytes.
Some weeks ago I started working on an improved design which I
would like to propose now. The new design simplifies the API
extensions (the Field API remains unchanged) and uses less disk
space in most use cases. Now there are only two classes that get
new methods:
- Token.setPayload()
  Use this method to add arbitrary metadata to a Token in the form
  of a byte[] array.
- TermPositions.getPayload()
  Use this method to retrieve the payload of a term occurrence.
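To make the two methods above concrete, here is a minimal, self-contained
sketch of the term boost use case. It does not use Lucene and is not the
patch: Token and TermPositions below are hypothetical stand-ins that only
mirror the two proposed method names, and encodeBoost(), decodeBoost() and
index() are helpers made up for illustration. The 4-byte float encoding is
an assumption as well; a payload is just a byte[], so any encoding works.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class PayloadSketch {

    /** Hypothetical stand-in for a token; NOT Lucene's Token class. */
    static final class Token {
        final String text;
        private byte[] payload; // null means "no payload on this token"

        Token(String text) { this.text = text; }

        void setPayload(byte[] payload) { this.payload = payload; }
        byte[] getPayload() { return payload; }
    }

    /** Hypothetical stand-in for the positions of one term in one document. */
    static final class TermPositions {
        private final List<Integer> positions = new ArrayList<>();
        private final List<byte[]> payloads = new ArrayList<>();
        private int cursor = -1;

        void add(int position, byte[] payload) {
            positions.add(position);
            payloads.add(payload);
        }

        boolean hasNext() { return cursor + 1 < positions.size(); }

        /** Advances to the next occurrence and returns its position. */
        int nextPosition() { return positions.get(++cursor); }

        /** Payload of the current occurrence, or null if it has none. */
        byte[] getPayload() { return payloads.get(cursor); }
    }

    /** Assumed encoding: a boost stored as a 4-byte big-endian float. */
    static byte[] encodeBoost(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    static float decodeBoost(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** Toy "indexing": collects the positions and payloads of one term. */
    static TermPositions index(List<Token> tokens, String term) {
        TermPositions tp = new TermPositions();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.text.equals(term)) {
                tp.add(i, t.getPayload());
            }
        }
        return tp;
    }

    public static void main(String[] args) {
        Token plain = new Token("lucene");   // no payload
        Token filler = new Token("is");
        Token bold = new Token("lucene");    // e.g. appeared in bold font
        bold.setPayload(encodeBoost(2.0f));

        TermPositions tp = index(List.of(plain, filler, bold), "lucene");
        while (tp.hasNext()) {
            int pos = tp.nextPosition();
            byte[] p = tp.getPayload();
            System.out.println(pos + " -> "
                + (p == null ? "no payload" : "boost " + decodeBoost(p)));
        }
    }
}
```

Note that only one of the two "lucene" tokens carries a payload, which is
the "all, some or no Tokens" case; how such mixed postings are encoded on
disk is not modeled here.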
The implementation is very flexible: the user does not have to
enable payloads explicitly for a field and can add payloads to all,
some or no Tokens. Due to the improved encoding those use cases are
handled efficiently in terms of disk space.
Another thing I would like to point out is that this feature is
backwards compatible, meaning that the file format only changes if
the user explicitly adds payloads to the index. If no payloads are
used, all data structures remain unchanged.
I'm going to open a new JIRA issue soon containing the patch and
details about implementation and file format changes.
One more comment: It is a rather big patch and this is the initial
version, so I'm sure there will be a lot of discussions. I would
like to encourage people who consider this feature useful to try
it out and give me some feedback about possible improvements.
Best regards,
- Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/