Re: Payload Loading and Reloading

Grant Ingersoll Thu, 29 Nov 2007 15:02:14 -0800

The use case I have is for LUCENE-1001, so the caching is going to happen somewhere in Lucene, not necessarily the application. I think caching it in SegmentTermPositions is the simplest, but I will have to look at the alternatives. It is particularly problematic in the Near Spans case (ordered and unordered) but maybe I can address it there.

As for the cost of the seeks, why can't we just document that this is what is going on and discourage people from doing it? However, if they really feel they need to call it again, why not let them? After all, it's still cheaper than going back to the beginning and starting over. Just b/c you can call something twice doesn't mean you must.


-Grant

On Nov 29, 2007, at 5:34 PM, Michael Busch wrote:

I designed the API with this limitation intentionally to prevent users
from thinking that they can call TermPositions.getPayload() more than
once with no costs.

If we allow it to be called more than once, then we have to seek back in
the posting stream. Even if this is just a seek in the underlying
IndexInput buffer, we still have to perform an arraycopy from that
buffer to the array that getPayload() returns. If the beginning of the
payload is already outside the current buffer, then a seek on the HD
will happen in addition, which is even more expensive.
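To make the cost above concrete, here is a minimal sketch of the "remember the position and seek back" approach. It is only an illustration: SeekableBytes is a hypothetical in-memory stand-in for the posting stream (it is not Lucene's IndexInput, it only borrows the names getFilePointer() and seek()), and ReloadablePayloadReader and atPayload() are invented for this example.

```java
/** Hypothetical in-memory stand-in for the posting stream (not Lucene's IndexInput). */
class SeekableBytes {
    private final byte[] data;
    private int pointer;
    int seeks; // counts seek() calls so the extra work is visible

    SeekableBytes(byte[] data) { this.data = data; }

    int getFilePointer() { return pointer; }

    void seek(int pos) { pointer = pos; seeks++; }

    void readBytes(byte[] dest, int offset, int len) {
        System.arraycopy(data, pointer, dest, offset, len);
        pointer += len;
    }
}

/** Hypothetical reader that allows a payload to be loaded repeatedly. */
class ReloadablePayloadReader {
    private final SeekableBytes in;
    private int payloadStart = -1;
    private int payloadLength;
    private boolean consumed;

    ReloadablePayloadReader(SeekableBytes in) { this.in = in; }

    /** Call when the stream is positioned at the first byte of a payload. */
    void atPayload(int length) {
        payloadStart = in.getFilePointer(); // remember where the payload begins
        payloadLength = length;
        consumed = false;
    }

    byte[] getPayload() {
        if (consumed) {
            in.seek(payloadStart); // repeat call: go back to the saved position
        }
        byte[] out = new byte[payloadLength];
        in.readBytes(out, 0, payloadLength); // the copy happens on every call
        consumed = true;
        return out;
    }
}
```

In this sketch every repeated call pays for one seek plus one array copy, and the stream ends up just past the payload again afterwards. With a real file-backed stream the seek may also leave the current buffer, which is the more expensive case.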

So I'd like to keep the API as is. An application should always be able
to buffer a payload byte[] array if it needs to access it more than
once. For convenience, a user could also create a very simple
TermPositions decorator that caches the most recently loaded payload and
allows calling getPayload() more than once.
However, I hesitate to add such payload caching to
SegmentTermPositions, because the size of the payloads is
application-specific, and so should be the policy that grows/shrinks a
caching byte[] array.
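
A rough sketch of such a decorator might look like the following. To keep it self-contained it wraps a hypothetical PayloadSource interface that models only the payload-related part of TermPositions; the stand-in declares no checked exceptions, whereas the real methods throw IOException. A real decorator would implement the full TermPositions interface and would also need to clear the cache in next() and skipTo(). OneShotSource is an invented test double that enforces the load-once rule. The grow-only buffer is just one possible policy.

```java
/** Hypothetical stand-in for the payload-related part of TermPositions. */
interface PayloadSource {
    int nextPosition();
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);
}

/** Decorator that caches the payload of the current position. */
class CachingPayloadSource implements PayloadSource {
    private final PayloadSource in;
    private byte[] cache = new byte[0];
    private int cachedLength = -1; // -1 means nothing cached for this position

    CachingPayloadSource(PayloadSource in) { this.in = in; }

    public int nextPosition() {
        cachedLength = -1; // a new position invalidates the cached payload
        return in.nextPosition();
    }

    public int getPayloadLength() { return in.getPayloadLength(); }

    public byte[] getPayload(byte[] data, int offset) {
        if (cachedLength < 0) {
            int len = in.getPayloadLength();
            if (cache.length < len) cache = new byte[len]; // grow-only policy (an assumption)
            in.getPayload(cache, 0); // the single load from the wrapped source
            cachedLength = len;
        }
        if (data == null || data.length - offset < cachedLength) {
            data = new byte[cachedLength]; // caller's array is missing or too small
            offset = 0;
        }
        System.arraycopy(cache, 0, data, offset, cachedLength);
        return data;
    }
}

/** Invented test double: each payload may be loaded only once. */
class OneShotSource implements PayloadSource {
    private final byte[][] payloads;
    private int pos = -1;
    private boolean loaded;

    OneShotSource(byte[][] payloads) { this.payloads = payloads; }

    public int nextPosition() { loaded = false; return ++pos; }

    public int getPayloadLength() { return payloads[pos].length; }

    public byte[] getPayload(byte[] data, int offset) {
        if (loaded) throw new IllegalStateException("Payload cannot be loaded more than once");
        loaded = true;
        byte[] p = payloads[pos];
        if (data == null || data.length - offset < p.length) {
            data = new byte[p.length];
            offset = 0;
        }
        System.arraycopy(p, 0, data, offset, p.length);
        return data;
    }
}
```

Wrapping a OneShotSource in a CachingPayloadSource lets getPayload() be called repeatedly at the same position, while the wrapped source is still read only once per position.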

-Michael

Grant Ingersoll wrote:

In working on LUCENE-1001, things are getting a bit complicated with
loading payloads in overlapping spans (which causes the dreaded "Can't
load payload more than once" error).
This got me thinking about why we need the rule that payloads can only
be loaded once. I forget the reasoning behind this. Can we just store
the current position before we load the payload and then seek back
to that point if we need to load the payload again?  I suppose in the
case of really large payloads the seek on the IndexInput could be
expensive, but in reality, most payloads aren't likely to be more than a
few bytes, right?  There also seem to be some interactions with the
lazy skipping that I haven't quite pinned down yet.  What else am I
forgetting?
The other alternative I can think of is I could cache the payloads, but
that seems unwieldy too.

-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



