I have a working copy using the news groups in the issue, but need to split it out into a shorter version, as suggested by some earlier threads concerning the issue. I hope to get it committed this week, or failing that, early next week.

-Grant

On Oct 18, 2006, at 4:26 PM, Steven Parkes wrote:

Any idea when you're going to post a snapshot of your 675 stuff, Grant?

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 18, 2006 1:18 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-687) Performance improvement:
Lazy skipping on proximity file

Can you share your performance test as well as the results?

http://issues.apache.org/jira/browse/LUCENE-675

Thanks,
Grant

On Oct 18, 2006, at 3:41 PM, Michael Busch (JIRA) wrote:

    [ http://issues.apache.org/jira/browse/LUCENE-687?page=comments#action_12443343 ]

Michael Busch commented on LUCENE-687:
--------------------------------------

Hi Yonik,

thanks for the quick reply! I'm going to do performance tests and
will give you some numbers soon.

Performance improvement: Lazy skipping on proximity file
--------------------------------------------------------

                Key: LUCENE-687
                URL: http://issues.apache.org/jira/browse/LUCENE-687
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Michael Busch
           Priority: Minor
        Attachments: lazy_prox_skipping.patch


Hello,

I'm proposing a patch here that changes
org.apache.lucene.index.SegmentTermPositions to avoid unnecessary
skips and reads on the proximity stream. Currently, a call to next()
or seek() that moves to a document in the freq file also moves the
prox pointer to the posting list of that document. But this is only
necessary if actual positions have to be retrieved for that
particular document.

Consider for example a phrase query with two terms: the freq
pointer for term 1 has to move to document x to answer the
question of whether the term occurs in that document. But the
positions have to be read *only* if term 2 also matches document x,
to figure out whether term 1 and term 2 appear next to each other
in document x and thus satisfy the query.

A move to the posting list of a document can be quite expensive.
The prox pointer has to skip to the last skip point before that
document, and then the documents between the skip point and the
desired document have to be scanned, which means that the VInts of
all positions of those documents have to be read and decoded.

An improvement is to move the prox pointer lazily to a document
only if nextPosition() is called. This will become even more
important in the future when the size of the proximity file
increases (e.g. by adding payloads to the posting lists).

My patch implements this lazy skipping. All unit tests pass.
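
The lazy skipping idea can be illustrated with a minimal sketch. This is
not the attached patch and does not use Lucene's classes: LazyPositions,
positionsByDoc and proxMoves are hypothetical names, and the postings
are held in a plain array instead of the freq and prox files. It only
shows that next() leaves the prox pointer alone and that nextPosition()
catches up on first use.

```java
/** Hypothetical cursor over one term's postings; a simplified stand-in, not SegmentTermPositions. */
class LazyPositions {
    private final int[][] positionsByDoc; // positionsByDoc[d] = positions of the term in doc d
    private int doc = -1;       // document the (simulated) freq pointer is on
    private int proxDoc = -1;   // document the (simulated) prox pointer was last moved to
    private int posIndex = 0;   // next position to return within proxDoc
    private int proxMoves = 0;  // how often the prox pointer had to be moved

    LazyPositions(int[][] positionsByDoc) {
        this.positionsByDoc = positionsByDoc;
    }

    /** Advances the freq pointer only; the prox pointer is not touched. */
    boolean next() {
        if (doc + 1 >= positionsByDoc.length) {
            return false;
        }
        doc++;
        return true;
    }

    /** Returns the next position in the current doc, moving the prox pointer on first use. */
    int nextPosition() {
        if (proxDoc != doc) {   // lazy: catch up only when positions are really needed
            proxDoc = doc;
            posIndex = 0;
            proxMoves++;
        }
        return positionsByDoc[doc][posIndex++];
    }

    int doc() { return doc; }

    int proxMoves() { return proxMoves; }
}

class LazyProxDemo {
    public static void main(String[] args) {
        LazyPositions tp = new LazyPositions(new int[][] {{1, 4}, {2}, {7, 9}});
        while (tp.next()) {
            if (tp.doc() == 2) {   // pretend only doc 2 also matches the other term
                System.out.println("first position in doc 2: " + tp.nextPosition());
            }
        }
        System.out.println("prox moves: " + tp.proxMoves());
    }
}
```

In this sketch the prox pointer moves once, for the single document
whose positions are requested, no matter how many documents next()
passes over. The sketch does no bounds checking, so nextPosition() must
only be called after next() has returned true.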

I also attach a new unit test that works as follows:

An index is created in a RAMDirectory and test docs are added.
Then the index is optimized to make sure it only has a single
segment. This is important, because IndexReader.open() returns an
instance of SegmentReader if there is only one segment in the
index. The proxStream instance of SegmentReader is
package-private, so it is possible to set proxStream to a different
object. I am using a class called SeeksCountingStream that extends
IndexInput so that it can count the number of invocations of
seek().

Then the testcase searches the index using a PhraseQuery "term1
term2". It is known how many documents match that query, and the
testcase can verify that seek() on the proxStream is not called
more often than the number of search hits.

Example:
  Number of docs in the index: 500
  Number of docs that match the query "term1 term2": 5
  Invocations of seek on prox stream (old code): 29
  Invocations of seek on prox stream (patched version): 5
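
The seek-counting approach used by the test can be sketched roughly as
follows. This is not the SeeksCountingStream from the attachment: to
stay self-contained it wraps a hypothetical SeekableInput interface
instead of extending IndexInput, and ByteArrayInput and
SeekCountingInput are made-up names for illustration only.

```java
/** Hypothetical minimal input abstraction; stands in for Lucene's IndexInput. */
interface SeekableInput {
    byte readByte();

    void seek(long pos);
}

/** In-memory input used only for this illustration. */
class ByteArrayInput implements SeekableInput {
    private final byte[] data;
    private int pos = 0;

    ByteArrayInput(byte[] data) {
        this.data = data;
    }

    public byte readByte() {
        return data[pos++];
    }

    public void seek(long pos) {
        this.pos = (int) pos;
    }
}

/** Delegating wrapper that counts seek() calls and forwards everything else unchanged. */
class SeekCountingInput implements SeekableInput {
    private final SeekableInput delegate;
    private int seekCount = 0;

    SeekCountingInput(SeekableInput delegate) {
        this.delegate = delegate;
    }

    public byte readByte() {
        return delegate.readByte();
    }

    public void seek(long pos) {
        seekCount++;            // record the call, then forward it
        delegate.seek(pos);
    }

    int seekCount() {
        return seekCount;
    }
}
```

A test built this way would install the counting wrapper in place of the
real stream, run the query, and then compare seekCount() against the
known number of hits.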
- Michael

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org






------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/


