[ http://issues.apache.org/jira/browse/LUCENE-687?page=all ]
Yonik Seeley resolved LUCENE-687.
---------------------------------
Fix Version/s: 2.1
Resolution: Fixed
Assignee: Yonik Seeley
Reviewed and committed. Thanks Michael!
> Performance improvement: Lazy skipping on proximity file
> --------------------------------------------------------
>
> Key: LUCENE-687
> URL: http://issues.apache.org/jira/browse/LUCENE-687
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assigned To: Yonik Seeley
> Priority: Minor
> Fix For: 2.1
>
> Attachments: lazy_prox_skipping.patch
>
>
> Hello,
>
> I'm proposing a patch that changes
> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary skips and
> reads on the proximity stream. Currently, a call to next() or seek() that
> moves to a document in the freq file also moves the prox pointer to the
> posting list of that document. But this is only necessary if positions
> actually have to be retrieved for that particular document.
>
> Consider, for example, a phrase query with two terms: the freq pointer for
> term 1 has to move to document x to determine whether the term occurs in that
> document. But *only* if term 2 also matches document x do the positions have
> to be read to determine whether term 1 and term 2 appear next to each other
> in document x and thus satisfy the query.
>
> Moving to the posting list of a document can be quite expensive. The prox
> stream has to skip to the last skip point before that document, and then the
> documents between the skip point and the desired document have to be scanned,
> which means that the VInts of all positions of those documents have to be
> read and decoded.
>
> An improvement is to move the prox pointer to a document lazily, only when
> nextPosition() is called. This will become even more important in the future
> when the size of the proximity file increases (e.g. by adding payloads to
> the posting lists).
>
> My patch implements this lazy skipping. All unit tests pass.
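
The idea of deferring the prox work can be illustrated with a small
standalone model. This is only a sketch under simplifying assumptions, not
the code from lazy_prox_skipping.patch: the class LazyProxSketch, its fields,
and the flat int[] standing in for the prox stream are all hypothetical, and
skip lists are left out entirely. It shows next() doing bookkeeping only,
with the catch-up on the prox data happening inside nextPosition().

```java
/**
 * Toy model of lazy prox skipping. This is NOT Lucene's SegmentTermPositions;
 * the class, its fields, and the flat int[] "prox stream" are hypothetical.
 */
class LazyProxSketch {
    private final int[] freqs;    // freqs[i] = number of positions in the i-th document
    private final int[] prox;     // positions of all documents, stored back to back

    private int doc = -1;         // index of the current document
    private int proxPointer = 0;  // next unread entry in prox
    private int pendingSkip = 0;  // entries passed over by next() but not yet skipped
    private int remaining = 0;    // unread positions of the current document
    int proxReads = 0;            // entries touched in prox, whether skipped or returned

    LazyProxSketch(int[] freqs, int[] prox) {
        this.freqs = freqs;
        this.prox = prox;
    }

    /** Moves to the next document. Only bookkeeping happens here; prox is not touched. */
    boolean next() {
        pendingSkip += remaining;  // remember the unread positions of the document we leave
        doc++;
        if (doc >= freqs.length) {
            remaining = 0;
            return false;
        }
        remaining = freqs[doc];
        return true;
    }

    /** Returns the next position of the current document, catching up on prox first. */
    int nextPosition() {
        if (remaining == 0) {
            throw new IllegalStateException("no more positions in the current document");
        }
        if (pendingSkip > 0) {
            // The deferred work: a real VInt stream would have to decode these entries.
            proxReads += pendingSkip;
            proxPointer += pendingSkip;
            pendingSkip = 0;
        }
        remaining--;
        proxReads++;
        return prox[proxPointer++];
    }

    public static void main(String[] args) {
        // Three documents with 2, 3, and 1 positions respectively.
        LazyProxSketch p = new LazyProxSketch(new int[] {2, 3, 1}, new int[] {1, 5, 2, 4, 9, 7});
        p.next();
        p.next();
        p.next();
        System.out.println("prox entries touched after three next() calls: " + p.proxReads);
        System.out.println("first position of the third document: " + p.nextPosition());
        System.out.println("prox entries touched after nextPosition(): " + p.proxReads);
    }
}
```

In this model a caller that only iterates documents never touches the prox
data at all. An eager variant would instead pay for the unread positions on
every call to next().
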
>
> I am also attaching a new unit test that works as follows:
>
> Using a RAMDirectory, an index is created and test docs are added. Then the
> index is optimized to make sure it has only a single segment. This is
> important because IndexReader.open() returns an instance of SegmentReader if
> there is only one segment in the index. The proxStream field of
> SegmentReader is package protected, so the test can replace it with a
> different object. I am using a class called SeeksCountingStream that extends
> IndexInput so that it can count the number of invocations of seek().
>
> The test case then searches the index using the PhraseQuery "term1 term2".
> The number of documents matching that query is known, so the test case can
> verify that seek() on the proxStream is not called more often than the number
> of search hits.
>
> Example:
> Number of docs in the index: 500
> Number of docs that match the query "term1 term2": 5
> Invocations of seek on prox stream (old code): 29
> Invocations of seek on prox stream (patched version): 5
>
> - Michael
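
The seek-counting approach used by the unit test described above can be
sketched as a simple forwarding wrapper. This is an illustration under
assumptions, not the SeeksCountingStream class from the patch: that class
extends Lucene's IndexInput, whereas the SeekableInput interface,
ByteArrayInput, and SeekCountingInput below are hypothetical stand-ins so
that the example compiles without Lucene on the classpath.

```java
/** Hypothetical stand-in for the few IndexInput methods this sketch needs. */
interface SeekableInput {
    byte readByte();
    void seek(long pos);
    long getFilePointer();
}

/** In-memory input, present only so that the wrapper can be exercised. */
class ByteArrayInput implements SeekableInput {
    private final byte[] data;
    private int pos;

    ByteArrayInput(byte[] data) {
        this.data = data;
    }

    public byte readByte() {
        return data[pos++];
    }

    public void seek(long newPos) {
        pos = (int) newPos;
    }

    public long getFilePointer() {
        return pos;
    }
}

/** Forwards every call to the wrapped input and counts invocations of seek(). */
class SeekCountingInput implements SeekableInput {
    private final SeekableInput in;
    private int seekCount;

    SeekCountingInput(SeekableInput in) {
        this.in = in;
    }

    public byte readByte() {
        return in.readByte();
    }

    public void seek(long pos) {
        seekCount++;  // the only behavior added on top of plain forwarding
        in.seek(pos);
    }

    public long getFilePointer() {
        return in.getFilePointer();
    }

    int getSeekCount() {
        return seekCount;
    }

    public static void main(String[] args) {
        SeekCountingInput counting =
            new SeekCountingInput(new ByteArrayInput(new byte[] {10, 20, 30, 40}));
        counting.seek(2);
        counting.readByte();
        counting.seek(0);
        System.out.println("seek() invocations: " + counting.getSeekCount());
    }
}
```

A test built along these lines would install the counting wrapper in place
of the real stream, run the query, and compare getSeekCount() with the
number of hits.
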