[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Shai Erera (JIRA) Fri, 24 Apr 2009 05:45:55 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702343#action_12702343
 ]


Shai Erera commented on LUCENE-1593:
------------------------------------

Few updates (it's been long since I posted on this issue):
* I tried to move to pre-populate the queue in TFC, but that proved to be 
impossible (unless I missed something). The only reason to do it was to get rid 
of the 'if (queueFull)' check in every collect. However, it turned out that 
pre-populating the queue for TFC with sentinel values is unreliable. Reason is, 
if I want to get rid of that 'if', I don't want to add any 'if' to 
FieldComparator.compare, so the sentinel values should be something like 
MIN_VALUE or MAX_VALUE (depends on the value of 'reverse'). However, someone 
can set a field value to either of these values (and there are tests that do), 
so I need to check if FC if the current value is a sentinel, which adds that 
'if' back, only worse - it's now executed in every compare() call. Unless I 
missed something, I don't think that's possible, or at least worth the effort 
(to get rid of one 'if').
** BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken 
if a Scorer decided to return that value, in which case pre-populating the 
queue will not work either. I think it is still safe in TSDC, but want to get 
your opinion.

* I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by 
Mike. But then TestSort.testMultiSort failed. So I debugged it and either the 
test works fine but there's a bug in MultiSearcher, or the test does not work 
fine (or should be adjusted) but then we'll be changing runtime behavior of 
Sort (e.g., everyone who used setSort might get undesired behavior, only in 
MultiSearcher).
** MultiSearcher's search(Weight, Filter, int, Sort) method executes a search 
against all its Searchables, sequentially. The test adds documents to two 
indexes, odd and even, so that the odd index is added two documents (A and E) 
with the value "5" in docs 0 and 2, and the even index is added one doc (B) 
with value "5" in index 0. When I use the 2 SortField version (2nd sorts by 
doc), the output is ABE, since the compare by doc uses the searcher's doc Ids 
(0, 0, 2) and B is always less than E and so preferred, even though its 'fixed' 
doc Id is 7 (it appears in the 2nd Searcher). When I use the 1 SortField 
version, the result is AEB, since B's fixed doc Id is 7, and now the code 
breaks a tie on their *true* doc Id.

I hope you were able to follow my description. That's why I don't know which 
one is the true output ABE or AEB. I tend to say AEB, since those are the true 
doc Ids the application will get in the end. Few more comments:
# TestSort.runMultiSort contains 3 versions of this test:
#* One that sorts by the INT value only, relying on the doc Id tie breaking. 
You can see the output was not determined to be exact and so the pattern 
included is [ABE]{3}, as if the test's output is not predicted. I think that's 
wrong in the first place, we should always have predicted tests output, since 
we're not involving any randomness in that code.
#* The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E.
#* The third relies on setSort's adding DOC, so it expects the same [AB]{2}E.
# The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but 
doesn't update the DOC's field value the same as it does to the scoreDoc.doc 
value (adding the current Searcher's start).

Again, whatever the right output is, changing Sort to not include 
SortField.FIELD_DOC might result in someone experiencing a change in behavior 
(if he used MultiSearcher), even if that behavior is buggy.

What do you think?

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
>                 Key: LUCENE-1593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1593
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segements in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Reply via email to