[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702343#action_12702343 ]
Shai Erera commented on LUCENE-1593: ------------------------------------ Few updates (it's been long since I posted on this issue): * I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something). The only reason to do it was to get rid of the 'if (queueFull)' check in every collect. However, it turned out that pre-populating the queue for TFC with sentinel values is unreliable. Reason is, if I want to get rid of that 'if', I don't want to add any 'if' to FieldComparator.compare, so the sentinel values should be something like MIN_VALUE or MAX_VALUE (depends on the value of 'reverse'). However, someone can set a field value to either of these values (and there are tests that do), so I need to check if FC if the current value is a sentinel, which adds that 'if' back, only worse - it's now executed in every compare() call. Unless I missed something, I don't think that's possible, or at least worth the effort (to get rid of one 'if'). ** BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken if a Scorer decided to return that value, in which case pre-populating the queue will not work either. I think it is still safe in TSDC, but want to get your opinion. * I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by Mike. But then TestSort.testMultiSort failed. So I debugged it and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, only in MultiSearcher). ** MultiSearcher's search(Weight, Filter, int, Sort) method executes a search against all its Searchables, sequentially. The test adds documents to two indexes, odd and even, so that the odd index is added two documents (A and E) with the value "5" in docs 0 and 2, and the even index is added one doc (B) with value "5" in index 0. When I use the 2 SortField version (2nd sorts by doc), the output is ABE, since the compare by doc uses the searcher's doc Ids (0, 0, 2) and B is always less than E and so preferred, even though its 'fixed' doc Id is 7 (it appears in the 2nd Searcher). When I use the 1 SortField version, the result is AEB, since B's fixed doc Id is 7, and now the code breaks a tie on their *true* doc Id. I hope you were able to follow my description. That's why I don't know which one is the true output ABE or AEB. I tend to say AEB, since those are the true doc Ids the application will get in the end. Few more comments: # TestSort.runMultiSort contains 3 versions of this test: #* One that sorts by the INT value only, relying on the doc Id tie breaking. You can see the output was not determined to be exact and so the pattern included is [ABE]{3}, as if the test's output is not predicted. I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code. #* The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E. #* The third relies on setSort's adding DOC, so it expects the same [AB]{2}E. # The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but doesn't update the DOC's field value the same as it does to the scoreDoc.doc value (adding the current Searcher's start). Again, whatever the right output is, changing Sort to not include SortField.FIELD_DOC might result in someone experiencing a change in behavior (if he used MultiSearcher), even if that behavior is buggy. What do you think? > Optimizations to TopScoreDocCollector and TopFieldCollector > ----------------------------------------------------------- > > Key: LUCENE-1593 > URL: https://issues.apache.org/jira/browse/LUCENE-1593 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Shai Erera > Fix For: 2.9 > > > This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code > to remove unnecessary checks. The plan is: > # Ensure that IndexSearcher returns segements in increasing doc Id order, > instead of numDocs(). > # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs > will always have larger ids and therefore cannot compete. > # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) > and remove the check if reusableSD == null. > # Also move to use "changing top" and then call adjustTop(), in case we > update the queue. > # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" > for the last SortField. But, doing so should not be necessary (since we > already break ties by docID), and is in fact less efficient (once the above > optimization is in). > # Investigate PQ - can we deprecate insert() and have only > insertWithOverflow()? Add a addDummyObjects method which will populate the > queue without "arranging" it, just store the objects in the array (this can > be used to pre-populate sentinel values)? > I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org