[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Michael McCandless (JIRA) Tue, 28 Apr 2009 03:17:53 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703570#action_12703570
 ]


Michael McCandless commented on LUCENE-1593:
--------------------------------------------

bq. Ok sleeping did help.

OK...good morning (afternoon)!

bq. BTW, we should be aware that this means anyone using HitQueue needs to know 
that upon initialization it's filled with sentinel objects, and that its size() 
will be maxSize etc. Since HQ is package private I don't have a problem with it.

Good point -- can you update HitQueue's javadocs stating this new
behavior?

bq. BTW, IndexSearch.doSearch creates the Scorer, but already receives the 
Collector as argument, therefore at this point it's too late to make any 
decisions regarding orderness of docs, no?

Urgh yeah right.  So docsInOrder() should be added to Weight or Query
I guess.  Maybe name it "scoresDocsInOrder()"?

bq. Add docsInOrder to Weight (it's an interface, therefore just in 3.0)

Aside: maybe for 3.0 we should do a hard cutover of any remaining
interfaces, like Weight, Fieldable (if we don't replace it), etc. to
abstract classes?

bq. Remember that it may be used by IndexSearcher in two modes: (1) without a 
filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo().

I'd really love to find a way to make this more explicit.  EG you ask
the Weight for a topScorer() vs a scorer(), or something.  Clearly the
caller of .scorer() knows full well how the instance will be used (top
or not)... we keep struggling with BS/2 because this information is
not explicit now.

This would also enable BQ.scorer() to directly return a BS vs BS2,
rather than the awkward BS2 wrapping a BS internally in its
score(Collector) method.

So how about adding a "topScorer()" method, that defaults to scorer()?
(Ugh, we can't do that until 3.0 since Weight is an interface).

But actually: the thing calling scoresDocsInOrder will in fact only be
calling that method if it intends to use the scorer as a toplevel
scorer (ie, it will call scorer.score(Collector), not
scorer.next/skipTo)?  So BQ.scoresDocsInOrder would first check if
it's gonna defer to BS (it's a simple OR query) and then check BS's
static setting.

bq. In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create 
out-of-order collectors. That might impact performance if there are no BQ 
clauses, but I assume it is not used much? And this doesn't break back-compat 
since that's the only way to instantiate an out-of-order Scorer today (besides 
creating your own).

This back compat break worries me a bit.  EG maybe Solr has some
scorers that run out-of-order?

Also this is not really a clean solution: sure, it's only BQ today
that does out-of-order scoring, but I can see others doing it in the
future.  I can also see making out-of-order-scoring more common and in
fact the default (again) for BQ, since it does give good performance
gains.  Maybe other Query scorers should use it too.

So I think I'd prefer the "add scoresDocsOutOfOrder" to Query.  Note
that this must be called on the rewritten Query.

bq. And since they can always create the Query and only then create the 
Collector, if we add that info to Query they should have enough information at 
hand to create the proper Collector instance.

Right, adding the method to Query gives expert users the tools needed
to make their own efficient collectors.

bq. If we do add it to Query, then I'd like to deprecate BQ's static setter and 
getter of that attribute and provide a docsInOrder() impl, but we need to 
resolve how it will know whether it will use BS or BS2.

OK, you mean make this a non-static setter/getter?  I would actually
prefer to default it to "true", as well.  (It's better performance).
But that should wait for 3.0 (be sure to open a "for 3.0" followon
issue for this one, too!).

BTW, even though BS does score docs out of order, it still visits docs
in order "by bucket".  Meaning when visiting docs, you know the docIDs
are always greater than the last bucket's docIDs.

This gives us a source of optimization: if we hold onto the "bottom
value in the pqueue as of the last bucket", and then we can first
compare that bottom value (with no tie breaking by docID needed) and
if that competes, then we do the "true, current bottomValue with
breaking tie by docID" comparison to see if it makes it into the
queue.  But I'm not sure how we'd cleanly model this "docIDs are in
order by bucket" case of the scorer, and take advantage of that during
collection.  I think it'd require extending the FieldComparator API
somehow, eg "get me a BottomValueComparator instance as of right
now".

This can also be the basis for a separate strong optimization, which
is down in the TermScorer for a BooleanScorer/2, skip a docID if its
field value is not competitive.  This is a bigger change (way outside
the scope of this issue, and in fact more related to LUCENE-1536).


> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
>                 Key: LUCENE-1593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1593
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segements in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Reply via email to