[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

Marvin Humphrey (JIRA) Sat, 20 Feb 2010 08:24:52 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836197#action_12836197
 ]


Marvin Humphrey commented on LUCENE-2271:
-----------------------------------------

An awful lot of thought went into optimizing those collection algorithms.  I
disagree with many of the design decisions that were made, but it seems rushed
to blithely revert those optimizations.

FWIW, the SortCollector in KS (designed on the Lucy list last spring, would be
in Lucy but some prereqs haven't gone in yet) doesn't have the problem with
-Inf sentinels.  It uses an array of "actions" representing sort rules to
determine whether a hit is "competitive" and should be inserted into the
queue; the first action is set to AUTO_ACCEPT (meaning try inserting the hit
into the queue) until the queue fills up, and then again to AUTO_ACCEPT at the
start of each segment.  It's not necessary to fill up the queue with dummy
hits beforehand.

{code:none}
static INLINE bool_t
SI_competitive(SortCollector *self, i32_t doc_id)
{
    u8_t *const actions = self->actions;
    u32_t i = 0;

    /* Iterate through our array of actions, returning as quickly as
     * possible. */
    do {
        switch (actions[i] & ACTIONS_MASK) {
            case AUTO_ACCEPT:
                return true;
            case AUTO_REJECT:
                return false;
            case AUTO_TIE:
                break;
            case COMPARE_BY_SCORE: {
                    float score = Matcher_Score(self->matcher);
                    if  (score > self->bubble_score) {
                        self->bumped->score = score;
                        return true;
                    }
                    else if (score < self->bubble_score) {
                        return false;
                    }
                }
                break;
            case COMPARE_BY_SCORE_REV: {
                // ...
            case COMPARE_BY_DOC_ID:
                // ...
            case COMPARE_BY_ORD1: {
                // ...
        }
    } while (++i < self->num_actions);

    /* If we've made it this far and we're still tied, reject the doc so that
     * we prefer items already in the queue.  This has the effect of
     * implicitly breaking ties by doc num, since docs are collected in order.
     */
    return false;
}
{code}

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2271
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2271
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9.2, 3.0.1, 3.1
>
>         Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, 
> LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>
> This is a foolowup to LUCENE-2270, where a part of this problem was fixed 
> (boost = 0 leading to NaN scores, which is also un-intuitive), but in 
> general, function queries in Solr can create these invalid scores easily. In 
> previous version of Lucene these scores ordered correct (except NaN, which 
> mixes up results), but never invalid document ids are returned (like 
> Integer.MAX_VALUE).
> The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel 
> ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ 
> to work, this sentinel must be smaller than all posible values, which is not 
> the case:
> - -inf is equal and the document is not inserted into the HQ, as not 
> competitive, but the HQ is not yet full, so the sentinel values keep in the 
> HQ and result is the Integer.MAX_VALUE docs. This problem is solveable (and 
> only affects the Ordered collector) by chaning the exit condition to:
> {code}
> if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
>     // Since docs are returned in-order (i.e., increasing doc Id), a document
>     // with equal score to pqTop.score cannot compete since HitQueue favors
>     // documents with lower doc Ids. Therefore reject those docs too.
>     return;
> }
> {code}
> - The NaN case can be fixed in the same way, but then has another problem: 
> all comparisons with NaN result in false (none of these is true): x < NaN, x 
> > NaN, NaN == NaN. This leads to the fact that HQ's lessThan always returns 
> false, leading to unexspected ordering in the PQ and sometimes the sentinel 
> values do not stay at the top of the queue. A later hit then overrides the 
> top of the queue but leaves the incorrect sentinels  unchanged -> invalid 
> results. This can be fixed in two ways in HQ:
> Force all sentinels to the top:
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.doc == Integer.MAX_VALUE)
>       return true;
>     if (hitB.doc == Integer.MAX_VALUE)
>       return false;
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc; 
>     else
>       return hitA.score < hitB.score;
> }
> {code}
> or alternatively have a defined order for NaN (Float.compare sorts them after 
> +inf):
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc; 
>     else
>       return Float.compare(hitA.score, hitB.score) < 0;
> }
> {code}
> The problem with both solutions is, that we have now more comparisons per hit 
> and the use of sentinels is questionable. I would like to remove the 
> sentinels and use the old pre 2.9 code for comparing and using PQ.add() when 
> a competitive hit arrives. The order of NaN would be unspecified.
> To fix the order of NaN, it would be better to replace all score comparisons 
> by Float.compare() [also in FieldComparator].
> I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and 
> solved.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

Reply via email to