Re: Repeatability of results

Alan Bawden Wed, 04 Apr 2012 15:16:23 -0700

So I sat down to try to make a small test case that exhibited this
behavior, and while I was working on that I thought of a possible
explanation for what we are seeing.  If you agree that my explanation is
what's going on here, then Benson and I can stop working on making a test
case, and move on to figuring out how we can live with what may be
unavoidable behavior.


The key observation is that the differences in scores we see are always
down around the sixth decimal place, which is right where 32-bit floating
point (roughly seven significant decimal digits) runs out of precision.  So
what we're seeing is most likely just a consequence of the fact that
floating-point addition isn't associative.
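
To make the non-associativity concrete, here is a minimal Java sketch.  The
class and method names are made up for illustration, and the values are
chosen to exaggerate the effect; this is not Lucene code.  It adds the same
three floats with two different groupings:

```java
class FloatAssociativity {
    // Sums three floats grouped as (a + b) + c.
    static float leftFirst(float a, float b, float c) {
        return (a + b) + c;
    }

    // Sums the same three floats grouped as a + (b + c).
    static float rightFirst(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        // (a + b) is exactly 0, so adding c gives 1.0.
        System.out.println(leftFirst(a, b, c));   // prints 1.0
        // (b + c) rounds back to -1.0e8f because floats near 1e8 are
        // spaced 8 apart, so the final sum is 0.0.
        System.out.println(rightFirst(a, b, c));  // prints 0.0
    }
}
```

The same rounding happens on a much smaller scale whenever values of
different magnitudes are added, which is where a wobble in the last digit
or two of a float would come from.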

In theory, the order of the documents in an index doesn't matter when
computing a score, but if the documents are stored in a different order,
any quantity that is computed using floating point by iterating over the
set of documents may come out differently due to changes in the order in
which the documents are processed.
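
As a rough illustration of that order dependence, the sketch below sums one
fixed set of float values front to back and then back to front.  The array
stands in for hypothetical per-document contributions to some accumulated
quantity; the names and numbers are assumptions for the example and do not
come from Lucene's scoring code.

```java
class SummationOrder {
    // Adds the values front to back using 32-bit float arithmetic.
    static float sumForward(float[] values) {
        float total = 0.0f;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Adds the same values back to front.
    static float sumBackward(float[] values) {
        float total = 0.0f;
        for (int i = values.length - 1; i >= 0; i--) {
            total += values[i];
        }
        return total;
    }

    // Hypothetical contributions: one large value followed by 100 tiny ones.
    static float[] sampleContributions() {
        float[] values = new float[101];
        values[0] = 1.0f;
        for (int i = 1; i < values.length; i++) {
            values[i] = 1.0e-8f;
        }
        return values;
    }

    public static void main(String[] args) {
        float[] values = sampleContributions();
        // Forward: each tiny value is below half an ulp of 1.0f and is lost.
        System.out.println(sumForward(values));
        // Backward: the tiny values accumulate first, so they survive.
        System.out.println(sumBackward(values));
    }
}
```

Here the forward sum is exactly 1.0, while the backward sum comes out
slightly larger, by roughly one part in a million.  The inputs are identical;
only the iteration order differs.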

So could something like this cause what we are seeing?

On Mon, Apr 2, 2012 at 17:41, Benson Margulies <bimargul...@gmail.com> wrote:
> On Mon, Apr 2, 2012 at 5:33 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> Hmm that's odd.
>>
>> If the scores were identical I'd expect different sort order, since we
>> tie-break by internal docID.
>>
>> But if the scores are different... the insertion order shouldn't
>> matter.  And, the score should not change as a function of insertion
>> order...
>
> Well, I assumed that TF-IDF would wiggle.
>
>>
>> Do you have a small test case?
>
> Since this surprises you, I will build a test case.
>
>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies <bimargul...@gmail.com> wrote:
>>> We've observed something that, in some ways, is not surprising.
>>>
>>> If you take a set of documents that are close in 'score' to some query,
>>>
>>>  and shuffle them in different orders
>>>
>>>  and then see what results you get in what order from the reference query,
>>>
>>> the scores will vary according to the insertion order.
>>>
>>> I can't see any way to argue that it's wrong, but we find it
>>> inconvenient when we are testing something and we want to multithread
>>> the test to speed it up, thus making the insertion order
>>> nondeterministic.
>>>
>>> It occurred to me that perhaps you all have some similar concerns in
>>> testing lucene itself, and might have some advice about how to get
>>> around it, thus this email.
>>>
>>> We currently observe this with 2.9.1 and 3.5.0.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
