Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists

Miles Osborne Fri, 25 Mar 2011 03:46:18 -0700

This is something that I have been concerned about for a long time
now. And things are actually worse than this, since often only a
single language pair / test set / training set is used. Claims cannot
be made on the basis of such shaky evidence.


Miles

On 25 March 2011 09:42, Suzy Howlett <[email protected]> wrote:
> I've been thinking about the issue of nondeterminism and am somewhat
> concerned because typically MT results/papers give just a single
> performance figure for each system. As there is an element of
> nondeterministic behaviour, it would seem prudent to run several repeats
> of each system and give mean and standard deviation information instead.
> Of course, this has a practicality trade-off, so an investigation is
> warranted to determine the scale of the problem. Is anyone interested in
> collaborating on a paper or CL squib to address the issue, and bring it
> to the attention of the MT community (and CL community at large)?
>
> Suzy
>
> On 25/03/11 11:58 AM, Tom Hoar wrote:
>> We pick the random set from across the entire collection of documents.
>> The documents are retrieved as the file system orders them (not
>> alphabetically sorted). Your comment, "picked in consecutive order" is
>> interesting. I've often wondered if the order could affect a system's
>> performance. It's easy enough for me to randomize both the collection
>> line order and the test set line order.
>>
>> The large variance in BLEU would normally be alarming, but this is on a
>> very small sample corpus of only 40,000 lines. We use the sample corpus
>> to validate the system installs properly. We haven't seen such large
>> variations in multi-million pair corpora, but scores there do still
>> vary by 2-4 BLEU points.
>>
>> Tom
>>
>>
>> -----Original Message-----
>> *From*: Hieu Hoang <[email protected]>
>> *To*: [email protected]
>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> config, different n-best lists
>> *Date*: Thu, 24 Mar 2011 20:43:49 +0000
>>
>> There may be some systematic differences between the randomly chosen
>> test sets, e.g. the sentences are from the same documents 'cos they were
>> picked in consecutive order from a multi-doc corpus. Otherwise, I'd be
>> worried about such a large BLEU variation.
>>
>>
>>
>> also, see here on the evils of MERT
>> http://www.mail-archive.com/[email protected]/msg00216.html
>>
>>
>> On 24/03/2011 16:06, Tom Hoar wrote:
>>> We often run multiple trainings on the exact same bitext corpus but
>>> pull different random samples for each run. We've observed drastically
>>> different BLEU scores between different runs with BLEUs ranging from
>>> 30 to 45. This is from exactly the same training data except for the
>>> randomly-pulled tuning and evaluation sets. We've assumed this
>>> difference is due to a combination of the random differences in the
>>> sets, floating point variations between machines, and not using
>>> --predictable-seeds.
>>>
>>> Tom
>>>
>>>
>>>
>>> -----Original Message-----
>>> *From*: Hieu Hoang <[email protected]>
>>> *Reply-to*: [email protected]
>>> *To*: John Burger <[email protected]>
>>> *Cc*: Moses-support <[email protected]>
>>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>>> config, different n-best lists
>>> *Date*: Thu, 24 Mar 2011 15:51:48 +0000
>>>
>>> There are small differences in floating point between OS and gcc
>>> versions. One of the regression tests fails because of rounding errors,
>>> depending on which machine you run it on. Other than truncating the
>>> scores, there's not a lot we can do.
>>>
>>> The MERT Perl scripts also dabble in the scores, and that may be
>>> another source of divergence.
>>>
>>> On 24 March 2011 15:07, John Burger <[email protected]> wrote:
>>>
>>>     Lane Schwartz wrote:
>>>
>>>     > I've examined the n-best lists, and it seems there are at least a
>>>     > couple of interesting cases. In the simplest case, several
>>>     > translations of a given sentence produce the exact same score, and
>>>     > these tied translations appear in different order during different
>>>     > runs. This is a bit odd, but [not] terribly worrisome. The stranger
>>>     > case is when there are two different decoding runs, and for a given
>>>     > sentence, there are translations that appear only in run A, and
>>>     > different translations that only appear in run B.
>>>
>>>
>>>     Both these cases are relevant to something we've occasionally seen,
>>>     which is non-determinism during =tuning=. This is not surprising
>>>     given the above, since tuning of course involves decoding. It's hard
>>>     to reproduce, but we have sometimes seen very different weights coming
>>>     out of MERT for the exact same system configurations. The problem
>>>     here is that even very small differences in tuning can result in
>>>     substantial differences in test results, because of how twitchy
>>>     BLEU is.
>>>
>>>     Like many folks, we typically run MERT on a cluster. This brings up
>>>     another source of non-determinism we've theorized about. Some of our
>>>     clusters are heterogeneous, and we've wondered if there might be minor
>>>     differences in floating point behavior from machine to machine. The
>>>     assignment of different chunks of the tuning data to different
>>>     machines is typically non-deterministic, so this might carry over to
>>>     the actual weights that come out of MERT.
>>>
>>>     Does anyone know how robust the floating point usage in the decoder is
>>>     under these circumstances?
>>>
>>>     Thanks.
>>>
>>>     - John Burger
>>>     MITRE
>>>
>>>     _______________________________________________
>>>     Moses-support mailing list
>>>     [email protected]
>>>     http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>
> --
> Suzy Howlett
> http://www.showlett.id.au/
>
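The suggestion quoted above, reporting a mean and standard deviation over several repeats rather than a single figure, could be sketched roughly as follows. This is only an illustration: the function name `summarize_runs` and the scores are placeholders, not anything from Moses or from the runs discussed in this thread.

```python
import statistics


def summarize_runs(bleu_scores):
    """Return (mean, sample standard deviation) of BLEU over repeated runs."""
    if len(bleu_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(bleu_scores), statistics.stdev(bleu_scores)


# Placeholder values for illustration only, not measurements from any system.
example_scores = [30.0, 32.0, 34.0]
mean, stdev = summarize_runs(example_scores)
print(f"BLEU: {mean:.2f} +/- {stdev:.2f} over {len(example_scores)} runs")
```

The sample standard deviation (n - 1 in the denominator) is used here on the assumption that the repeats are a small sample of possible tuning/decoding outcomes.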

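On the floating point question raised in the quoted messages: one general reason results can differ between machines or between chunkings of the data is that floating point addition is not associative, so the same numbers summed in a different order may give a slightly different total. The sketch below shows this in plain Python with made-up values; it illustrates the general IEEE 754 behaviour and is not a diagnosis of what the Moses decoder actually does.

```python
import math


def left_to_right(values):
    """Accumulate values in the given order using ordinary float addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative values, not real feature scores.
scores = [0.1, 0.2, 0.3]
forward = left_to_right(scores)
backward = left_to_right(list(reversed(scores)))

print(forward == backward)              # the two orders disagree in the last bit
print(math.isclose(forward, backward))  # but agree to within rounding error
```

A difference this small is harmless on its own, but it can be enough to flip the order of two hypotheses whose scores are nearly tied.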

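For the simpler case described in the quoted thread, where tied translations come out in a different order on different runs, one possible mitigation when comparing n-best lists is to round the scores and break ties on the translation string. This is a hypothetical post-processing sketch, assuming each hypothesis is a `(translation, score)` pair; `stable_nbest` and the `digits` parameter are invented for illustration and are not part of Moses.

```python
def stable_nbest(hypotheses, digits=4):
    """Sort (translation, score) pairs best-first with a deterministic tie-break.

    Scores are rounded to `digits` decimal places before comparison, and
    hypotheses with equal rounded scores are ordered by translation text.
    """
    return sorted(hypotheses, key=lambda h: (-round(h[1], digits), h[0]))


# Made-up hypotheses: the first two have identical scores.
run_a = [
    ("the house is small", -4.25),
    ("the home is small", -4.25),
    ("small is the house", -5.5),
]
run_b = [run_a[1], run_a[0], run_a[2]]  # same list, tied entries swapped

print(stable_nbest(run_a) == stable_nbest(run_b))
```

Rounding only hides differences that fall inside the same rounding bucket; two scores that straddle a rounding boundary can still end up ordered differently, and this does nothing for the stranger case where the two runs contain different translations.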

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
