Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists

Hieu Hoang Thu, 24 Mar 2011 13:44:44 -0700

There may be some systematic differences between the randomly chosen test sets, e.g. the sentences are from the same documents 'cos they were picked in consecutive order from a multi-doc corpus. Otherwise, I'd be worried about such a large BLEU variation.


also, see here on the evils of MERT
http://www.mail-archive.com/[email protected]/msg00216.html



On 24/03/2011 16:06, Tom Hoar wrote:

We often run multiple trainings on the exact same bitext corpus but pull different random samples for each run. We've observed drastically different BLEU scores between different runs, with BLEUs ranging from 30 to 45. This is from exactly the same training data except for the randomly-pulled tuning and evaluation sets. We've assumed this difference is due to a combination of the random differences in the sets, floating point variations between various machines, and not using --predictable-seeds.
Tom



-----Original Message-----
*From*: Hieu Hoang <[email protected]>
*Reply-to*: [email protected]
*To*: John Burger <[email protected]>
*Cc*: Moses-support <[email protected]>
*Subject*: Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists
*Date*: Thu, 24 Mar 2011 15:51:48 +0000

There are small differences in floating point between OS and gcc versions. One of the regression tests fails because of rounding errors, depending on which machine you run it on. Other than truncating the scores, there's not a lot we can do.
The MERT Perl scripts also dabble in the scores, and that may be another source of divergence.
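
To illustrate the point about truncating scores, here is a minimal Python sketch. It is not code from Moses or its regression tests; `truncate_score` and the sample values are hypothetical. It shows that rounding to a fixed number of decimals hides tiny rounding differences, but cannot help when two values fall on either side of a rounding boundary.

```python
def truncate_score(score: float, places: int = 3) -> float:
    """Round a score to a fixed number of decimal places (hypothetical helper)."""
    return round(score, places)

# Two values that should be equal but differ in the last bits.
a = 0.1 + 0.2
b = 0.3
same_raw = (a == b)
same_rounded = (truncate_score(a) == truncate_score(b))

# Two values that are almost equal but sit on opposite sides of a
# rounding boundary: rounding makes the difference larger, not smaller.
near_boundary_equal = (truncate_score(0.0004999999) == truncate_score(0.0005000001))

print(same_raw, same_rounded, near_boundary_equal)
```

So truncation reduces how often two machines disagree, but it does not remove the problem entirely.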
On 24 March 2011 15:07, John Burger <[email protected]> wrote:
    Lane Schwartz wrote:

    > I've examined the n-best lists, and it seems there are at least a
    > couple of interesting cases. In the simplest case, several
    > translations of a given sentence produce the exact same score, and
    > these tied translations appear in different order during different
    > runs. This is a bit odd, but [not] terribly worrisome. The stranger
    > case is when there are two different decoding runs, and for a given
    > sentence, there are translations that appear only in run A, and
    > different translations that only appear in run B.
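
The first case above (tied translations changing order between runs) can be handled after the fact by sorting with an explicit tie-breaker. The following is a rough Python sketch on made-up data, not a description of how the Moses decoder orders its n-best list; `sort_nbest` and the example entries are hypothetical. It does nothing for the second case, where the two runs contain different translations.

```python
def sort_nbest(entries):
    """Sort (translation, score) pairs by descending score.

    Ties on score are broken by the translation string, so the result
    does not depend on the order in which the entries arrived.
    """
    return sorted(entries, key=lambda e: (-e[1], e[0]))

# Same three hypotheses, with the two tied ones arriving in a different order.
run_a = [
    ("the house is small", -4.2),
    ("the home is small", -4.2),
    ("small is the house", -5.0),
]
run_b = [run_a[1], run_a[0], run_a[2]]

sorted_a = sort_nbest(run_a)
sorted_b = sort_nbest(run_b)
print(sorted_a == sorted_b)
```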


    Both these cases are relevant to something we've occasionally seen,
    which is non-determinism during =tuning=.  This is not surprising
    given the above, since tuning of course involves decoding.  It's hard
    to reproduce, but we have sometimes seen very different weights coming
    out of MERT for the exact same system configurations.  The problem
    here is that even very small differences in tuning can result in
    substantial differences in test results, because of how twitchy
    BLEU is.

    Like many folks, we typically run MERT on a cluster.  This brings up
    another source of non-determinism we've theorized about.  Some of our
    clusters are heterogeneous, and we've wondered if there might be minor
    differences in floating point behavior from machine to machine.  The
    assignment of different chunks of the tuning data to different
    machines is typically non-deterministic, so this might carry over to
    the actual weights that come out of MERT.
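
One mechanism that could produce this kind of divergence even on identical hardware is that floating-point addition is not associative, so combining per-chunk results in a different order can change the last bits of a sum. This is an assumption offered for illustration, not something established in this thread, and the numbers below are made up. A small Python sketch:

```python
import math

# Hypothetical partial scores coming back from three chunks.
chunk_scores = [0.1, 0.2, 0.3]

# The same three numbers, combined in two different orders.
order_a = (chunk_scores[0] + chunk_scores[1]) + chunk_scores[2]
order_b = chunk_scores[0] + (chunk_scores[1] + chunk_scores[2])
orders_agree = (order_a == order_b)

# math.fsum returns the correctly rounded sum, so its result does not
# depend on the order of the inputs.
fsum_forward = math.fsum(chunk_scores)
fsum_reversed = math.fsum(reversed(chunk_scores))

print(orders_agree, fsum_forward == fsum_reversed)
```

If summation order were the culprit, fixing the order in which chunks are merged (or using an order-independent sum) would remove it. Genuine hardware or compiler differences would need a different fix.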

    Does anyone know how robust the floating point usage in the decoder is
    under these circumstances?

    Thanks.

    - John Burger
    MITRE
    _______________________________________________
    Moses-support mailing list
    [email protected] <mailto:[email protected]>
    http://mailman.mit.edu/mailman/listinfo/moses-support


