We often run multiple trainings on the exact same bitext corpus but pull
different random tuning and evaluation samples for each run. We've
observed drastically different BLEU scores between runs, ranging from 30
to 45. The training data is exactly the same each time; only the randomly
pulled tuning and evaluation sets differ. We've assumed the difference
comes from a combination of three things: the random differences in the
sets, floating-point variations between machines, and not using
--predictable-seeds.
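
One way to take the randomly pulled sets out of the picture is to draw
them from a fixed seed, so every run sees the same tuning and evaluation
sentences. A minimal sketch, assuming the corpus fits in memory as a list
of sentence pairs; split_corpus and the set sizes are hypothetical and
not part of Moses:

```python
import random

def split_corpus(pairs, n_tune, n_eval, seed):
    """Return (train, tune, evaluation) drawn reproducibly from pairs."""
    rng = random.Random(seed)      # private RNG, no global state touched
    shuffled = list(pairs)         # copy so the caller's list is unchanged
    rng.shuffle(shuffled)
    tune = shuffled[:n_tune]
    evaluation = shuffled[n_tune:n_tune + n_eval]
    train = shuffled[n_tune + n_eval:]
    return train, tune, evaluation

# Toy stand-in for a bitext corpus (made-up data).
pairs = [("src %d" % i, "tgt %d" % i) for i in range(1000)]

run_a = split_corpus(pairs, 50, 50, seed=1234)
run_b = split_corpus(pairs, 50, 50, seed=1234)   # same seed, same split
run_c = split_corpus(pairs, 50, 50, seed=4321)   # different seed
```

With the seed held fixed, any remaining spread in BLEU cannot be blamed
on the sampling, which would help separate the three suspected causes.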

Tom



-----Original Message-----
From: Hieu Hoang <[email protected]>
Reply-to: [email protected]
To: John Burger <[email protected]>
Cc: Moses-support <[email protected]>
Subject: Re: [Moses-support] Nondeterminism during decoding: same
config, different n-best lists
Date: Thu, 24 Mar 2011 15:51:48 +0000

There are small differences in floating point between OS and gcc
versions. One of the regression tests fails because of rounding errors,
depending on which machine you run it on. Other than truncating the
scores, there's not a lot we can do.
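
To illustrate the kind of rounding difference involved, here is a small
sketch in Python rather than in the decoder's own code. The truncate
helper and the choice of 4 decimal places are assumptions made for the
example, not what Moses does:

```python
# Floating-point addition is not associative, so the order in which
# partial scores are summed can change the last bits of the total.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

def truncate(score, places=4):
    """Round a score to a fixed number of decimals (assumed precision)."""
    return round(score, places)

# After rounding, the two totals compare equal again.
print(truncate(left) == truncate(right))  # True
```

Rounding hides most of these differences but not all of them: two
scores that fall on either side of a rounding boundary will still come
out different.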

The mert perl scripts also dabble in the scores, and that may be another
source of divergence.
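
For the tied-score case Lane describes in the quoted message below, the
order of tied hypotheses can be made repeatable by sorting on a second
key. A minimal sketch, assuming an n-best list held as (score, text)
pairs; order_nbest and the example sentences are hypothetical and this
is not how the decoder orders its output:

```python
def order_nbest(hyps):
    """Sort (score, text) pairs best-first, breaking ties by text."""
    return sorted(hyps, key=lambda h: (-h[0], h[1]))

# Two runs that produced the same hypotheses, with the tied pair swapped.
run_a = [(-4.2, "the house is small"),
         (-4.2, "the house is little"),
         (-5.0, "house small")]
run_b = [run_a[1], run_a[0], run_a[2]]

print(order_nbest(run_a) == order_nbest(run_b))  # True
```

This only fixes the ordering of ties. It does nothing for the stranger
case where the two runs contain different translations.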

On 24 March 2011 15:07, John Burger <[email protected]> wrote:
        Lane Schwartz wrote:
        
        > I've examined the n-best lists, and it seems there are at least a
        > couple of interesting cases. In the simplest case, several
        > translations of a given sentence produce the exact same score, and
        > these tied translations appear in different order during different
        > runs. This is a bit odd, but [not] terribly worrisome. The stranger
        > case is when there are two different decoding runs, and for a given
        > sentence, there are translations that appear only in run A, and
        > different translations that only appear in run B.
        
        
        
        Both these cases are relevant to something we've occasionally seen,
        which is non-determinism during =tuning=.  This is not surprising
        given the above, since tuning of course involves decoding.  It's hard
        to reproduce, but we have sometimes seen very different weights coming
        out of MERT for the exact same system configurations.  The problem
        here is that even very small differences in tuning can result in
        substantial differences in test results, because of how twitchy
        BLEU is.
        
        Like many folks, we typically run MERT on a cluster.  This brings up
        another source of non-determinism we've theorized about.  Some of our
        clusters are heterogenous, and we've wondered if there might be minor
        differences in floating point behavior from machine to machine.  The
        assignment of different chunks of the tuning data to different
        machines is typically non-deterministic, so this might carry over to
        the actual weights that come out of MERT.
        
        Does anyone know how robust the floating point usage in the decoder is
        under these circumstances?
        
        Thanks.
        
        - John Burger
          MITRE
        
        
        _______________________________________________
        Moses-support mailing list
        [email protected]
        http://mailman.mit.edu/mailman/listinfo/moses-support
        
        

