Lane Schwartz wrote:

> I've examined the n-best lists, and it seems there are at least a  
> couple of interesting cases. In the simplest case, several  
> translations of a given sentence produce the exact same score, and  
> these tied translations appear in different order during different  
> runs. This is a bit odd, but [not] terribly worrisome. The stranger  
> case is when there are two different decoding runs, and for a given  
> sentence, there are translations that appear only in run A, and  
> different translations that only appear in run B.

Both these cases are relevant to something we've occasionally seen,  
which is non-determinism during *tuning*.  This is not surprising  
given the above, since tuning of course involves decoding.  It's hard  
to reproduce, but we have sometimes seen very different weights coming  
out of MERT for the exact same system configurations.  The problem  
here is that even very small differences in tuning can result in  
substantial differences in test results, because of how twitchy BLEU is.
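
As an illustration of the tied-score case only, here is a minimal Python sketch of how an n-best list could be given a reproducible order. The function name rank_nbest, the (score, text) tuple layout, and the sample hypotheses are all made up for this example; none of it is taken from the decoder or from MERT.

```python
def rank_nbest(hypotheses):
    """Order (score, text) pairs best-first, breaking ties on the text.

    Sorting on score alone leaves tied hypotheses in whatever order they
    arrived (Python's sort is stable), so two runs that emit the same
    hypotheses in different orders produce different lists. Adding the
    text as a secondary key makes the result independent of input order.
    """
    return sorted(hypotheses, key=lambda h: (-h[0], h[1]))


# Hypothetical data: the same three hypotheses, emitted in different orders.
run_a = [
    (-4.2, "the house is small"),
    (-4.2, "the house is little"),
    (-3.9, "a house is small"),
]
run_b = list(reversed(run_a))

by_score_a = sorted(run_a, key=lambda h: -h[0])
by_score_b = sorted(run_b, key=lambda h: -h[0])

print(by_score_a == by_score_b)                # ties differ between runs
print(rank_nbest(run_a) == rank_nbest(run_b))  # explicit tie-break agrees
```

This only addresses the first case above (tied scores in a different order). It does nothing for the second case, where the two runs contain different translations altogether.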

Like many folks, we typically run MERT on a cluster.  This brings up  
another source of non-determinism we've theorized about.  Some of our  
clusters are heterogeneous, and we've wondered if there might be minor  
differences in floating point behavior from machine to machine.  The  
assignment of different chunks of the tuning data to different  
machines is typically non-deterministic, so this might carry over to  
the actual weights that come out of MERT.
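
One piece of this can be shown without any cluster at all: floating-point addition is not associative, so the order in which partial results are combined can change the total even on a single machine. The sketch below is plain Python, not decoder code, and the feature values and weights in it are invented for illustration. It says nothing about whether different hardware actually behaves differently.

```python
import math

# Addition is not associative in IEEE 754 double precision.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)


def weighted_score(features, weights):
    """Dot product accumulated left to right, so it depends on term order."""
    total = 0.0
    for f, w in zip(features, weights):
        total += f * w
    return total


def weighted_score_exact(features, weights):
    """Same dot product, but math.fsum returns the correctly rounded sum
    of the products, so the result does not depend on term order."""
    return math.fsum(f * w for f, w in zip(features, weights))


# Invented values: the same terms presented in two different orders.
features = [1e16, 1.0, -1e16, 1.0]
weights = [1.0, 1.0, 1.0, 1.0]
rev_features = list(reversed(features))
rev_weights = list(reversed(weights))

print(weighted_score(features, weights),
      weighted_score(rev_features, rev_weights))
print(weighted_score_exact(features, weights),
      weighted_score_exact(rev_features, rev_weights))
```

If partial sums from different chunks are merged in whatever order the machines happen to finish, this kind of effect alone could produce small run-to-run differences, independent of any hardware variation.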

Does anyone know how robust the floating point usage in the decoder is  
under these circumstances?

Thanks.

- John Burger
   MITRE
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support