This is something that I have been concerned about for a long time now. Things are actually worse than this, since often only a single language pair / test set / training set is used. Claims cannot be made on the basis of such shaky evidence.
Miles

On 25 March 2011 09:42, Suzy Howlett <[email protected]> wrote:
> I've been thinking about the issue of nondeterminism and am somewhat
> concerned because typically MT results/papers give just a single
> performance figure for each system. As there is an element of
> nondeterministic behaviour, it would seem prudent to run several repeats
> of each system and give mean and standard deviation information instead.
> Of course, this has a practicality trade-off, so an investigation is
> warranted to determine the scale of the problem. Is anyone interested in
> collaborating on a paper or CL squib to address the issue, and bring it
> to the attention of the MT community (and CL community at large)?
>
> Suzy
>
> On 25/03/11 11:58 AM, Tom Hoar wrote:
>> We pick the random set from across the entire collection of documents.
>> The documents are retrieved as the file system orders them (not
>> alphabetically sorted). Your comment, "picked in consecutive order" is
>> interesting. I've often wondered if the order could affect a system's
>> performance. It's easy enough for me to randomize both the collection
>> line order and the test set line order.
>>
>> The large variance in BLEU would normally be alarming, but this is on a
>> very small sample corpus of only 40,000 lines. We use the sample corpus
>> to validate the system installs properly. We haven't seen such large
>> variations in multi-million pair corpora, but they do range 2-4 BLEU
>> points.
>>
>> Tom
>>
>> -----Original Message-----
>> *From*: Hieu Hoang <[email protected]>
>> *To*: [email protected]
>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> config, different n-best lists
>> *Date*: Thu, 24 Mar 2011 20:43:49 +0000
>>
>> There may be some systematic differences between the randomly chosen
>> test sets, eg.
>> the sentences are from the same documents 'cos they were
>> picked in consecutive order from a multi-doc corpus. Otherwise, I'll be
>> worried about such a large BLEU variation.
>>
>> also, see here on the evils of MERT
>> http://www.mail-archive.com/[email protected]/msg00216.html
>>
>> On 24/03/2011 16:06, Tom Hoar wrote:
>>> We often run multiple trainings on the exact same bitext corpus but
>>> pull different random samples for each run. We've observed drastically
>>> different BLEU scores between different runs with BLEUs ranging from
>>> 30 to 45. This is from exactly the same training data except for the
>>> randomly-pulled tuning and evaluation sets. We've assumed this
>>> difference is due to the random differences in the sets, floating
>>> point variations between various machines, and not using
>>> --predictable-seeds.
>>>
>>> Tom
>>>
>>> -----Original Message-----
>>> *From*: Hieu Hoang <[email protected]>
>>> *Reply-to*: [email protected]
>>> *To*: John Burger <[email protected]>
>>> *Cc*: Moses-support <[email protected]>
>>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>>> config, different n-best lists
>>> *Date*: Thu, 24 Mar 2011 15:51:48 +0000
>>>
>>> there are small differences in floating point between OS and gcc
>>> versions. One of the regression tests fails because of rounding errors,
>>> depending on which machine you run it on. Other than truncating the
>>> scores, there's not a lot we can do.
>>>
>>> The mert perl scripts also dabble in the scores and that may be
>>> another source of divergence
>>>
>>> On 24 March 2011 15:07, John Burger <[email protected]> wrote:
>>>
>>> Lane Schwartz wrote:
>>>
>>> > I've examined the n-best lists, and it seems there are at least a
>>> > couple of interesting cases. In the simplest case, several
>>> > translations of a given sentence produce the exact same score, and
>>> > these tied translations appear in different order during different
>>> > runs. This is a bit odd, but [not] terribly worrisome. The stranger
>>> > case is when there are two different decoding runs, and for a given
>>> > sentence, there are translations that appear only in run A, and
>>> > different translations that only appear in run B.
>>>
>>> Both these cases are relevant to something we've occasionally seen,
>>> which is non-determinism during =tuning=. This is not surprising
>>> given the above, since tuning of course involves decoding. It's hard
>>> to reproduce, but we have sometimes seen very different weights coming
>>> out of MERT for the exact same system configurations. The problem
>>> here is that even very small differences in tuning can result in
>>> substantial differences in test results, because of how twitchy
>>> BLEU is.
>>>
>>> Like many folks, we typically run MERT on a cluster. This brings up
>>> another source of non-determinism we've theorized about. Some of our
>>> clusters are heterogeneous, and we've wondered if there might be minor
>>> differences in floating point behavior from machine to machine. The
>>> assignment of different chunks of the tuning data to different
>>> machines is typically non-deterministic, so this might carry over to
>>> the actual weights that come out of MERT.
>>>
>>> Does anyone know how robust the floating point usage in the decoder is
>>> under these circumstances?
>>>
>>> Thanks.
>>>
>>> - John Burger
>>> MITRE
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Suzy Howlett
> http://www.showlett.id.au/

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
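
Suzy's suggestion in the thread above, running each system several times and reporting mean and standard deviation rather than a single figure, could be sketched roughly as follows. This is only an illustration: `summarize_runs` is a hypothetical helper name, and the scores are placeholder numbers invented for the example, not results from any system discussed here.

```python
import statistics


def summarize_runs(bleu_scores):
    """Return (mean, sample standard deviation) of BLEU scores from repeated runs."""
    if len(bleu_scores) < 2:
        # A spread cannot be estimated from a single run.
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(bleu_scores), statistics.stdev(bleu_scores)


# Placeholder values, invented purely for illustration.
runs = [30.0, 32.0, 34.0]
mean, std = summarize_runs(runs)
print(f"BLEU {mean:.2f} +/- {std:.2f} over {len(runs)} runs")
```

The sample standard deviation (`statistics.stdev`) is used here on the assumption that the repeats are a small sample of possible runs; how many repeats are practical is exactly the trade-off raised in the thread.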

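
Tom's point about randomizing the line order before pulling tuning and evaluation sets could be made repeatable by fixing the seed of the shuffle. The sketch below assumes the bitext is held in memory as (source, target) pairs; `split_parallel_corpus` and its parameters are hypothetical names for illustration, not part of Moses.

```python
import random


def split_parallel_corpus(pairs, n_tune, n_test, seed):
    """Shuffle sentence pairs with a fixed seed, then carve out tuning and test sets."""
    if n_tune + n_test > len(pairs):
        raise ValueError("held-out sets are larger than the corpus")
    rng = random.Random(seed)  # private generator, global random state is untouched
    shuffled = list(pairs)     # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]
    train = shuffled[n_tune + n_test:]
    return train, tune, test


# Toy stand-in for a bitext: each source line stays paired with its target line.
corpus = [(f"src {i}", f"tgt {i}") for i in range(10)]
train, tune, test = split_parallel_corpus(corpus, n_tune=2, n_test=2, seed=1)
print(len(train), len(tune), len(test))
```

Fixing the seed only makes the sampling repeatable, so that differences between runs on the same split can be attributed to other sources; it does nothing about nondeterminism in decoding or tuning itself.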