Participants in this discussion from a few weeks ago will probably be interested in this upcoming ACL 2011 paper: http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf
Cheers
Adam

On Fri, Mar 25, 2011 at 8:49 AM, Lane Schwartz <[email protected]> wrote:
> We know that there is nondeterminism during optimization, yet virtually all
> papers report results based on a single MERT run. We know that results can
> vary widely based on language pair and data sets, but a large majority of
> papers report results on a single language pair, and often for a single data
> set.
>
> While these issues are widely known at the informal level, I think that
> Suzy's point is well taken. I think there would be value in published
> studies showing just how wide the gap due to nondeterminism can be expected
> to be. It may be that such studies already exist, and I'm just not aware of
> them. Does anyone know of any?
>
> Cheers,
> Lane
>
> On Fri, Mar 25, 2011 at 7:03 AM, Barry Haddow <[email protected]> wrote:
>>
>> Hi
>>
>> This is an issue which is not just faced by SMT, but probably by all
>> research fields. Evidence from one paper doesn't generally prove or
>> disprove that a technique works; you need to consider lots of evidence,
>> from different workers in different labs.
>>
>> As a young field, SMT has its own problems in building up good
>> experimental practices, which are not helped by the tendency to over-sell
>> in research papers, and ignore the non-determinism in many parts of the
>> pipeline. Non-reproducibility is also a problem, as much of the code used
>> in papers is not released, and the complete list of settings required to
>> rerun an experiment is rarely given. These problems have been
>> acknowledged, and initiatives proposed to address them, but they're far
>> from solved.
>>
>> best regards - Barry
>>
>> On Friday 25 March 2011 10:44, Miles Osborne wrote:
>> > this is something that I have been concerned about for a long time
>> > now. and things are actually worse than this, since often only a
>> > single language pair / test set / training set is used.
>> > claims cannot be made on the basis of such shaky evidence,
>> >
>> > Miles
>> >
>> > On 25 March 2011 09:42, Suzy Howlett <[email protected]> wrote:
>> > > I've been thinking about the issue of nondeterminism and am somewhat
>> > > concerned because typically MT results/papers give just a single
>> > > performance figure for each system. As there is an element of
>> > > nondeterministic behaviour, it would seem prudent to run several
>> > > repeats of each system and give mean and standard deviation
>> > > information instead. Of course, this has a practicality trade-off, so
>> > > an investigation is warranted to determine the scale of the problem.
>> > > Is anyone interested in collaborating on a paper or CL squib to
>> > > address the issue, and bring it to the attention of the MT community
>> > > (and CL community at large)?
>> > >
>> > > Suzy
>> > >
>> > > On 25/03/11 11:58 AM, Tom Hoar wrote:
>> > >> We pick the random set from across the entire collection of
>> > >> documents. The documents are retrieved as the file system orders
>> > >> them (not alphabetically sorted). Your comment, "picked in
>> > >> consecutive order" is interesting. I've often wondered if the order
>> > >> could affect a system's performance. It's easy enough for me to
>> > >> randomize both the collection line order and the test set line
>> > >> order.
>> > >>
>> > >> The large variance in BLEU would normally be alarming, but this is
>> > >> on a very small sample corpus of only 40,000 lines. We use the
>> > >> sample corpus to validate the system installs properly. We haven't
>> > >> seen such large variations in multi-million pair corpora, but they
>> > >> do range 2-4 BLEU points.
>> > >>
>> > >> Tom
>> > >>
>> > >> -----Original Message-----
>> > >> *From*: Hieu Hoang <[email protected]>
>> > >> *To*: [email protected]
>> > >> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> > >> config, different n-best lists
>> > >> *Date*: Thu, 24 Mar 2011 20:43:49 +0000
>> > >>
>> > >> There may be some systematic differences between the randomly chosen
>> > >> test sets, e.g. the sentences are from the same documents 'cos they
>> > >> were picked in consecutive order from a multi-doc corpus. Otherwise,
>> > >> I'd be worried about such a large BLEU variation.
>> > >>
>> > >> also, see here on the evils of MERT
>> > >> http://www.mail-archive.com/[email protected]/msg00216.html
>> > >>
>> > >> On 24/03/2011 16:06, Tom Hoar wrote:
>> > >>> We often run multiple trainings on the exact same bitext corpus but
>> > >>> pull different random samples for each run. We've observed
>> > >>> drastically different BLEU scores between different runs, with
>> > >>> BLEUs ranging from 30 to 45. This is from exactly the same training
>> > >>> data except for the randomly-pulled tuning and evaluation sets.
>> > >>> We've assumed this difference is due to the random differences in
>> > >>> the sets, floating point variations between various machines and
>> > >>> not using --predictable-seeds.
>> > >>>
>> > >>> Tom
>> > >>>
>> > >>> -----Original Message-----
>> > >>> *From*: Hieu Hoang <[email protected]>
>> > >>> *Reply-to*: [email protected]
>> > >>> *To*: John Burger <[email protected]>
>> > >>> *Cc*: Moses-support <[email protected]>
>> > >>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> > >>> config, different n-best lists
>> > >>> *Date*: Thu, 24 Mar 2011 15:51:48 +0000
>> > >>>
>> > >>> there are small differences in floating point between OS and gcc
>> > >>> versions. One of the regression tests fails because of rounding
>> > >>> errors, depending on which machine you run it on. Other than
>> > >>> truncating the scores, there's not a lot we can do.
>> > >>>
>> > >>> The mert perl scripts also dabble in the scores and that may be
>> > >>> another source of divergence
>> > >>>
>> > >>> On 24 March 2011 15:07, John Burger <[email protected]> wrote:
>> > >>>
>> > >>> Lane Schwartz wrote:
>> > >>>
>> > >>> > I've examined the n-best lists, and it seems there are at least a
>> > >>> > couple of interesting cases. In the simplest case, several
>> > >>> > translations of a given sentence produce the exact same score,
>> > >>> > and these tied translations appear in different order during
>> > >>> > different runs. This is a bit odd, but [not] terribly worrisome.
>> > >>> > The stranger case is when there are two different decoding runs,
>> > >>> > and for a given sentence, there are translations that appear only
>> > >>> > in run A, and different translations that only appear in run B.
>> > >>>
>> > >>> Both these cases are relevant to something we've occasionally seen,
>> > >>> which is non-determinism during =tuning=. This is not surprising
>> > >>> given the above, since tuning of course involves decoding. It's
>> > >>> hard to reproduce, but we have sometimes seen very different
>> > >>> weights coming out of MERT for the exact same system
>> > >>> configurations. The problem here is that even very small
>> > >>> differences in tuning can result in substantial differences in test
>> > >>> results, because of how twitchy BLEU is.
>> > >>>
>> > >>> Like many folks, we typically run MERT on a cluster. This brings up
>> > >>> another source of non-determinism we've theorized about. Some of
>> > >>> our clusters are heterogeneous, and we've wondered if there might
>> > >>> be minor differences in floating point behavior from machine to
>> > >>> machine. The assignment of different chunks of the tuning data to
>> > >>> different machines is typically non-deterministic, so this might
>> > >>> carry over to the actual weights that come out of MERT.
>> > >>>
>> > >>> Does anyone know how robust the floating point usage in the decoder
>> > >>> is under these circumstances?
>> > >>>
>> > >>> Thanks.
>> > >>>
>> > >>> - John Burger
>> > >>> MITRE
>> > >>>
>> > >>> _______________________________________________
>> > >>> Moses-support mailing list
>> > >>> [email protected]
>> > >>> http://mailman.mit.edu/mailman/listinfo/moses-support
>> > >
>> > > --
>> > > Suzy Howlett
>> > > http://www.showlett.id.au/
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"
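
For anyone wanting to try the reporting practice Suzy suggests in the thread (several repeats per system, then mean and standard deviation rather than a single figure), here is a minimal Python sketch. The function name `summarize_runs` and the scores are hypothetical, made up purely for illustration; they are not results from this thread, and the sketch assumes you have already collected one BLEU score per repeated tuning and decoding run.

```python
import statistics


def summarize_runs(bleu_scores):
    """Return (mean, sample standard deviation) of BLEU over repeated runs."""
    if len(bleu_scores) < 2:
        # A single run gives no information about run-to-run variation.
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(bleu_scores), statistics.stdev(bleu_scores)


# Hypothetical BLEU scores from five repeated runs of the same system
# (illustrative numbers only, not measurements).
example_scores = [30.0, 31.0, 29.0, 30.5, 29.5]
mean, std = summarize_runs(example_scores)
print(f"BLEU = {mean:.2f} +/- {std:.2f} over {len(example_scores)} runs")
```

This uses the sample standard deviation (n - 1 in the denominator), which seems the natural choice when only a handful of repeats are affordable; how many repeats are enough is exactly the open question raised above.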
