Participants in this discussion from a few weeks ago will probably be
interested in this upcoming ACL 2011 paper:
http://www.cs.cmu.edu/~jhclark/pubs/significance.pdf

Cheers
Adam

On Fri, Mar 25, 2011 at 8:49 AM, Lane Schwartz <[email protected]> wrote:
> We know that there is nondeterminism during optimization, yet virtually all
> papers report results based on a single MERT run. We know that results can
> vary widely based on language pair and data sets, but a large majority of
> papers report results on a single language pair, and often for a single data
> set.
>
> While these issues are widely known at the informal level, I think that
> Suzy's point is well taken. I think there would be value in published
> studies showing just how wide the gap due to nondeterminism can be expected
> to be. It may be that such studies already exist, and I'm just not aware of
> them. Does anyone know of any?
>
> Cheers,
> Lane
>
> On Fri, Mar 25, 2011 at 7:03 AM, Barry Haddow <[email protected]> wrote:
>>
>> Hi
>>
>> This is an issue which is not just faced by SMT, but probably by all
>> research
>> fields. Evidence from one paper doesn't generally prove or disprove that a
>> technique works, you need to consider lots of evidence, from different
>> workers in different labs.
>>
>> As a young field, SMT has its own problems in building up good
>> experimental
>> practices, which are not helped by the tendency to over-sell in research
>> papers, and ignore the non-determinism in many parts of the pipeline.
>> Non-reproducibility is also a problem, as much of the code used in papers
>> is
>> not released, and the complete list of settings required to rerun an
>> experiment is rarely given. These problems have been acknowledged, and
>> initiatives proposed to address them, but they're far from solved,
>>
>> best regards - Barry
>>
>> On Friday 25 March 2011 10:44, Miles Osborne wrote:
>> > this is something that I have been concerned about for a long time
>> > now.  and things are actually worse than this, since often only a
>> > single language pair / test set / training set is used.  claims cannot
>> > be made on the basis of such shaky evidence,
>> >
>> > Miles
>> >
>> > On 25 March 2011 09:42, Suzy Howlett <[email protected]> wrote:
>> > > I've been thinking about the issue of nondeterminism and am somewhat
>> > > concerned because typically MT results/papers give just a single
>> > > performance figure for each system. As there is an element of
>> > > nondeterministic behaviour, it would seem prudent to run several
>> > > repeats
>> > > of each system and give mean and standard deviation information
>> > > instead.
>> > > Of course, this has a practicality trade-off, so an investigation is
>> > > warranted to determine the scale of the problem. Is anyone interested
>> > > in
>> > > collaborating on a paper or CL squib to address the issue, and bring
>> > > it
>> > > to the attention of the MT community (and CL community at large)?
>> > >
>> > > Suzy
>> > >
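For what it's worth, reporting a mean and standard deviation over repeated runs needs very little code. The following is only a minimal sketch: the function name `summarize_runs` is hypothetical, and the scores are placeholder values for illustration, not measured results from any system.

```python
import statistics


def summarize_runs(bleu_scores):
    """Return (mean, sample standard deviation) of BLEU scores from repeated runs."""
    if len(bleu_scores) < 2:
        # A standard deviation cannot be estimated from a single run.
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(bleu_scores), statistics.stdev(bleu_scores)


# Placeholder numbers for illustration only, not real experimental results.
example_scores = [30.0, 31.0, 32.0]
mean, stdev = summarize_runs(example_scores)
print(f"BLEU {mean:.2f} +/- {stdev:.2f} over {len(example_scores)} runs")
```

Each score would come from a complete, independent tuning and decoding run of the same configuration, which is where the practicality trade-off comes in.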
>> > > On 25/03/11 11:58 AM, Tom Hoar wrote:
>> > >> We pick the random set from across the entire collection of
>> > >> documents.
>> > >> The documents are retrieved as the file system orders them (not
>> > >> alphabetically sorted). Your comment, "picked in consecutive order"
>> > >> is
>> > >> interesting. I've often wondered if the order could affect a system's
>> > >> performance. It's easy enough for me to randomize both the collection
>> > >> line order and the test set line order.
>> > >>
>> > >> The large variance in BLEU would normally be alarming, but this is on
>> > >> a
>> > >> very small sample corpus of only 40,000 lines. We use the sample
>> > >> corpus
>> > >> to validate the system installs properly. We haven't seen such large
>> > >> variations in multi-million pair corpora, but they do range 2-4 BLEU
>> > >> points.
>> > >>
>> > >> Tom
>> > >>
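One way to take file-system ordering and unseeded sampling out of the picture is to sort the line identifiers first and then shuffle with a fixed seed. This is a sketch under my own assumptions: `split_corpus`, its parameters, and the idea of identifying sentence pairs by integer line numbers are all hypothetical, not part of any existing tool.

```python
import random


def split_corpus(line_ids, n_tune, n_test, seed=0):
    """Split line ids into (train, tune, test) reproducibly.

    Sorting first makes the result independent of the order in which the
    ids were retrieved; the seeded generator makes the shuffle repeatable.
    """
    ordered = sorted(line_ids)
    if n_tune + n_test > len(ordered):
        raise ValueError("tuning and test sets are larger than the corpus")
    rng = random.Random(seed)  # private generator, does not touch global state
    rng.shuffle(ordered)
    tune = ordered[:n_tune]
    test = ordered[n_tune:n_tune + n_test]
    train = ordered[n_tune + n_test:]
    return train, tune, test


# Hypothetical usage: a 40,000-line corpus with 1,000 lines each for tuning and test.
train, tune, test = split_corpus(range(40000), n_tune=1000, n_test=1000, seed=13)
```

Varying only `seed` across runs then isolates the effect of the random split from any effect of retrieval order.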
>> > >>
>> > >> -----Original Message-----
>> > >> *From*: Hieu Hoang <[email protected]>
>> > >> *To*: [email protected]
>> > >> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> > >> config, different n-best lists
>> > >> *Date*: Thu, 24 Mar 2011 20:43:49 +0000
>> > >>
>> > >> There may be some systematic differences between the randomly chosen
>> > >> test sets, eg. the sentences are from the same documents 'cos they
>> > >> were
>> > >> picked in consecutive order from a multi-doc corpus. Otherwise, I'll
>> > >> be
>> > >> worried about such a large BLEU variation.
>> > >>
>> > >>
>> > >>
>> > >> also, see here on the evils of MERT
>> > >> http://www.mail-archive.com/[email protected]/msg00216.html
>> > >>
>> > >> On 24/03/2011 16:06, Tom Hoar wrote:
>> > >>> We often run multiple trainings on the exact same bitext corpus but
>> > >>> pull different random samples for each run. We've observed
>> > >>> drastically
>> > >>> different BLEU scores between different runs with BLEUs ranging from
>> > >>> 30 to 45. This is from exactly the same training data except for the
>> > >>> randomly-pulled tuning and evaluation sets. We've assumed this
>> > >>> difference is due to both the random differences in the sets,
>> > >>> floating
>> > >>> point variations between various machines and not using
>> > >>> --predictable-seeds.
>> > >>>
>> > >>> Tom
>> > >>>
>> > >>>
>> > >>>
>> > >>> -----Original Message-----
>> > >>> *From*: Hieu Hoang <[email protected]>
>> > >>> *Reply-to*: [email protected]
>> > >>> *To*: John Burger <[email protected]>
>> > >>> *Cc*: Moses-support <[email protected]>
>> > >>> *Subject*: Re: [Moses-support] Nondeterminism during decoding: same
>> > >>> config, different n-best lists
>> > >>> *Date*: Thu, 24 Mar 2011 15:51:48 +0000
>> > >>>
>> > >>> There are small differences in floating point between OS and gcc
>> > >>> versions. One of the regression tests fails because of rounding
>> > >>> errors, depending on which machine you run it on. Other than
>> > >>> truncating the scores, there's not a lot we can do.
>> > >>>
>> > >>> The mert perl scripts also dabble in the scores and that may be
>> > >>> another source of divergence.
>> > >>>
>> > >>> On 24 March 2011 15:07, John Burger <[email protected]> wrote:
>> > >>>
>> > >>>     Lane Schwartz wrote:
>> > >>>
>> > >>>     > I've examined the n-best lists, and it seems there are at least a
>> > >>>     > couple of interesting cases. In the simplest case, several
>> > >>>     > translations of a given sentence produce the exact same score,
>> > >>>     > and these tied translations appear in different order during
>> > >>>     > different runs. This is a bit odd, but [not] terribly worrisome.
>> > >>>     > The stranger case is when there are two different decoding runs,
>> > >>>     > and for a given sentence, there are translations that appear only
>> > >>>     > in run A, and different translations that only appear in run B.
>> > >>>
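The two cases described above can be told apart after the fact by normalising the n-best lists before comparing them. This is only a sketch of a post-hoc check, not something the decoder does: the helper names and the `(score, translation)` tuple layout are assumptions, and the sentences are made-up examples.

```python
def sort_nbest(entries):
    """Order (score, translation) pairs by descending score, ties broken by text."""
    return sorted(entries, key=lambda entry: (-entry[0], entry[1]))


def only_in(first, second):
    """Translations present in `first` but absent from `second`, ignoring scores."""
    return sorted({text for _, text in first} - {text for _, text in second})


# Made-up n-best lists for one sentence from two hypothetical decoding runs.
run_a = [(-4.2, "the house is small"), (-4.2, "the home is small"), (-3.9, "a small house")]
run_b = [(-4.2, "the home is small"), (-3.9, "a small house"), (-4.2, "the house is small")]

# Case 1: same entries, tied scores in a different order. Equal after sorting.
assert sort_nbest(run_a) == sort_nbest(run_b)

# Case 2: genuinely different entries would show up here as non-empty lists.
assert only_in(run_a, run_b) == [] and only_in(run_b, run_a) == []
```

Breaking ties on the translation text gives a stable order without relying on whatever order the runs happened to emit, so any remaining difference points to the second, stranger case.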
>> > >>>
>> > >>>     Both these cases are relevant to something we've occasionally
>> > >>>     seen, which is non-determinism during =tuning=. This is not
>> > >>>     surprising given the above, since tuning of course involves
>> > >>>     decoding. It's hard to reproduce, but we have sometimes seen very
>> > >>>     different weights coming out of MERT for the exact same system
>> > >>>     configurations. The problem here is that even very small
>> > >>>     differences in tuning can result in substantial differences in
>> > >>>     test results, because of how twitchy BLEU is.
>> > >>>
>> > >>>     Like many folks, we typically run MERT on a cluster. This brings
>> > >>>     up another source of non-determinism we've theorized about. Some
>> > >>>     of our clusters are heterogeneous, and we've wondered if there
>> > >>>     might be minor differences in floating point behavior from machine
>> > >>>     to machine. The assignment of different chunks of the tuning data
>> > >>>     to different machines is typically non-deterministic, so this
>> > >>>     might carry over to the actual weights that come out of MERT.
>> > >>>
>> > >>>     Does anyone know how robust the floating point usage in the
>> > >>>     decoder is under these circumstances?
>> > >>>
>> > >>>     Thanks.
>> > >>>
>> > >>>     - John Burger
>> > >>>     MITRE
>> > >>>
>> > >>>     _______________________________________________
>> > >>>     Moses-support mailing list
>> > >>>     [email protected] <mailto:[email protected]>
>> > >>>     http://mailman.mit.edu/mailman/listinfo/moses-support
>> > >>>
>> > >>>
>> > >>>
>> > >
>> > > --
>> > > Suzy Howlett
>> > > http://www.showlett.id.au/
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"
>
>
>

