Hi, I added a number of improvements to MERT that have recently been proposed in the literature, with the aim of supporting more features and giving greater stability.
The improvements are:

(1) Optimization in random directions [Cer et al., 2008]
(2) Re-use of best weight settings from the last n iterations as starting points [Foster and Kuhn, 2009]
(3) Pairwise-Ranked Optimization [Hopkins and May, 2011]

To give some more details:

(1) Traditional MERT optimizes each parameter in isolation, finding the best gain for any parameter, applying it, and repeating this process until convergence. With the switch "-number-of-random-directions NUM", a specified number of random directions are explored in addition to these per-parameter directions through the multi-dimensional weight space.

(2) In each iteration of running the decoder to produce n-best lists and optimizing weights, the first starting point is the last best weight setting found. 20 additional starting points are randomly generated. With the switch "-historic-best", the best weights found in each prior iteration are used as starting points in addition to the random starting points.

(3) A recent paper proposed an alternative to MERT that trains a classifier to predict which of two candidates in the n-best list is better. Candidates are randomly sampled (with a bias towards candidates with large metric score differences) and passed to a standard linear model classifier (maximum entropy, support vector machines, etc.). The current Moses implementation uses MegaM by Hal Daume (check for license terms). This alternative to traditional MERT can be used with the switch "-pairwise-ranked".

Notes:
* the indicated switches are specified either when calling mert-moses.pl or in the parameter "tuning-settings" in EMS.
* option (3) is incompatible with (1) and (2), but the latter two can be used together.
* for "-number-of-random-directions" I used 50 random directions, which slows down MERT quite a bit.
* option (3) does not converge under the current Moses stopping criteria, so it runs for 25 iterations, but you may want to reduce this to 10 with the additional switch "-max-iterations 10".

Some results:

Urdu-English, SAMT Model
MERT setting                 iterations        tuning set         test set
baseline                     11.6 (std 4.8)    22.73 (std 0.07)   21.54 (std 0.38)
50 random directions         9.4 (std 2.3)     22.82 (std 0.14)   *21.58* (std 0.38)
rand.dir. + historic best    9.2 (std 5.9)     22.79 (std 0.23)   21.40 (std 0.37)
pairwise-ranked max-iter 10  10                -                  21.33 *(std 0.13)*

Urdu-English, Hierarchical Model
MERT setting                 iterations        tuning set         test set
baseline                     8.8 (std 2.2)     23.91 (std 0.18)   *23.02* (std 0.42)
50 random directions         8.4 (std 3.3)     23.85 (std 0.35)   22.80 (std 0.70)
rand.dir. + historic best    12.0 (std 3.5)    24.03 (std 0.23)   22.89 *(std 0.18)*
pairwise-ranked max-iter 10  10                -                  21.93 (std 0.36)

German-English, Phrase-based
MERT setting                 iterations        tuning set         test set
baseline                     7.2 (std 14.3)    24.82 (std 0.04)   *21.29* (std 0.05)
rand.dir. + historic best    6.6 (std 1.8)     24.88 (std 0.07)   21.28 (std 0.16)
pairwise-ranked max-iter 10  10                -                  *21.29 (std 0.02)*

German-English, Factored Backoff
MERT setting                 iterations        tuning set         test set
baseline                     12.0 (std 15.2)   24.89 (std 0.25)   21.35 (std 0.15)
rand.dir. + historic best    11.4 (std 7.6)    25.01 (std 0.12)   21.45 (std 0.12)
pairwise-ranked              25                -                  *21.58 (std 0.11)*
pairwise-ranked max-iter 10  10                -                  21.54 (std 0.10)

Results are reported over 5 runs of each optimization method, as average and standard deviation. What we are looking for is high test set scores and low variance. The Urdu-English systems use a smaller tuning set of fewer than 1,000 sentences (with 4 references), so I would put less faith in those results. The test set for German-English is WMT 2011.

Your mileage may vary, but it is worth trying out.

-phi
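
P.S. For anyone unfamiliar with option (3), here is a rough Python sketch of the pairwise sampling step as described above: draw random candidate pairs from one sentence's n-best list, prefer pairs with a large metric score difference, and turn each pair into a feature-difference training example for a binary linear classifier. This is not the Moses code. The function name `sample_pairs`, its parameters, and the default values are all hypothetical and chosen only for illustration.

```python
import random


def sample_pairs(candidates, n_samples=5000, min_diff=0.05, keep=50, rng=None):
    """Build pairwise training examples from one sentence's n-best list.

    candidates: list of (feature_vector, metric_score) tuples.
    Returns a list of (feature_difference, label) with label +1 or -1.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    if len(candidates) < 2:
        return []

    # Draw random pairs and discard those whose metric scores are too close.
    pairs = []
    for _ in range(n_samples):
        (fa, sa), (fb, sb) = rng.sample(candidates, 2)
        diff = abs(sa - sb)
        if diff > min_diff:
            pairs.append((diff, fa, sa, fb, sb))

    # Bias towards large score differences: keep only the widest gaps.
    pairs.sort(key=lambda p: p[0], reverse=True)

    examples = []
    for _, fa, sa, fb, sb in pairs[:keep]:
        delta = [a - b for a, b in zip(fa, fb)]
        label = 1 if sa > sb else -1
        # Emit the pair in both orders so the two classes are balanced.
        examples.append((delta, label))
        examples.append(([-d for d in delta], -label))
    return examples


if __name__ == "__main__":
    nbest = [([1.0, 0.0], 0.30), ([0.0, 1.0], 0.10), ([0.5, 0.5], 0.29)]
    for delta, label in sample_pairs(nbest, n_samples=100, keep=2):
        print(label, delta)
```

The examples from all sentences would then be pooled and handed to the classifier, and the learned weights of the linear model serve as the new feature weights for the next decoder run.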
