Hi,

I added a number of improvements to MERT that have recently been
proposed in the literature, with the aim of supporting more features and
achieving greater stability.

The improvements are:
(1) Optimization in random directions [Cer et al., 2008]
(2) Re-use of best weight settings from last n iterations as starting
points [Foster and Kuhn, 2009]
(3) Pairwise-Ranked Optimization [Hopkins and May, 2011]

To give some more details:

(1) Traditional MERT optimizes each parameter in isolation, finding
the best gain for any parameter, applying it, and repeating this process
until convergence. With the switch "-number-of-random-directions NUM",
NUM random directions through the multi-dimensional weight space are
explored in addition to these per-parameter directions.
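To illustrate the idea in (1), here is a minimal Python sketch of the set of
search directions. It is not the Moses code: the function name, the use of
Gaussian sampling, and the normalization to unit length are my assumptions,
since how the random directions are drawn is not specified above.

```python
import math
import random

def search_directions(num_features, num_random, rng):
    """Return the directions a line search would explore.

    The first num_features directions are the coordinate axes (one weight
    changed at a time); the remaining ones are random unit vectors.
    """
    directions = []
    for i in range(num_features):
        axis = [0.0] * num_features
        axis[i] = 1.0
        directions.append(axis)
    for _ in range(num_random):
        # Assumption: Gaussian components, normalized to unit length.
        vec = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        norm = math.sqrt(sum(v * v for v in vec))
        directions.append([v / norm for v in vec])
    return directions

# 8 feature weights plus 50 random directions, as in the experiments below.
dirs = search_directions(num_features=8, num_random=50, rng=random.Random(1))
print(len(dirs))
```

Each extra direction costs one more line search per round, which is why a
large NUM slows tuning down.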

(2) In each iteration of running the decoder to produce n-best lists
and optimizing weights, the first starting point is the last best weight
setting found. 20 additional starting points are randomly generated.
With the switch "-historic-best", the best weights found in each prior
iteration are used as starting points in addition to the random starting
points.
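The starting points in (2) can be sketched as follows. This is only an
illustration: the function name, the sampling range for random points, and
the skipping of duplicates are assumptions, not what mert-moses.pl does
internally.

```python
import random

def starting_points(last_best, history, num_random, use_historic_best, rng):
    """Collect the weight vectors from which one optimization round starts."""
    points = [list(last_best)]  # always start from the last best setting
    for _ in range(num_random):
        # Assumed sampling range; the range Moses uses is not given here.
        points.append([rng.uniform(-1.0, 1.0) for _ in last_best])
    if use_historic_best:
        # Best weights of each prior iteration, skipping copies of last_best.
        points.extend(list(w) for w in history if list(w) != list(last_best))
    return points

# Hypothetical best weights from three earlier iterations.
history = [[0.2, 0.5, 0.3], [0.25, 0.45, 0.3], [0.3, 0.4, 0.3]]
points = starting_points(history[-1], history, num_random=20,
                         use_historic_best=True, rng=random.Random(0))
print(len(points))  # 1 last best + 20 random + 2 earlier bests
```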

(3) A recent paper proposed an alternative to MERT that trains a classifier
to predict which of two candidates in the n-best list is better. Candidates
are randomly sampled (with a bias towards candidates with large metric
score differences) and passed to a standard linear classifier
(maximum entropy, support vector machine, etc.). The current Moses
implementation uses MegaM by Hal Daume (check its license terms).
This alternative to traditional MERT can be used with the switch
"-pairwise-ranked".
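For readers who have not seen the paper, here is a rough Python sketch of the
pair sampling step in (3), loosely following Hopkins and May. The function
name and the parameter values are illustrative assumptions and are not taken
from the Moses implementation; the resulting examples would then go to a
linear classifier such as MegaM.

```python
import random

def sample_pairs(candidates, num_samples, num_keep, min_diff, rng):
    """Build classifier training examples from one sentence's n-best list.

    candidates: list of (feature_vector, metric_score) tuples.
    Returns (feature_difference, label) tuples, where label is +1 if the
    first candidate of the pair has the higher metric score, else -1.
    """
    sampled = []
    for _ in range(num_samples):
        (f1, s1), (f2, s2) = rng.sample(candidates, 2)
        gap = abs(s1 - s2)
        if gap > min_diff:  # drop pairs that are too close to call
            sampled.append((gap, f1, s1, f2, s2))
    # Bias towards large metric differences: keep the biggest gaps only.
    sampled.sort(key=lambda p: p[0], reverse=True)
    examples = []
    for _, f1, s1, f2, s2 in sampled[:num_keep]:
        diff = [a - b for a, b in zip(f1, f2)]
        label = 1 if s1 > s2 else -1
        examples.append((diff, label))
        # Mirrored example, so both classes are equally represented.
        examples.append(([-d for d in diff], -label))
    return examples

# Hypothetical n-best list: (feature values, sentence-level metric score).
nbest = [([0.1, 1.0], 0.1), ([0.2, 0.5], 0.2),
         ([0.4, 0.3], 0.4), ([0.5, 0.9], 0.5)]
examples = sample_pairs(nbest, num_samples=100, num_keep=10,
                        min_diff=0.05, rng=random.Random(0))
```

The weights learned by the classifier on these difference vectors are then
used as the new decoder weights.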

Notes:

* the indicated switches are specified either when calling mert-moses.pl
 or in the parameter "tuning-settings" in EMS.

* option (3) is incompatible with (1) and (2), but (1) and (2) can be
 used together.

* for "-number-of-random-directions" I used 50 random directions, which
 slows down MERT quite a bit.

* option (3) does not converge under the current Moses stopping criteria,
 so it runs for 25 iterations, but you may want to reduce this to 10 with
 the additional switch "-max-iterations 10"

Some results:
Urdu-English, SAMT Model

MERT setting                  iterations       tuning set         test set
baseline                      11.6 (std 4.8)   22.73 (std 0.07)   21.54 (std 0.38)
50 random directions          9.4 (std 2.3)    22.82 (std 0.14)   *21.58* (std 0.38)
rand.dir. + historic best     9.2 (std 5.9)    22.79 (std 0.23)   21.40 (std 0.37)
pairwise-ranked max-iter 10   10               -                  21.33 *(std 0.13)*

Urdu-English, Hierarchical Model

MERT setting                  iterations       tuning set         test set
baseline                      8.8 (std 2.2)    23.91 (std 0.18)   *23.02* (std 0.42)
50 random directions          8.4 (std 3.3)    23.85 (std 0.35)   22.80 (std 0.70)
rand.dir. + historic best     12.0 (std 3.5)   24.03 (std 0.23)   22.89 *(std 0.18)*
pairwise-ranked max-iter 10   10               -                  21.93 (std 0.36)

German-English, Phrase-based

MERT setting                  iterations       tuning set         test set
baseline                      7.2 (std 14.3)   24.82 (std 0.04)   *21.29* (std 0.05)
rand.dir. + historic best     6.6 (std 1.8)    24.88 (std 0.07)   21.28 (std 0.16)
pairwise-ranked max-iter 10   10               -                  *21.29 (std 0.02)*

German-English, Factored Backoff

MERT setting                  iterations        tuning set         test set
baseline                      12.0 (std 15.2)   24.89 (std 0.25)   21.35 (std 0.15)
rand.dir. + historic best     11.4 (std 7.6)    25.01 (std 0.12)   21.45 (std 0.12)
pairwise-ranked               25                -                  *21.58 (std 0.11)*
pairwise-ranked max-iter 10   10                -                  21.54 (std 0.10)

Results are reported over 5 runs of each optimization method, in terms of
average and standard deviation. What we are looking for is high test set
scores and low variance.
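For anyone who wants to report their own runs the same way, a small Python
sketch of the summary follows. The scores are made up for illustration and
are not from the experiments above; using the sample standard deviation
(rather than the population one) is also an assumption.

```python
import statistics

# Hypothetical test set BLEU scores from 5 tuning runs of one method.
test_bleu = [21.2, 21.6, 21.4, 21.9, 21.5]

mean = statistics.mean(test_bleu)
std = statistics.stdev(test_bleu)  # sample standard deviation (n - 1)
print(f"{mean:.2f} (std {std:.2f})")
```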

The Urdu-English systems use a smaller tuning set of fewer than 1000
sentences (with 4 references), so I would put less faith in those
results. The test set for German-English is WMT 2011.

Your mileage may vary, but it is worth a try.

-phi
