Following up. Yes, BerkeleyAligner creates an alignment file.
However, its default file is significantly different from the one
generated by train-model.perl's step 3 using the (M)GIZA++ compatible
files generated by BerkeleyAligner. What are the reasons to use one
method over the other, or is one approach inherently wrong?
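One way to put a number on "significantly different" is to count the sentences whose alignment points differ between the two files. The sketch below is only an illustration: it assumes both files hold one sentence per line with space-separated i-j points, and the function name is made up for this example. Point order within a line is ignored.

```shell
#!/bin/sh
# Sketch: count lines whose set of alignment points differs between two files.
# Assumes one sentence per line, points written as space-separated "i-j" tokens.
count_differing_lines() {
    paste "$1" "$2" | awk -F '\t' '
        # Return the tokens of s sorted as strings, so point order does not matter.
        function canon(s,    n, a, i, j, t, out) {
            n = split(s, a, " ")
            for (i = 2; i <= n; i++) {
                t = a[i]
                for (j = i - 1; j >= 1 && (a[j] "") > (t ""); j--) a[j + 1] = a[j]
                a[j + 1] = t
            }
            out = ""
            for (i = 1; i <= n; i++) out = out a[i] " "
            return out
        }
        { total++; if (canon($1) != canon($2)) differ++ }
        END { printf "%d of %d lines differ\n", differ, total }'
}
```

Called as `count_differing_lines file_a file_b`, it prints a single summary line. It does not say which alignment is better, only how often the two disagree.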
Tom
On Wed, 14 Dec 2011 17:25:31 +0700, Tom Hoar wrote:
Hi Hieu,
Thanks for the reply. It's good to hear you're a full-time employee on the Asia
Online team after completing your studies at Edinburgh a few months ago.
I had some time and worked out the details. I'll share here for others
to try.
A brief summary: the BerkeleyAligner output replaces steps 1 &
2 of train-model.perl. A few additional steps that rename/gzip the
BerkeleyAligner output allow train-model.perl to continue from step 3.
Details: here's a breakdown of the individual inputs and outputs of
each of steps 1 to 9 in train-model.perl with defaults for
(M)GIZA++. There are other temporary outputs, but they're not listed here
because, from what I can tell, they are not used as inputs for subsequent
steps. In the notation below, $command_line stands for the value of the
--command-line option: $corpus_dir == --corpus-dir, $giza_f2e == --giza-f2e,
$e == --e, etc.
Step 1 Inputs:
* $corpus.$f
* $corpus.$e
Step 1 Outputs:
* $corpus_dir/$f.vcb
* $corpus_dir/$e.vcb
* $corpus_dir/$f.vcb.classes
* $corpus_dir/$e.vcb.classes
* $corpus_dir/$f-$e-int-train.snt
* $corpus_dir/$e-$f-int-train.snt
Step 2 Inputs:
* $corpus_dir/$f.vcb
* $corpus_dir/$e.vcb
* $corpus_dir/$f.vcb.classes
* $corpus_dir/$e.vcb.classes
* $corpus_dir/$f-$e-int-train.snt
* $corpus_dir/$e-$f-int-train.snt
Step 2 Outputs:
* $giza_f2e/$f-$e.$giza_extension.gz
* $giza_e2f/$e-$f.$giza_extension.gz
Step 3 Inputs:
* $giza_f2e/$f-$e.$giza_extension.gz
* $giza_e2f/$e-$f.$giza_extension.gz
Step 3 Outputs:
* $alignment_file.$alignment
Step 4 Inputs:
* $alignment_file.$alignment
* $corpus.$f
* $corpus.$e
Step 4 Outputs:
* $lexical_file.f2e
* $lexical_file.e2f
Step 5 Inputs:
* $alignment_file.$alignment
* $corpus.$f
* $corpus.$e
Step 5 Outputs:
* $extract_file.gz
* $extract_file.inv.gz
* $extract_file.o.gz (optional output; depends on the --reordering value)
Step 6 Inputs:
* $lexical_file.f2e
* $lexical_file.e2f
* $extract_file.gz
* $extract_file.inv.gz
Step 6 Outputs:
* $model_dir/rule-table.gz (if --hierarchical)
* $model_dir/phrase-table.gz (if not --hierarchical)
Step 7 Inputs:
* $extract_file.o.gz (optional; depends on the --reordering value)
Step 7 Outputs:
* $model_dir/reordering-table-$xxxx.gz ($xxxx depends on the --reordering value)
Step 8 Inputs:
* $corpus.$e (optional; depends on --generation-factors and other values)
Step 8 Outputs:
* $model_dir/generation.$f
Step 9 Inputs:
* various path strings generated for steps 6, 7 & 8, plus the value of --lm
  and the presence of the --lm file on the file system
Step 9 Outputs:
* $model_dir/moses.ini
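Since each step only needs the files listed as its inputs, a quick pre-flight check can confirm that the step 3 inputs are in place before resuming there. This is a minimal sketch, not part of train-model.perl: the function name is invented for illustration, and the A3.final default for $giza_extension is an assumption about the script's defaults.

```shell
#!/bin/sh
# Sketch: verify that both step 3 inputs exist and are readable gzip files.
# Arguments: <giza_f2e dir> <giza_e2f dir> <f> <e> [giza_extension]
check_step3_inputs() {
    giza_f2e=$1; giza_e2f=$2; f=$3; e=$4
    ext=${5:-A3.final}   # assumed default value of $giza_extension
    status=0
    for file in "$giza_f2e/$f-$e.$ext.gz" "$giza_e2f/$e-$f.$ext.gz"; do
        # gzip -t fails for a missing file as well as for a non-gzip file
        if gzip -t "$file" 2>/dev/null; then
            echo "ok: $file"
        else
            echo "missing or not gzip: $file" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns non-zero if either file is absent or not gzip-compressed, so it can guard a later call to train-model.perl.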
BerkeleyAligner can replace (M)GIZA++ in
train-model.perl steps 1 & 2. In doing so, BerkeleyAligner uses the same
inputs as step 1 above and generates the same outputs as step 2 above.
The command line I've used looks like this:
~$ java -server -mx200m -ea -jar berkeleyaligner.jar \
     -EMWordAligner.numThreads 4 \
     -Data.trainSources $corpus.list \
     -Data.englishSuffix $e \
     -Data.foreignSuffix $f \
     -Data.testSources \
     -exec.execDir $corpus_dir \
     -exec.create true \
     -Evaluator.writeGIZA true \
     -Main.SaveParams true \
     -Main.alignTraining true
There's one exception to 100% compatibility.
BerkeleyAligner's GIZA++ compatible files are not compressed with gzip.
So, to continue with train-model.perl steps 3-9, a user must ensure the
GIZA++ files generated by BerkeleyAligner have the same names expected
in step 3, including compressing with gzip. Therefore, after training is
complete, I move/rename/gzip the files to the locations expected by step
3.
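A minimal sketch of that move/rename/gzip step follows. The source path is whatever BerkeleyAligner wrote in your run (its file names are not listed here, so the caller supplies them), the function name is illustrative, and A3.final is assumed as the default $giza_extension.

```shell
#!/bin/sh
# Sketch: gzip one uncompressed GIZA++-style file into the location and
# name that train-model.perl step 3 expects.
# Arguments: <source file> <destination dir> <language pair> [giza_extension]
install_giza_file() {
    src=$1               # file written by BerkeleyAligner (caller-supplied path)
    dest_dir=$2          # e.g. the value of --giza-f2e or --giza-e2f
    pair=$3              # e.g. "$f-$e" or "$e-$f"
    ext=${4:-A3.final}   # assumed default value of $giza_extension
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    # Compress to the target name; the original file is left in place.
    gzip -c "$src" > "$dest_dir/$pair.$ext.gz"
}
```

It would be called once per direction, for example `install_giza_file "$berkeley_f2e" "$giza_f2e" "$f-$e"`, where `$berkeley_f2e` is a placeholder for the real output path.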
I didn't investigate if BerkeleyAligner creates an alignment file
(output of step 3). If so, I suppose that my extra steps aren't
necessary and train-model.perl can point to it in steps 4 and 5. I'll
take another look.
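If an alignment file is already available, resuming at step 4 could look roughly like the sketch below. It only prints the command rather than running it. It relies on the --first-step and --last-step options of train-model.perl and on steps 4 and 5 reading $alignment_file.$alignment as listed above; the function name and all paths are illustrative. It stops at step 8 because step 9 additionally needs --lm.

```shell
#!/bin/sh
# Sketch: print (not run) a train-model.perl command that starts at step 4
# from an alignment file that already exists on disk.
# Arguments: <alignment_file prefix> <alignment suffix> <corpus> <f> <e> <root_dir>
print_resume_command() {
    alignment_file=$1   # path prefix, passed as --alignment-file
    alignment=$2        # file suffix, passed as --alignment
    corpus=$3; f=$4; e=$5; root_dir=$6
    [ -f "$alignment_file.$alignment" ] || {
        echo "no alignment file at $alignment_file.$alignment" >&2
        return 1
    }
    echo "train-model.perl --first-step 4 --last-step 8" \
         "--root-dir $root_dir --corpus $corpus --f $f --e $e" \
         "--alignment-file $alignment_file --alignment $alignment"
}
```

Whether BerkeleyAligner's own alignment file can be dropped in unchanged is exactly the open question in this thread, so treat the result as something to test, not a recommendation.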
Regards,
Tom
On Wed, 14 Dec 2011 12:55:24 +0700, Hieu Hoang wrote:
Probably step 4. From what I've seen, Berkeley
serves up the alignment ready to go straight into the extraction steps.
---------- Forwarded message ----------
From: TOM HOAR
Date: 2 December 2011 09:42
Subject: [Moses-support] train-model.perl, (M)GIZA++ and BerkeleyAligner
To: Moses support
It looks like train-model.perl
uses steps 1 & 2 to train the word alignment files with (M)GIZA++. Does
BerkeleyAligner replace both of these steps in training the word
alignment files? If so, should train-model.perl pick up at step 3?
Thanks,
Tom
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support