Hi Hieu,
Thanks for the reply. It's good to hear you're a full-time
employee on the Asia Online team after completing your studies at
Edinburgh a few months ago.
I had some time and worked out the details. I'll share them here for
others to try.

A brief summary: the BerkeleyAligner output replaces steps 1 & 2 of
train-model.perl. A few additional steps that rename/gzip the
BerkeleyAligner output allow train-model.perl to continue from step 3.
Details: here's a breakdown of the individual inputs and outputs of
each of steps 1 to 9 in train-model.perl, with defaults for (M)GIZA++.
There are other temporary outputs, but they're not listed here because,
from what I can tell, they are not used as inputs for subsequent steps.
The syntax here is $command_line == --command-line option, such as
$corpus_dir == --corpus-dir, $giza_f2e == --giza-f2e, $e == --e, etc.
Step 1 Inputs:
* $corpus.$f
* $corpus.$e
Step 1 Outputs:
* $corpus_dir/$f.vcb
* $corpus_dir/$e.vcb
* $corpus_dir/$f.vcb.classes
* $corpus_dir/$e.vcb.classes
* $corpus_dir/$f-$e-int-train.snt
* $corpus_dir/$e-$f-int-train.snt
Step 2 Inputs:
* $corpus_dir/$f.vcb
* $corpus_dir/$e.vcb
* $corpus_dir/$f.vcb.classes
* $corpus_dir/$e.vcb.classes
* $corpus_dir/$f-$e-int-train.snt
* $corpus_dir/$e-$f-int-train.snt
Step 2 Outputs:
* $giza_f2e/$f-$e.$giza_extension.gz
* $giza_e2f/$e-$f.$giza_extension.gz
Step 3 Inputs:
* $giza_f2e/$f-$e.$giza_extension.gz
* $giza_e2f/$e-$f.$giza_extension.gz
Step 3 Outputs:
* $alignment_file.$alignment
Step 4 Inputs:
* $alignment_file.$alignment
* $corpus.$f
* $corpus.$e
Step 4 Outputs:
* $lexical_file.f2e
* $lexical_file.e2f
Step 5 Inputs:
* $alignment_file.$alignment
* $corpus.$f
* $corpus.$e
Step 5 Outputs:
* $extract_file.gz
* $extract_file.inv.gz
* $extract_file.o.gz (optional output; depends on the --reordering value)
Step 6 Inputs:
* $lexical_file.f2e
* $lexical_file.e2f
* $extract_file.gz
* $extract_file.inv.gz
Step 6 Outputs:
* $model_dir/rule-table.gz (if --hierarchical)
* $model_dir/phrase-table.gz (if not --hierarchical)
Step 7 Inputs:
* $extract_file.o.gz (optional; depends on the --reordering value)
Step 7 Outputs:
* $model_dir/reordering-table-$xxxx.gz ($xxxx depends on the
  --reordering value)
Step 8 Inputs:
* $corpus.$e (optional; depends on --generation-factors and other
  values)
Step 8 Outputs:
* $model_dir/generation.$f
Step 9 Inputs:
* various path strings generated for steps 6, 7 & 8, plus the value of
  --lm and the presence of the --lm file on the file system
Step 9 Outputs:
* $model_dir/moses.ini
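
For anyone who wants to check which steps still need to run given the
files already on disk, here is a rough Python sketch of the listing
above. It is only an illustration, not part of train-model.perl. The
short artifact labels (e.g. "giza.f2e.gz", "alignment") are names I made
up for this sketch, it assumes the non-hierarchical case with a
reordering model, and it leaves out steps 8 and 9 because their inputs
depend on other options.

```python
# Inputs/outputs per step, transcribed from the listing above.
# Labels are hypothetical shorthand, not real file names.
VOCAB = {"f.vcb", "e.vcb", "f.vcb.classes", "e.vcb.classes",
         "f-e.snt", "e-f.snt"}
CORPUS = {"corpus.f", "corpus.e"}

STEPS = {
    1: {"in": CORPUS, "out": VOCAB},
    2: {"in": VOCAB, "out": {"giza.f2e.gz", "giza.e2f.gz"}},
    3: {"in": {"giza.f2e.gz", "giza.e2f.gz"}, "out": {"alignment"}},
    4: {"in": {"alignment"} | CORPUS, "out": {"lex.f2e", "lex.e2f"}},
    5: {"in": {"alignment"} | CORPUS,
        "out": {"extract.gz", "extract.inv.gz", "extract.o.gz"}},
    6: {"in": {"lex.f2e", "lex.e2f", "extract.gz", "extract.inv.gz"},
        "out": {"phrase-table.gz"}},
    7: {"in": {"extract.o.gz"}, "out": {"reordering-table.gz"}},
}

GOALS = {"phrase-table.gz", "reordering-table.gz"}


def steps_to_run(available, goals):
    """Walk the steps backwards and collect those whose outputs are
    still needed and not already available."""
    available = set(available)
    needed = set(goals) - available
    run = []
    for step in sorted(STEPS, reverse=True):
        if STEPS[step]["out"] & needed:
            run.append(step)
            needed -= STEPS[step]["out"]
            needed |= STEPS[step]["in"] - available
    return sorted(run)


# With the two GIZA++-format files already in place, steps 1 & 2 drop out.
have = CORPUS | {"giza.f2e.gz", "giza.e2f.gz"}
print(steps_to_run(have, GOALS))  # [3, 4, 5, 6, 7]
```

If an alignment file were already available instead, the same function
would start at step 4.
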

BerkeleyAligner can replace (M)GIZA++ in train-model.perl steps 1 & 2.
In doing so, BerkeleyAligner uses the same inputs as step 1 above and
generates the same outputs as step 2 above. The command line I've used
looks like this:
~$ java -server -mx200m -ea -jar berkeleyaligner.jar \
     -EMWordAligner.numThreads 4 \
     -Data.trainSources $corpus.list \
     -Data.englishSuffix $e \
     -Data.foreignSuffix $f \
     -Data.testSources \
     -exec.execDir $corpus_dir \
     -exec.create true \
     -Evaluator.writeGIZA true \
     -Main.SaveParams true \
     -Main.alignTraining true
There's one exception to 100% compatibility: BerkeleyAligner's
GIZA++-compatible files are not compressed with gzip. So, to continue
with train-model.perl steps 3-9, a user must ensure the GIZA++ files
generated by BerkeleyAligner have the names expected in step 3,
including the gzip compression. Therefore, after training is complete,
I move/rename/gzip the files to the locations expected by step 3.
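
As a rough illustration of that rename/gzip step, here is a minimal
Python sketch. The helper name install_giza_file and its parameters are
hypothetical, and the source path is whatever BerkeleyAligner wrote in
your run (I'm not asserting its file names here). The target name
follows the step 3 input pattern above, i.e.
$giza_f2e/$f-$e.$giza_extension.gz. Note that it writes a compressed
copy and leaves the original file in place rather than moving it.

```python
import gzip
import shutil
from pathlib import Path


def install_giza_file(src, giza_dir, first, second, giza_extension):
    """Gzip `src` into giza_dir/first-second.giza_extension.gz.

    All arguments are supplied by the caller; nothing here is read from
    train-model.perl or BerkeleyAligner.
    """
    dest = Path(giza_dir) / f"{first}-{second}.{giza_extension}.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the uncompressed file into a gzip-compressed copy.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

It would be called once per direction, for example
install_giza_file(path_to_f2e_output, giza_f2e, f, e, giza_extension)
and again with e and f swapped for the $giza_e2f directory, where every
argument is a value you supply to match your own --giza-f2e, --giza-e2f
and --giza-extension settings.
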
I didn't investigate if BerkeleyAligner creates an alignment file
(output of step 3). If so, I suppose that my extra steps aren't
necessary and train-model.perl can point to it in steps 4 and 5. I'll
take another look.
Regards,
Tom
On Wed, 14 Dec 2011 12:55:24 +0700, Hieu Hoang wrote:
Probably step 4. From what I've seen, Berkeley serves up the alignment
ready to go straight into the extraction steps.
---------- Forwarded message ----------
From: TOM HOAR
Date: 2 December 2011 09:42
Subject: [Moses-support] train-model.perl, (M)GIZA++ and BerkeleyAligner
To: Moses support

It looks like train-model.perl uses steps 1 & 2 to train the word
alignment files with (M)GIZA++. Does BerkeleyAligner replace both of
these steps in training the word alignment files? If so, should
train-model.perl then pick up at step 3?

Thanks,
Tom
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support