Re: [Moses-support] Fwd: train-model.perl, (M)GIZA++ and BerkeleyAligner

Tom Hoar Wed, 14 Dec 2011 02:26:53 -0800


Hi Hieu,


Thanks for the reply. It's good to hear you're a full-time
employee on the Asia Online team after completing your studies at
Edinburgh a few months ago. 

I had some time and worked out the
details. I'll share here for others to try. 

A brief summary: the
BerkeleyAligner output replaces steps 1 & 2 of train-model.perl. A few
additional steps that rename/gzip the BerkeleyAligner output allow
train-model.perl to continue from step 3. 

Details: here's a breakdown
of the individual inputs and outputs of each of steps 1 to 9 in
train-model.perl with defaults for (M)GIZA++. There are other temporary
outputs, but they're not listed here because, from what I can tell, they
are not used as inputs for subsequent steps. The notation here is
$command_line == --command-line option, such as $corpus_dir ==
--corpus-dir, $giza_f2e == --giza-f2e, $e == --e, etc.

Step 1 Inputs:

        * $corpus.$f
        * $corpus.$e

Step 1 Outputs:

        * $corpus_dir/$f.vcb
        * $corpus_dir/$e.vcb
        * $corpus_dir/$f.vcb.classes
        * $corpus_dir/$e.vcb.classes
        * $corpus_dir/$f-$e-int-train.snt
        * $corpus_dir/$e-$f-int-train.snt

Step 2 Inputs:

        * $corpus_dir/$f.vcb
        * $corpus_dir/$e.vcb
        * $corpus_dir/$f.vcb.classes
        * $corpus_dir/$e.vcb.classes
        * $corpus_dir/$f-$e-int-train.snt
        * $corpus_dir/$e-$f-int-train.snt

Step 2 Outputs:

        * $giza_f2e/$f-$e.$giza_extension.gz
        * $giza_e2f/$e-$f.$giza_extension.gz

Step 3 Inputs:

        * $giza_f2e/$f-$e.$giza_extension.gz
        * $giza_e2f/$e-$f.$giza_extension.gz

Step 3 Outputs:

        * $alignment_file.$alignment

Step 4 Inputs:

        * $alignment_file.$alignment
        * $corpus.$f
        * $corpus.$e

Step 4 Outputs:

        * $lexical_file.f2e
        * $lexical_file.e2f

Step 5 Inputs:

        * $alignment_file.$alignment
        * $corpus.$f
        * $corpus.$e

Step 5 Outputs:

        * $extract_file.gz
        * $extract_file.inv.gz
        * $extract_file.o.gz (output optional; depends on --reordering value)

Step 6 Inputs:

        * $lexical_file.f2e
        * $lexical_file.e2f
        * $extract_file.gz
        * $extract_file.inv.gz

Step 6 Outputs:

        * $model_dir/rule-table.gz (if --hierarchical)
        * $model_dir/phrase-table.gz (if not --hierarchical)

Step 7 Inputs:

        * $extract_file.o.gz (optional; depends on --reordering value)

Step 7 Outputs:

        * $model_dir/reordering-table-$xxxx.gz ($xxxx depends on --reordering value)

Step 8 Inputs:

        * $corpus.$e (optional; depends on --generation-factors and other values)

Step 8 Outputs:

        * $model_dir/generation.$f

Step 9 Inputs:

        * various path strings generated for steps 6, 7 & 8, plus value of
          --lm and presence of --lm file on the file system

Step 9 Outputs:

        * $model_dir/moses.ini
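
The breakdown above can also be written as a small table in code. Given a
starting step, the sketch below works out which files have to exist already.
It is only an illustration of the lists in this message: the names are the
$variables used above, not real paths, and the function name
files_needed_to_resume is made up for this example. Step 6 is listed with its
non-hierarchical output only, and step 9's "path strings" are left out
because they are not files.

```python
# Inputs and outputs per train-model.perl step, transcribed from the lists
# in this message. The strings are the $variable notation, not real paths.
GIZA_FILES = ["$giza_f2e/$f-$e.$giza_extension.gz",
              "$giza_e2f/$e-$f.$giza_extension.gz"]
PREPARED = ["$corpus_dir/$f.vcb", "$corpus_dir/$e.vcb",
            "$corpus_dir/$f.vcb.classes", "$corpus_dir/$e.vcb.classes",
            "$corpus_dir/$f-$e-int-train.snt",
            "$corpus_dir/$e-$f-int-train.snt"]
ALIGNED = ["$alignment_file.$alignment", "$corpus.$f", "$corpus.$e"]

STEPS = {
    1: {"in": ["$corpus.$f", "$corpus.$e"], "out": PREPARED},
    2: {"in": PREPARED, "out": GIZA_FILES},
    3: {"in": GIZA_FILES, "out": ["$alignment_file.$alignment"]},
    4: {"in": ALIGNED, "out": ["$lexical_file.f2e", "$lexical_file.e2f"]},
    5: {"in": ALIGNED, "out": ["$extract_file.gz", "$extract_file.inv.gz",
                               "$extract_file.o.gz"]},
    6: {"in": ["$lexical_file.f2e", "$lexical_file.e2f",
               "$extract_file.gz", "$extract_file.inv.gz"],
        "out": ["$model_dir/phrase-table.gz"]},  # non-hierarchical case only
    7: {"in": ["$extract_file.o.gz"],
        "out": ["$model_dir/reordering-table-$xxxx.gz"]},
    8: {"in": ["$corpus.$e"], "out": ["$model_dir/generation.$f"]},
    9: {"in": [], "out": ["$model_dir/moses.ini"]},  # path strings omitted
}


def files_needed_to_resume(first_step, last_step=9):
    """List inputs that no step in first_step..last_step produces itself."""
    produced = set()
    needed = []
    for step in range(first_step, last_step + 1):
        for name in STEPS[step]["in"]:
            if name not in produced and name not in needed:
                needed.append(name)
        produced.update(STEPS[step]["out"])
    return needed
```

Calling files_needed_to_resume(3) returns the two gzipped GIZA++ files plus
the two corpus files, which is the set an external aligner has to supply if
train-model.perl is to continue from step 3.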

BerkeleyAligner can replace (M)GIZA++ in
train-model.perl steps 1 & 2. In doing so, BerkeleyAligner uses the same
inputs as step 1 above and generates the same output as step 2 above.
The command line I've used looks like this:

~$ java -server -mx200m -ea -jar berkeleyaligner.jar \
     -EMWordAligner.numThreads 4 \
     -Data.trainSources $corpus.list \
     -Data.englishSuffix $e -Data.foreignSuffix $f -Data.testSources \
     -exec.execDir $corpus_dir -exec.create true \
     -Evaluator.writeGIZA true -Main.SaveParams true \
     -Main.alignTraining true
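
For anyone driving this from a script, the same command can be assembled as
an argument list. This is only a sketch: the function name berkeley_command
and its parameters are invented for the example, and the flags are copied
from the command line above without checking them against the
BerkeleyAligner documentation.

```python
import shlex


def berkeley_command(jar, corpus_list, exec_dir, e, f, threads=4):
    """Build the BerkeleyAligner argument list used in this message."""
    return [
        "java", "-server", "-mx200m", "-ea", "-jar", jar,
        "-EMWordAligner.numThreads", str(threads),
        "-Data.trainSources", corpus_list,
        "-Data.englishSuffix", e,
        "-Data.foreignSuffix", f,
        # Given without a value in the message, so reproduced that way here.
        "-Data.testSources",
        "-exec.execDir", exec_dir,
        "-exec.create", "true",
        "-Evaluator.writeGIZA", "true",
        "-Main.SaveParams", "true",
        "-Main.alignTraining", "true",
    ]


def as_shell_line(args):
    """Render the list as one shell-quoted line for logging or review."""
    return shlex.join(args)
```

The list can be handed to subprocess.run(args, check=True) once the paths
and language suffixes are filled in.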

There's one exception to 100% compatibility:
BerkeleyAligner's GIZA++-compatible files are not compressed with gzip.
So, to continue with train-model.perl steps 3-9, a user must ensure the
GIZA++ files generated by BerkeleyAligner have the names expected
in step 3, including the gzip compression. Therefore, after training is
complete, I move/rename/gzip the files to the locations expected by step
3.
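
The move/rename/gzip step could look roughly like the sketch below. The
function name install_giza_file is made up for this example, and the source
path is a parameter because the names of the files BerkeleyAligner writes
are not given in this message. The default extension "A3.final" is an
assumption about the --giza-extension default and should be checked against
your copy of train-model.perl.

```python
import gzip
import shutil
from pathlib import Path


def install_giza_file(src, giza_dir, src_lang, tgt_lang,
                      giza_extension="A3.final"):  # assumed default
    """Gzip an uncompressed GIZA++-format file into the step 3 location.

    The result is giza_dir/src_lang-tgt_lang.giza_extension.gz, following
    the $giza_f2e/$f-$e.$giza_extension.gz pattern listed for step 3.
    """
    dest = Path(giza_dir) / f"{src_lang}-{tgt_lang}.{giza_extension}.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream, so large files are fine
    return dest
```

It would be called once per direction, for example with ($f, $e) into
$giza_f2e and ($e, $f) into $giza_e2f. With both files in place,
train-model.perl can be restarted with --first-step 3.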

I didn't investigate whether BerkeleyAligner creates an alignment file
(output of step 3). If so, I suppose that my extra steps aren't
necessary and train-model.perl can point to it in steps 4 and 5. I'll
take another look.
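
If an alignment file does turn up, one cheap sanity check before pointing
steps 4 and 5 at it is that it has one line per sentence pair, i.e. the same
number of lines as both corpus files. The sketch below does only that; the
function names are made up for this example, and it says nothing about
whether the content of the lines is in the format train-model.perl expects.

```python
def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def alignment_matches_corpus(alignment_path, f_path, e_path):
    """True if the alignment file and both corpus files have equal length."""
    counts = [count_lines(p) for p in (alignment_path, f_path, e_path)]
    return counts[0] == counts[1] == counts[2]
```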

Regards,
Tom 

On Wed, 14 Dec 2011 12:55:24 +0700,
Hieu Hoang wrote:  

probably step 4. From what I've seen, Berkeley
serves up the alignment ready to go straight into the extraction steps


---------- Forwarded message ----------
From: TOM HOAR
Date: 2 December 2011 09:42
Subject: [Moses-support] train-model.perl, (M)GIZA++ and BerkeleyAligner
To: Moses support

It looks like train-model.perl
uses steps 1 & 2 to train the word alignment files with (M)GIZA++. Does
BerkeleyAligner replace both of these steps in training the word
alignment files? If so, should train-model.perl then pick up at step 3?


Thanks,
Tom 
_______________________________________________

Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
