Following up. Yes, BerkeleyAligner creates an alignment file.
However, its default file is significantly different from the one
generated by train-model.perl's step 3 using the (M)GIZA++ compatible
files generated by BerkeleyAligner. What are the reasons to use one
method over the other, or is one approach inherently wrong? 
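
One way to put a number on "significantly different" is to count how many sentence lines differ between the two alignment files. The following is only a sketch: it assumes both files hold one sentence per line with space-separated alignment points (e.g. "0-0 1-2") in the same sentence order, and the function name count_diff_lines is made up for illustration. Lines are compared literally, so the same points listed in a different order count as a difference.

```shell
#!/bin/sh
# count_diff_lines FILE_A FILE_B
# Prints the number of lines that differ between two alignment files.
# Assumption: one sentence per line, same sentence order in both files.
count_diff_lines() {
    # paste joins line N of each file with a tab; awk compares the two halves
    # after stripping trailing spaces, which some tools append to each line.
    paste "$1" "$2" | awk -F '\t' '
        {
            a = $1; b = $2
            sub(/ +$/, "", a); sub(/ +$/, "", b)
            if (a != b) n++
        }
        END { print n + 0 }'
}
```

Called as count_diff_lines berkeley.align moses.align (both file names hypothetical), it prints a single integer.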

Tom 

On Wed, 14 Dec 2011 17:25:31 +0700, Tom Hoar wrote:

Hi Hieu, 

Thanks for the reply. It's good to hear you're a full-time employee on
the Asia Online team after completing your studies at Edinburgh a few
months ago.


I had some time and worked out the details. I'll share here for others
to try. 

A brief summary: the BerkeleyAligner output replaces steps 1 &
2 of train-model.perl. A few additional steps that rename/gzip the
BerkeleyAligner output allow train-model.perl to continue from step 3.
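
As a rough sketch of what "continue from step 3" looks like on the command line, the helper below only prints a train-model.perl invocation and does not run it. It relies on the script's --first-step and --last-step options. The function name print_resume_cmd is hypothetical, and a real run would need further options (for example --root-dir, --lm, --reordering) that are left out here.

```shell
#!/bin/sh
# print_resume_cmd CORPUS_STEM F E
# Dry run: prints a train-model.perl command that starts at step 3.
# Assumption: steps 1 and 2 were replaced by BerkeleyAligner and its output
# is already in place under the names step 3 expects.
print_resume_cmd() {
    printf 'train-model.perl --first-step 3 --last-step 9 --corpus %s --f %s --e %s\n' \
        "$1" "$2" "$3"
}
```

For instance, print_resume_cmd corpus/train fr en prints the command for a hypothetical French-English corpus stem.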


Details: here's a breakdown of the individual inputs and outputs of
each of steps 1 to 9 in train-model.perl with defaults for (M)GIZA++.
There are other temporary outputs, but they're not listed here because,
from what I can tell, they are not used as inputs for subsequent steps.
The syntax here is $command_line == --command-line option, such as
$corpus_dir == --corpus-dir, $giza_f2e == --giza-f2e, $e == --e, etc.
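
To make the naming convention concrete, here is a small sketch that turns an option name into the placeholder used in the lists below. The function name opt_to_var is made up for this example; it is not part of train-model.perl.

```shell
#!/bin/sh
# opt_to_var OPTION
# Maps a command-line option to the placeholder notation used in this mail:
# strip the leading "--", turn "-" into "_", and prefix a "$".
opt_to_var() {
    printf '%s\n' "$1" | sed -e 's/^--//' -e 's/-/_/g' -e 's/^/$/'
}
```

So opt_to_var --corpus-dir yields the placeholder $corpus_dir, and opt_to_var --giza-f2e yields $giza_f2e.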


Step 1 Inputs: 

        * $corpus.$f
        * $corpus.$e

Step 1 Outputs:

        * $corpus_dir/$f.vcb
        * $corpus_dir/$e.vcb
        * $corpus_dir/$f.vcb.classes
        * $corpus_dir/$e.vcb.classes
        * $corpus_dir/$f-$e-int-train.snt
        * $corpus_dir/$e-$f-int-train.snt

Step 2 Inputs:

        * $corpus_dir/$f.vcb
        * $corpus_dir/$e.vcb
        * $corpus_dir/$f.vcb.classes
        * $corpus_dir/$e.vcb.classes
        * $corpus_dir/$f-$e-int-train.snt
        * $corpus_dir/$e-$f-int-train.snt

Step 2 Outputs:

        * $giza_f2e/$f-$e.$giza_extension.gz
        * $giza_e2f/$e-$f.$giza_extension.gz

Step 3 Inputs:

        * $giza_f2e/$f-$e.$giza_extension.gz
        * $giza_e2f/$e-$f.$giza_extension.gz

Step 3 Outputs:

        * $alignment_file.$alignment

Step 4 Inputs:

        * $alignment_file.$alignment
        * $corpus.$f
        * $corpus.$e

Step 4 Outputs:

        * $lexical_file.f2e
        * $lexical_file.e2f

Step 5 Inputs:

        * $alignment_file.$alignment
        * $corpus.$f
        * $corpus.$e

Step 5 Outputs:

        * $extract_file.gz
        * $extract_file.inv.gz
        * $extract_file.o.gz (optional output; depends on --reordering value)

Step 6 Inputs:

        * $lexical_file.f2e
        * $lexical_file.e2f
        * $extract_file.gz
        * $extract_file.inv.gz

Step 6 Outputs:

        * $model_dir/rule-table.gz (if --hierarchical)
        * $model_dir/phrase-table.gz (if not --hierarchical)

Step 7 Inputs:

        * $extract_file.o.gz (optional; depends on --reordering value)

Step 7 Outputs:

        * $model_dir/reordering-table-$xxxx.gz ($xxxx depends on
          --reordering value)

Step 8 Inputs:

        * $corpus.$e (optional; depends on --generation-factors and
          other values)

Step 8 Outputs:

        * $model_dir/generation.$f

Step 9 Inputs:

        * various path strings generated for steps 6, 7 & 8, plus value
          of --lm and presence of --lm file on the file system

Step 9 Outputs:

        * $model_dir/moses.ini

BerkeleyAligner can replace (M)GIZA++ in train-model.perl steps 1 & 2.
In doing so, BerkeleyAligner uses the same inputs as step 1 above and
generates the same outputs as step 2 above. The command line I've used
looks like this:

~$ java -server -mx200m -ea -jar berkeleyaligner.jar \
 -EMWordAligner.numThreads 4 -Data.trainSources $corpus.list \
 -Data.englishSuffix $e -Data.foreignSuffix $f -Data.testSources \
 -exec.execDir $corpus_dir -exec.create true \
 -Evaluator.writeGIZA true -Main.SaveParams true \
 -Main.alignTraining true

There's one exception to 100% compatibility: BerkeleyAligner's GIZA++
compatible files are not compressed with gzip. So, to continue with
train-model.perl steps 3-9, a user must ensure the GIZA++ files
generated by BerkeleyAligner have the names expected in step 3,
including the gzip compression. Therefore, after training is complete,
I move/rename/gzip the files to the locations expected by step 3.
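
The move/rename/gzip step could be sketched as follows. This is a minimal sketch under stated assumptions: the name of the file BerkeleyAligner writes is not given here, so it is passed in as an argument, and the function name stage_giza is hypothetical. The target name follows the step 3 input pattern listed above ($giza_f2e/$f-$e.$giza_extension.gz and its reverse).

```shell
#!/bin/sh
# stage_giza SRC DEST_DIR PAIR EXT
# Compresses an uncompressed GIZA++ compatible file to DEST_DIR/PAIR.EXT.gz.
#   SRC      - file written by BerkeleyAligner (name supplied by the caller)
#   DEST_DIR - value of --giza-f2e or --giza-e2f
#   PAIR     - "$f-$e" or "$e-$f"
#   EXT      - value of --giza-extension
stage_giza() {
    src=$1; dest_dir=$2; pair=$3; ext=$4
    [ -f "$src" ] || { echo "missing input: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    # gzip -c leaves the original file untouched and writes to stdout.
    gzip -c "$src" > "$dest_dir/$pair.$ext.gz"
}
```

It would be called once per direction, for example stage_giza path/to/f2e-output "$giza_f2e" "$f-$e" "$giza_extension", where path/to/f2e-output is a placeholder for the actual BerkeleyAligner file. Copying with gzip -c rather than moving keeps the original around in case step 3 has to be rerun.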

I didn't investigate if BerkeleyAligner creates an alignment file
(output of step 3). If so, I suppose that my extra steps aren't
necessary and train-model.perl can point to it in steps 4 and 5. I'll
take another look. 

Regards,
Tom 

On Wed, 14 Dec 2011 12:55:24 +0700, Hieu Hoang wrote:

Probably step 4. From what I've seen, Berkeley serves up the alignment
ready to go straight into the extraction steps.


---------- Forwarded message ----------
From: TOM HOAR
Date: 2 December 2011 09:42
Subject: [Moses-support] train-model.perl, (M)GIZA++ and BerkeleyAligner
To: Moses support

It looks like train-model.perl uses steps 1 & 2 to train the word
alignment files with (M)GIZA++. Does BerkeleyAligner replace both of
these steps in training the word alignment files? If so, should
train-model.perl then pick up at step 3?


Thanks,
Tom 
_______________________________________________

Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support