Hi,

I am trying to do English-Chinese translation.
I've built a factored model successfully.
However, after reading the tutorial, I am still not clear on how to build a
tree-based model.

What I have in hand:
1. An English-Chinese parallel corpus with 3 factors (surface, lemma and POS).
2. An English-Chinese parallel corpus parsed with Stanford-Parser and
formatted as XML in the MOSES format.
3. The training command for my factored model is shown below:

$MOSES_DIR/scripts/training/train-model.perl \
-mgiza -mgiza-cpus 20 \
--root-dir train \
--corpus $WORK_DIR/en-ch.clean \
--f en \
--e ch \
--alignment grow-diag-final-and \
--reordering msd-bidirectional-fe \
--lm 0:3:$LANG_MOD_DIR/en-ch-surface.arpa.ch:8 \
--lm 2:3:$LANG_MOD_DIR/en-ch-pos.arpa.ch:8 \
--translation-factors 1,2-1,2+0-0,2 \
--generation-factors 1,2-0+0,2-0 \
--reordering-factors 0,2-0,2 \
--decoding-steps t0,g0:t1,g1 \
--external-bin-dir $MOSES_DIR/tools > $WORK_DIR/training.out 2>&1


My questions are:
1. Can I use all 3 factors when training a tree-based model? If so, what
should the parallel corpus look like? The XML format shown in the MOSES
tutorial does not seem to accept any factor other than the surface form.
2. I want to use trees on both the source and target sides. Is it correct to
add the following arguments to train-model.perl?

--ghkm \
--source-syntax \
--target-syntax \
--LeftBinarize \

3. I noticed that after parsing the parallel corpus with Stanford-Parser,
the resulting trees can be one-to-many (or many-to-one) for a given
sentence pair. For example, the source sentence is parsed into a single
tree, while the target sentence is parsed into 2 trees.
Will this break the "parallel" property of the parallel corpus?
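
A minimal sketch of how such a mismatch could be detected, assuming each
parsed file holds exactly one line per sentence. The function name
check_parallel and the file names in the usage line are hypothetical, not
part of MOSES:

```shell
# Compare the number of lines in two parsed corpus files.
# Assumption: one sentence (one tree) per line on each side.
check_parallel() {
  # awk prints the line count without the padding some wc versions add
  src_lines=$(awk 'END { print NR }' "$1")
  tgt_lines=$(awk 'END { print NR }' "$2")
  if [ "$src_lines" -eq "$tgt_lines" ]; then
    echo "OK: $src_lines lines on both sides"
  else
    echo "MISMATCH: source=$src_lines target=$tgt_lines"
    return 1
  fi
}
```

Usage would look like `check_parallel parsed.en parsed.ch`. A mismatch only
shows that the two sides no longer line up; it does not say which sentence
was split into several trees.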