Hi, I am trying to do English-Chinese translation. I've built a factored model successfully. However, after reading the tutorial I am still not clear about how to build a tree-based model.
What I have in hand:

1. An English-Chinese parallel corpus with 3 factors (surface, lemma and POS).
2. An English-Chinese parallel corpus parsed with the Stanford Parser and formatted as XML in the Moses format.
3. The training command for my factored model, shown below:

    $MOSES_DIR/scripts/training/train-model.perl \
      -mgiza -mgiza-cpus 20 \
      --root-dir train \
      --corpus $WORK_DIR/en-ch.clean \
      --f en \
      --e ch \
      --alignment grow-diag-final-and \
      --reordering msd-bidirectional-fe \
      --lm 0:3:$LANG_MOD_DIR/en-ch-surface.arpa.ch:8 \
      --lm 2:3:$LANG_MOD_DIR/en-ch-pos.arpa.ch:8 \
      --translation-factors 1,2-1,2+0-0,2 \
      --generation-factors 1,2-0+0,2-0 \
      --reordering-factors 0,2-0,2 \
      --decoding-steps t0,g0:t1,g1 \
      --external-bin-dir $MOSES_DIR/tools > $WORK_DIR/training.out 2>&1

My questions are:

1. Can I use all 3 factors when training a tree-based model? If yes, what should the parallel corpus look like? The XML format shown in the Moses tutorial does not seem to accept any factor other than the surface form.
2. I want to use trees on both the source and the target side. Is it correct to add the following arguments to train-model.perl?

    --ghkm \
    --source-syntax \
    --target-syntax \
    --LeftBinarize \

3. I noticed that after using the Stanford Parser to generate trees for the parallel corpus, the resulting trees can be one-to-many (or many-to-one) for a particular sentence pair. For example, the source sentence is parsed into a single tree, while the corresponding target sentence is parsed into 2 trees. Will this break the "parallel" property of the parallel corpus?
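To make question 3 concrete, here is a minimal sketch of how lines with more than one top-level tree could be detected. It assumes the Moses XML tree format from the tutorial (nested <tree label="..."> elements, one sentence per line, no self-closing tags); the function names are hypothetical and not part of Moses.

```python
import re

# Matches opening and closing <tree> tags; group(1) is "/" for a closing tag.
# Assumption: trees are written as nested <tree label="..."> ... </tree>
# elements, one sentence per line, without self-closing tags.
TREE_TAG = re.compile(r"<(/?)tree\b[^>]*>")


def count_top_level_trees(line):
    """Count the <tree> elements that start at nesting depth 0 in one line."""
    depth = 0
    roots = 0
    for match in TREE_TAG.finditer(line):
        if match.group(1):      # closing tag
            depth -= 1
        else:                   # opening tag
            if depth == 0:
                roots += 1
            depth += 1
    return roots


def find_multi_root_lines(lines):
    """Return 1-based numbers of lines that do not hold exactly one tree."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if count_top_level_trees(line) != 1
    ]


if __name__ == "__main__":
    sample = [
        '<tree label="S"> <tree label="NP"> I </tree> <tree label="VP"> run </tree> </tree>',
        '<tree label="S"> a </tree> <tree label="S"> b </tree>',
    ]
    print(find_multi_root_lines(sample))
```

Running this over the source and target files separately would show which sentence pairs have a different number of trees on each side, which is the case I am asking about.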
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
