Steven Huang <d98922047@...> writes:
> 
> The question is:
> 1. Can I use all the 3 factors when training tree-based model? If yes, how
> the parallel corpus should be like? The XML format shown in the MOSES
> tutorial seems not able to accept factors except surface.

I've successfully tested a toy syntactic model with factors, but there has
been no systematic testing and I imagine many things won't work. One thing
that does work is using different factors for the translation model and the
language model. The format in my corpus was like this (one sentence per line):

<tree label="sent"><tree label="root">c|x</tree><tree label="root">b|y</tree><tree label="root">b|y</tree></tree>
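
For illustration, here is a minimal Python sketch that produces a line in the same shape as the example above. The helper names `leaf` and `sentence` are made up for this example and are not part of Moses. It assumes that factors are joined with '|' and that no factor value itself contains '|'.

```python
from xml.sax.saxutils import escape, quoteattr


def leaf(label, factors):
    """Render one node whose text is the factors joined by '|'."""
    for f in factors:
        if "|" in f:
            # '|' is the factor separator, so it cannot appear in a value.
            raise ValueError(f"factor contains '|': {f!r}")
    # Escape XML metacharacters (&, <, >) inside each factor value.
    word = "|".join(escape(f) for f in factors)
    return f"<tree label={quoteattr(label)}>{word}</tree>"


def sentence(label, children):
    """Wrap already-rendered child nodes in one root node, on a single line."""
    return f"<tree label={quoteattr(label)}>{''.join(children)}</tree>"


# Rebuild the toy example: three words, each with two factors.
line = sentence("sent", [
    leaf("root", ["c", "x"]),
    leaf("root", ["b", "y"]),
    leaf("root", ["b", "y"]),
])
print(line)
```

Each sentence ends up on a single line, which is what keeps the two sides of the corpus aligned line by line.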

> 2. I want to use trees on both source and target side, is it correct to
> add the following arguments to train-model.perl?
> 
> 
> --ghkm \
> --source-syntax \
> --target-syntax \
> --LeftBinarize \

The GHKM implementation currently assumes string-to-tree (or tree-to-string)
rules, but I think you can try the hierarchical extractor (just leave out
'--ghkm') with both source and target syntax.

> 
> 3. I noticed that after using Stanford-Parser to generate trees for
> parallel corpus, the resulted trees might be 1 to many (or many to 1) for a
> particular sentence. e.g., the sentence of source language is parsed into a
> single tree, while the target language sentence is parsed into 2 trees. Will
> this break the "parallel" property of parallel corpus?

Yes, you'll need to ensure that you get exactly one tree per sentence. Either
do some post-processing and merge the two trees into one by creating a virtual
root node, or throw out these sentence pairs.
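
Both options can be sketched in a few lines of Python. This is only an illustration, assuming the parser output has already been converted to the XML tree format and collected as a list of tree strings per sentence. The function names and the "VROOT" label are hypothetical, so pick whatever label suits your grammar.

```python
def merge_trees(trees, root_label="VROOT"):
    """Return a single tree string for one sentence.

    A lone tree is returned unchanged; several trees are wrapped
    under a new virtual root node.
    """
    if not trees:
        raise ValueError("sentence has no parse tree")
    if len(trees) == 1:
        return trees[0]
    return f'<tree label="{root_label}">{"".join(trees)}</tree>'


def keep_single_tree_pairs(pairs):
    """Alternative: drop every pair where either side has more or fewer than one tree."""
    return [(s[0], t[0]) for s, t in pairs if len(s) == 1 and len(t) == 1]


# One source tree, two target trees for the same sentence pair.
src = ['<tree label="S">a</tree>']
tgt = ['<tree label="S">b</tree>', '<tree label="S">c</tree>']

merged = (merge_trees(src), merge_trees(tgt))
print(merged[1])
```

Merging keeps the sentence pair in the corpus, while filtering shrinks the corpus but avoids introducing an artificial label into the extracted rules.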

hope this helps,
Rico

