Steven Huang <d98922047@...> writes:
>
> The question is:
> 1. Can I use all the 3 factors when training tree-based model? If yes, how
> the parallel corpus should be like? The XML format shown in the MOSES
> tutorial seems not able to accept factors except surface.

I've successfully tested a toy syntactic model with factors, but there has
been no systematic testing, and I imagine many things won't work. (One thing
that does work: using different factors for the translation model and the
language model.) The format in my corpus was like this:

<tree label="sent"><tree label="root">c|x</tree><tree label="root">b|y</tree><tree label="root">b|y</tree></tree>

> 2. I want to use trees on both source and target side, is it correct to add
> the following arguments to train-model.perl?
>
>
> --ghkm \
> --source-syntax \
> --target-syntax \
> --LeftBinarize \

The GHKM implementation currently assumes string-to-tree (or tree-to-string)
rules, but I think you can try the hierarchical extractor (just leave out
'--ghkm') with both source and target syntax.

>
> 3. I noticed that after using Stanford-Parser to generate trees for parallel
> corpus, the resulted trees might be 1 to many (or many to 1) for a particular
> sentence. e.g., the sentence of source language is parsed into a single tree,
> while the target language sentence is parsed into 2 trees. Will this break
> the "parallel" property of parallel corpus?

You'll need to ensure that you get exactly one tree per sentence. Either do
some post-processing and merge the two trees into one by creating a virtual
root node, or throw out these sentence pairs.

Hope this helps,
Rico
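
P.S. A minimal sketch of the post-processing suggested for question 3, assuming the parser output has already been converted to the <tree label="..."> format shown above and grouped so that each sentence comes with a list of one or more tree strings. The function names and the "VROOT" label are hypothetical choices for illustration, not something Moses prescribes.

```python
def merge_trees(trees, virtual_label="VROOT"):
    """Return a single tree string for one sentence.

    trees: list of '<tree label="...">...</tree>' strings for that sentence.
    Returns None when there is no tree at all, so the caller can drop the pair.
    """
    trees = [t.strip() for t in trees if t.strip()]
    if not trees:
        return None
    if len(trees) == 1:
        return trees[0]
    # Several trees for one sentence: attach them under a virtual root node.
    return '<tree label="{}">{}</tree>'.format(virtual_label, " ".join(trees))


def fix_parallel(source_trees, target_trees):
    """Yield (source, target) pairs with exactly one tree on each side.

    Pairs where either side has no tree are thrown out, which keeps the two
    sides of the corpus aligned line by line.
    """
    for src, tgt in zip(source_trees, target_trees):
        src_tree = merge_trees(src)
        tgt_tree = merge_trees(tgt)
        if src_tree is None or tgt_tree is None:
            continue
        yield src_tree, tgt_tree
```

Feeding the per-sentence tree lists of both sides through fix_parallel and writing each yielded pair to the same line number in the two output files should preserve the one-tree-per-line property of the corpus.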
