Hi!

I have a 2.5-million-word parallel corpus and a 50-million-word
monolingual target-language corpus, both deeply parsed with a
Constraint Grammar parser. I am using the EMS to try different
factored models.

First, I wonder how I can use the much bigger monolingual corpus for
training the generation step. Where in the config or meta files can I
specify data to be used?
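For the language model I assume I can point an [LM] section at the bigger
corpus, roughly like this (I am guessing the variable names from the
comments in config.basic; the path is a placeholder):

```
[LM:mono-50m]
# use the 50M-word monolingual corpus for the target-side LM
raw-corpus = $working-dir/data/mono.50m.$output-extension
order = 5
```

But I could not find an analogous setting that would feed the same corpus
into the training of the generation table.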

Second, since my data is already tokenised, parsed, factorised and
lowercased, how can I tell EMS to skip those steps and, if possible,
evaluate the result without truecasing, detokenising and wrapping?
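In other words, I would like to hand EMS the already-processed files
directly, something like this (again guessing the variable names from
config.basic; paths are placeholders):

```
[CORPUS:parallel]
# skip tokenisation and lowercasing: supply the preprocessed stems directly
lowercased-stem = $working-dir/data/parallel.lc
```

and then, if such switches exist, turn off the detokeniser and truecaser
in the [EVALUATION] section as well.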

Third, could minimising punctuation in the data (after segmentation,
tokenisation and parsing) be a good way to reduce the sparseness of
higher-order n-grams? The punctuation in my data is very diverse and
inconsistent.
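What I have in mind is a simple normalisation pass that collapses the many
punctuation variants into a few canonical classes before training. A
minimal sketch (the mapping below is only illustrative; which marks should
be merged is exactly what I am unsure about):

```python
# Collapse diverse punctuation tokens into canonical classes to shrink
# the n-gram vocabulary. The mapping is an example, not a recommendation.
PUNCT_MAP = {
    "!": ".", "?": ".", ";": ",", ":": ",",
    "\u2013": "-", "\u2014": "-",   # en/em dashes -> hyphen
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
}

def normalise_punct(tokens):
    """Map each punctuation token to its canonical form; leave words alone."""
    return [PUNCT_MAP.get(t, t) for t in tokens]

print(normalise_punct(["Hello", "\u2014", "world", "!"]))
```

This would run on the surface-form factor only, leaving the other factors
untouched.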

Sonja
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
