Hi! I have a 2.5-million-word parallel corpus and a 50-million-word monolingual target-language corpus, both deeply parsed with a Constraint Grammar parser. I am using the EMS to try different factored models.
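For reference, I currently declare the two corpora roughly like this in my config (the section and variable names follow the standard EMS example configs, and the paths below are just placeholders for my actual files):

```
[GENERAL]
input-extension = src
output-extension = trg

# the 2.5M-word parallel corpus
[CORPUS:parallel]
raw-stem = $data-dir/parallel

# the 50M-word monolingual target-language corpus,
# currently only used for language modelling
[LM:mono]
raw-corpus = $data-dir/mono.$output-extension
```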
First, how can I use the much bigger monolingual corpus for training the generation step? Where in the config or meta files can I specify the data to be used?

Second, since my data is already tokenised, parsed, factorised and lowercased, how can I tell EMS to skip those steps and, if possible, evaluate the result without truecasing, detokenising and wrapping?

Third, could minimising punctuation in the data (after segmentation, tokenisation and parsing) be a good way to reduce the sparseness of higher-order n-grams? The punctuation in my data is very diverse and inconsistent.

Sonja

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
