Hi Sonja,

You can create an extra language model from the monolingual corpus.
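As a rough sketch of how that might look in the EMS config: you can add an extra [LM] section that points at the monolingual data rather than the parallel corpus. The section names, paths, and factor names below are placeholders, and the exact settings depend on your EMS setup:

```ini
# Extra language model over the surface (word) factor,
# trained on the large monolingual corpus.
[LM:mono-surface]
factors = "word"
order = 5
raw-corpus = /path/to/monolingual-corpus.target

# Optionally, a higher-order sequence model over the POS-like factor
# from the same monolingual data.
[LM:mono-pos]
factors = "pos"
order = 7
raw-corpus = /path/to/monolingual-corpus.target
```

Each [LM:name] section produces one language model, so you can keep the models trained on the parallel corpus and simply add these alongside them.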
The factored model is word-level, so it can't use parse structures from a parser directly. If you want to use constituency structure, you can try a technique such as http://www.mt-archive.info/WMT-2010-Bisazza.pdf. If the parser emits word-level, POS-tag-like categories for every word, then you can build sequence models over those tags. In that case, the translation model has to output these tag-like factors. I would use a single translation model that outputs all the required factors, rather than a generation step; it usually gives better performance, unless you have lots of OOV words.

On 07/09/2010 13:07, Sonja PETROVIĆ LUNDBERG wrote:
> Hi!
>
> I have a 2.5 million word parallel corpus and a 50 million word
> monolingual target-language corpus, both deeply parsed using a
> Constraint Grammar parser. I am using the EMS to try different
> factored models.
>
> First, I wonder how I can use the much bigger monolingual corpus for
> training the generation step. Where in the config or meta files can I
> specify the data to be used?
>
> Second, since my data is already tokenised, parsed, factorised and
> lowercased, how can I tell EMS to skip those steps and, if possible,
> evaluate the result without truecasing, detokenising and wrapping?
>
> Third, could minimising punctuation in the data (after segmentation,
> tokenisation and parsing) be a good idea for reducing the sparseness of
> higher n-grams? The punctuation in my data is very diverse and
> inconsistent.
>
> Sonja
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
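P.S. To make the single-translation-model suggestion concrete, here is a sketch of the relevant [TRAINING] settings in an EMS config, assuming just two factors named "word" and "pos" (the factor names and layout are illustrative; adapt them to your own factorisation):

```ini
[TRAINING]
input-factors = word pos
output-factors = word pos
alignment-factors = "word -> word"
# One translation step that outputs both target factors at once,
# instead of translating words and then generating the POS factor.
translation-factors = "word -> word+pos"
decoding-steps = "t0"
```

With this mapping there is no generation step at all: the phrase table directly emits both target factors, and the sequence models over the POS factor then score them.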
