2010/9/7 Hieu Hoang <[email protected]>:
> you can create an extra language model from the monolingual corpus.
Yes, I combine the target-language side of the parallel corpus with the monolingual corpus when creating language models. What I wonder is how I can do the same when creating generation models. The page http://www.statmt.org/moses/?n=Moses.FactoredTutorial mentions: "To overcome this limitation, we may train generation models on large monolingual corpora, where we expect to see all possible word forms." But I did not find any more specific instructions on how to do that.

> The factored model is really word-level so it can't use parse structures
> from a parser.

All tags of "my" parser are at the word level, ranging from POS via semantic and morphological information to phrase role and dependencies.

> If the parser emits word-level, POS tag-like categories for all words,
> then you create sequence models over those tags. If you do that, then
> the translation model has to output these tag-like factors. I would use
> just 1 translation model which outputs all the required factors, rather
> than use a generation step. It usually gives better performance, unless
> you have lots of OOV words.

My target language (Esperanto) is morphologically richer than my source language (English) and highly regular, using a limited number of affixes and endings to express the case and number of nouns and pronouns, the transitivity of verbs, etc. Therefore it is not unlikely for a certain word form, for example noun-plural-accusative (or verb-intransitive), to be missing from the target side of my small parallel corpus, even though it exists as noun-plural-nominative (or verb-transitive) in the phrase table, especially at the lemma-lemma level. If the translation step of my model provides the lemma and morphology (or the number of daughter nouns, which I extracted using the dependency information during preprocessing) of that word form, it should be easy to find the correct surface form using the huge monolingual corpus.
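To make the idea concrete: a generation table of the kind described here can be estimated by relative-frequency counting over a factored monolingual corpus. The sketch below is only an illustration of that counting step, not Moses's actual training code; the `surface|lemma|morph` token format and the Esperanto-like example tokens are assumptions for the sake of the example.

```python
from collections import Counter, defaultdict

def train_generation_table(lines):
    """Estimate P(surface | lemma, morph) by relative frequency
    from a factored corpus whose tokens look like surface|lemma|morph."""
    counts = defaultdict(Counter)
    for line in lines:
        for token in line.split():
            surface, lemma, morph = token.split("|")
            counts[(lemma, morph)][surface] += 1
    table = {}
    for key, surface_counts in counts.items():
        total = sum(surface_counts.values())
        table[key] = {s: c / total for s, c in surface_counts.items()}
    return table

# Tiny usage example with made-up factored tokens:
corpus = [
    "hundojn|hundo|noun-pl-acc vidas|vidi|verb-pres",
    "hundoj|hundo|noun-pl-nom kuras|kuri|verb-pres",
]
table = train_generation_table(corpus)
print(table[("hundo", "noun-pl-acc")])  # {'hundojn': 1.0}
```

With a large monolingual corpus, a (lemma, morphology) pair that never occurs on the target side of the parallel data can still receive a surface form this way, which is exactly the motivation quoted from the factored-models tutorial above.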
A generation step using a huge corpus should also provide reliable translations of prepositions, given the mother verb and daughter noun, etc., but I am not sure how it would affect the translation of higher n-grams.

I will also repeat my other two questions in case someone can answer them:

>> Second, since my data is already tokenised, parsed, factorised and
>> lowercased, how can I tell EMS to skip those steps and, if possible,
>> evaluate the result without truecasing, detokenising and wrapping?
>>
>> Third, could minimising punctuation in the data (after segmentation,
>> tokenisation and parsing) be a good idea for reducing sparseness of
>> higher n-grams? The punctuation in my data is very diverse and
>> inconsistent.

Regards,
Sonja

> On 07/09/2010 13:07, Sonja PETROVIĆ LUNDBERG wrote:
>> Hi!
>>
>> I have a 2.5 million word parallel corpus and a 50 million word
>> monolingual target-language corpus, both deeply parsed using a
>> Constraint Grammar parser. I am using the EMS to try different
>> factored models.
>>
>> First, I wonder how I can use the much bigger monolingual corpus for
>> training the generation step. Where in the config or meta files can I
>> specify the data to be used?
>>
>> Second, since my data is already tokenised, parsed, factorised and
>> lowercased, how can I tell EMS to skip those steps and, if possible,
>> evaluate the result without truecasing, detokenising and wrapping?
>>
>> Third, could minimising punctuation in the data (after segmentation,
>> tokenisation and parsing) be a good idea for reducing sparseness of
>> higher n-grams? The punctuation in my data is very diverse and
>> inconsistent.
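On the third question, the intuition can be checked directly on one's own data: count how many distinct higher-order n-gram types the corpus yields with and without punctuation tokens. This is a minimal sketch, not part of Moses or EMS; the example sentences and the punctuation set (Python's ASCII `string.punctuation`) are placeholders, not Sonja's data.

```python
import string
from collections import Counter

def ngram_types(sentences, n, strip_punct=False):
    """Count distinct n-gram types, optionally dropping punctuation tokens."""
    types = set()
    for sent in sentences:
        tokens = sent.split()
        if strip_punct:
            tokens = [t for t in tokens if t not in string.punctuation]
        for i in range(len(tokens) - n + 1):
            types.add(tuple(tokens[i:i + n]))
    return len(types)

sents = [
    "la hundo , kiu kuras , bojas .",
    "la hundo kiu kuras bojas",
]
print(ngram_types(sents, 3))                     # trigram types with punctuation
print(ngram_types(sents, 3, strip_punct=True))   # fewer types after stripping
```

In this toy example the two sentences collapse to the same trigrams once the commas and full stop are removed, which is the sparseness reduction the question is asking about; whether the loss of punctuation hurts translation quality elsewhere is a separate empirical question.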
>>
>> Sonja

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
