2010/9/7 Hieu Hoang <[email protected]>:
> you can create an extra language model from the monolingual corpus.

yes, i combine the target language side of the parallel corpus and the
monolingual corpus for creating language models. what i wonder is how
i can do the same when creating generation models. on the page
http://www.statmt.org/moses/?n=Moses.FactoredTutorial there is this
remark:

"To overcome this limitation, we may train generation models on large
monolingual corpora, where we expect to see all possible word forms."

but i didn't find any more specific instructions on how to do that.

> The factored model is really word-level so it can't use parse structures
> from a parser.

all tags of "my" parser are on word level, ranging from POS via
semantic and morphology information to phrase role and dependencies.

> If the parser emits word-level, POS tag-like categories for all words,
> then you can create sequence models over those tags. If you do that, then
> the translation model has to output these tag-like factors. I would use
> just 1 translation model which outputs all the required factors, rather
> than use a generation step. It usually gives better performance, unless
> you have lots of OOV words.

my target language (esperanto) is morphologically richer than my
source language (english) and highly regular, using a limited number
of affixes and endings to express case and number of nouns and
pronouns, transitivity of verbs etc. therefore it is not unlikely for
a certain word form, for example noun-plural-accusative (or
verb-intransitive), not to exist on the target side of my small
parallel corpus, although it exists as noun-plural-nominative (or
verb-transitive) in the phrase table, especially on the lemma-lemma
level. if the translation step of my model provides lemma and
morphology (or number of daughter nouns, that i extracted using the
dependency information during preprocessing) of that word form, it
should be easy to find the correct surface form using the huge
monolingual corpus. a generation step trained on a huge corpus should
also provide reliable translations of prepositions, given the mother
verb and daughter noun, etc., but i am not sure how it would affect the
translation of higher-order n-grams.
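
to make the idea concrete, here is a minimal sketch in plain python (not moses code) of the kind of generation table i have in mind: it counts how often each (lemma, morphology) pair is realised as a given surface form in a monolingual corpus and turns the counts into relative frequencies. the factor order surface|lemma|morph, the tag names and the function name are only my assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_generation_table(lines, in_factors=(1, 2), out_factor=0):
    """Estimate p(surface | lemma, morph) from factored monolingual text.

    Each token is assumed to look like surface|lemma|morph (hypothetical
    factor order); tokens with too few factors are skipped.
    """
    needed = max(max(in_factors), out_factor) + 1
    counts = defaultdict(Counter)
    for line in lines:
        for token in line.split():
            factors = token.split("|")
            if len(factors) < needed:
                continue  # malformed token, ignore it
            key = tuple(factors[i] for i in in_factors)
            counts[key][factors[out_factor]] += 1

    # turn raw counts into relative frequencies per (lemma, morph) key
    table = {}
    for key, surfaces in counts.items():
        total = sum(surfaces.values())
        table[key] = {s: c / total for s, c in surfaces.items()}
    return table


# toy monolingual corpus with made-up tags
corpus = [
    "la|la|art hundoj|hundo|n-pl-nom kuras|kuri|v-pres",
    "mi|mi|pron vidas|vidi|v-pres la|la|art hundojn|hundo|n-pl-acc",
]
table = build_generation_table(corpus)
print(table[("hundo", "n-pl-acc")])  # {'hundojn': 1.0}
```

with a table like this, a lemma plus morphology coming out of the translation step could be mapped to a surface form even if that form never occurs in the parallel corpus, as long as it occurs in the monolingual one.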

I will also repeat my other two questions in case someone could answer them:

>> Second, since my data is already tokenised, parsed, factorised and
>> lowercased, how can I tell EMS to skip those steps and, if possible,
>> evaluate the result without truecasing, detokenising and wrapping?
>>
>> Third, could minimising punctuation in the data (after segmentation,
>> tokenisation and parsing) be a good idea for reducing sparseness of
>> higher ngrams? The punctuation in my data is very diverse and
>> inconsistent.
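
to illustrate the third question: by "minimising" i mean something like the following sketch, which maps variant quotes and dashes to one symbol each and drops a punctuation token that merely repeats the previous one. it works on plain tokens (not factored ones), and the variant sets and the function name are only assumptions for illustration, not a recommended list.

```python
# hypothetical variant sets; a real corpus would need its own inventory
QUOTE_VARIANTS = {"\u00ab", "\u00bb", "\u201c", "\u201d", "\u201e", "''", "``"}
DASH_VARIANTS = {"\u2013", "\u2014", "--"}


def normalise_punctuation(tokens):
    """Map quote/dash variants to one symbol and collapse repeated punctuation."""
    out = []
    for tok in tokens:
        if tok in QUOTE_VARIANTS:
            tok = '"'
        elif tok in DASH_VARIANTS:
            tok = "-"
        is_punct = not any(ch.isalnum() for ch in tok)
        # drop a punctuation token that just repeats the previous token
        if out and is_punct and tok == out[-1]:
            continue
        out.append(tok)
    return out


example = normalise_punctuation("`` saluton '' ! ! li diris -- ne".split())
print(example)  # ['"', 'saluton', '"', '!', 'li', 'diris', '-', 'ne']
```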

regards,
sonja



>
> On 07/09/2010 13:07, Sonja PETROVIĆ LUNDBERG wrote:
>> Hi!
>>
>> I have a 2,5 million word parallel corpus and a 50 million word
>> monolingual target language corpus, both deeply parsed using a
>> Constraint Grammar parser. I am using the EMS to try different
>> factored models.
>>
>> First, I wonder how I can use the much bigger monolingual corpus for
>> training the generation step. Where in the config or meta files can I
>> specify data to be used?
>>
>> Second, since my data is already tokenised, parsed, factorised and
>> lowercased, how can I tell EMS to skip those steps and, if possible,
>> evaluate the result without truecasing, detokenising and wrapping?
>>
>> Third, could minimising punctuation in the data (after segmentation,
>> tokenisation and parsing) be a good idea for reducing sparseness of
>> higher ngrams? The punctuation in my data is very diverse and
>> inconsistent.
>>
>> Sonja
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>