Hi Sonja,

You can create an extra language model from the monolingual corpus.
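Concretely, in the EMS config you can add a second [LM] section that points at the monolingual corpus; it gets its own feature weight in the log-linear model, tuned like any other LM. A minimal sketch (the path is a placeholder; section and option names follow the usual EMS conventions):

```
[LM:mono]
# hypothetical path to the 50M-word monolingual target-language corpus
raw-corpus = /path/to/monolingual.txt
order = 5
```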

The factored model is really word-level, so it can't use parse structures 
from a parser. You can try a technique such as
http://www.mt-archive.info/WMT-2010-Bisazza.pdf
if you want to use constituency structures.

If the parser emits word-level, POS-tag-like categories for all words, 
then you can create sequence models over those tags. If you do that, the 
translation model has to output these tag-like factors. I would use 
just one translation model which outputs all the required factors, rather 
than a generation step. That usually gives better performance, unless 
you have lots of OOV words.
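As a rough sketch of that setup (the factor name "tag" and the path are illustrative; the option names follow the Moses EMS factored-model conventions): a single translation table that outputs both factors, plus a higher-order sequence model over the tags, would look something like this:

```
[TRAINING]
input-factors = word
output-factors = word tag
alignment-factors = "word -> word"
# one translation table outputting both factors -- no generation step
translation-factors = "word -> word+tag"
reordering-factors = "word -> word"
decoding-steps = "t0"

[LM:tag-seq]
# sequence model over the tag factor; sparseness is low, so a high
# order is affordable
factors = "tag"
order = 7
# hypothetical path; the monolingual corpus can be used here
raw-corpus = /path/to/monolingual.factored
```

The single mapping "word -> word+tag" is what replaces the generation step.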

On 07/09/2010 13:07, Sonja PETROVIĆ LUNDBERG wrote:
> Hi!
>
> I have a 2,5 million word parallel corpus and a 50 million word
> monolingual target language corpus, both deeply parsed using a
> Constraint Grammar parser. I am using the EMS to try different
> factored models.
>
> First, I wonder how I can use the much bigger monolingual corpus for
> training the generation step. Where in the config or meta files can I
> specify data to be used?
>
> Second, since my data is already tokenised, parsed, factorised and
> lowercased, how can I tell EMS to skip those steps and, if possible,
> evaluate the result without truecasing, detokenising and wrapping?
>
> Third, could minimising punctuation in the data (after segmentation,
> tokenisation and parsing) be a good idea for reducing sparseness of
> higher ngrams? The punctuation in my data is very diverse and
> inconsequent.
>
> Sonja
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
