Hi,

One thing you can do here is change the tokenization scheme based on
the TreeTagger output, i.e. make a~la~derecha a single word (using
tildes, for instance, to glue the parts together).
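
A minimal sketch of that gluing step in Python, run on the factored
file before training. It assumes that a surface word never contains
"|" itself, so any whitespace-separated piece without a "|" must be the
start of a multiword unit whose factors are attached to its last piece.
The function name and the sample tags are made up for illustration and
are not part of Moses or TreeTagger.

```python
def glue_multiword_tokens(line, glue="~", sep="|"):
    """Rewrite 'a la derecha|a~la~derecha|adv' as 'a~la~derecha|a~la~derecha|adv'.

    Assumption: every complete token looks like word|lemma|tag and the
    surface word never contains the factor separator.
    """
    out = []
    pending = []  # pieces of a multiword surface form collected so far
    for piece in line.split():
        if sep not in piece:
            # No factors attached: this is a leading part of a multiword unit.
            pending.append(piece)
            continue
        word, factors = piece.split(sep, 1)
        pending.append(word)
        out.append(glue.join(pending) + sep + factors)
        pending = []
    if pending:
        # Trailing pieces with no factors: the input line is malformed.
        raise ValueError("no factors found for: " + " ".join(pending))
    return " ".join(out)


if __name__ == "__main__":
    sample = "gira|girar|v a la derecha|a~la~derecha|adv"
    print(glue_multiword_tokens(sample))
```

Whatever gluing you apply to the training corpus should also be applied
to the text you later translate, so that both use the same tokens.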

-phi

On Thu, Feb 12, 2009 at 1:10 PM, Michael Zuckerman
<[email protected]> wrote:
> Hello,
>
> We are trying to run factored training on a Spanish corpus. We first tag the
> corpus with TreeTagger, change the format to "<word>|<lemma>|<tag>
> <word>|<lemma>|<tag> ...", and then run the script
> train-factored-phrase-model.perl on it. The problem arises when there are
> phrases which are treated by TreeTagger as one word, for example
> "a la derecha|a~la~derecha|adv". Then train-factored-phrase-model.perl says
> that no factor was found for the word "a" and for the word "la" in the file.
> Is there a way to tell the script that "a la derecha" should be treated as
> one word?
>
> Thanks,
>      Michael.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>