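Hi,

One thing you can do here is change the tokenization scheme based on the TreeTagger output, i.e. make a~la~derecha one word (using the tildes, for instance, to glue the parts together). That way every token in the factored file, including the surface form, is free of internal spaces and carries its own lemma and tag.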
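A rough sketch of that preprocessing step, in case it helps. It assumes the TreeTagger output is tab-separated with one token per line in the order word, tag, lemma (what you get with the -token and -lemma options). The function names glue, parse_treetagger and to_factored are made up for this example and are not part of Moses or TreeTagger.

```python
def glue(factor, joiner="~"):
    """Replace internal whitespace with the joiner so the factor is one token."""
    return joiner.join(factor.split())


def parse_treetagger(lines):
    """Yield (word, tag, lemma) from tab-separated lines (assumed column order)."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            # Skip blank lines and anything that is not a 3-column token line.
            continue
        yield tuple(parts)


def to_factored(rows, joiner="~"):
    """Build one 'word|lemma|tag word|lemma|tag ...' line from the parsed rows."""
    tokens = []
    for word, tag, lemma in rows:
        # Glue the surface form as well as the lemma, so that a multiword
        # expression such as "a la derecha" becomes the single token a~la~derecha.
        tokens.append("|".join((glue(word, joiner), glue(lemma, joiner), tag)))
    return " ".join(tokens)


if __name__ == "__main__":
    sample = [
        "gira\tVLfin\tgirar\n",
        "a la derecha\tADV\ta~la~derecha\n",
    ]
    print(to_factored(parse_treetagger(sample)))
```

The same gluing would have to be applied to any Spanish text you later pass to the decoder, so that the tokenization at test time matches the one used in training.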
-phi

On Thu, Feb 12, 2009 at 1:10 PM, Michael Zuckerman <[email protected]> wrote:
> Hello,
>
> We are trying to run factored training on a Spanish corpus. We first tag the
> corpus with TreeTagger, change the format to
> "<word>|<lemma>|<tag> <word>|<lemma>|<tag> ...", and then run the script
> train-factored-phrase-model.perl on it. The problem arises when there are
> phrases which are treated by TreeTagger as one word, for example
> "a la derecha|a~la~derecha|adv". Then train-factored-phrase-model.perl says
> that no factor was found for the word "a" and for the word "la" in the file.
> Is there a way to tell the script that "a la derecha" should be treated as
> one word?
>
> Thanks,
> Michael.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
