Hi,

the corpus filtering script that you are using expects a parallel corpus as two files, where each line in one file is the translation of the line with the same number in the other. Hence, both files need to have the same number of lines.
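A quick way to check this before running the script is to count the lines on both sides. The following is only a minimal sketch in Python, not part of Moses; the function names are made up for illustration, and the file names in the usage note assume that the script reads corpus.tok.low.de and corpus.tok.low.en for the command quoted below.

```python
def count_lines(path):
    """Count the lines in a file.

    The file is read in binary mode so that encoding problems cannot
    interrupt the count. A final line without a trailing newline is
    counted as a line too.
    """
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def compare_line_counts(src_path, tgt_path):
    """Return (source_lines, target_lines, same_length) for two files."""
    src_lines = count_lines(src_path)
    tgt_lines = count_lines(tgt_path)
    return src_lines, tgt_lines, src_lines == tgt_lines
```

For example, compare_line_counts("corpus.tok.low.de", "corpus.tok.low.en") returns both counts and a flag telling you whether they match. If the flag is False, the corpus is not line-aligned and the cleaning script cannot pair up the sentences.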
You will get the quoted error message if the two files have a different number of lines, which is not a valid starting point for this process. Either the data is bad, or you have to run a sentence aligner first.

-phi

On Tue, Jul 16, 2013 at 6:48 AM, Cyrine NASRI <[email protected]> wrote:
>
> Hello,
>
> I'm trying to filter out long sentences using clean-corpus-n.pl, it dies
> after a while saying "europarl.tok.fr is too short!"
>
> this what i do :
>
> clean-corpus-n.perl corpus.tok.low de en clean 1 50
>
> Could someone please tell me if there is something obvious that I'm missing?
> Regards,
>
> Cyrine
>
> --
> Cyrine NASRI
> Ph.D. Student in Computer Science
