Hi, which corpus are you talking about? Where did you get it from, and what kind of processing did you do besides tokenization?
At some point the number of lines got out of sync, and you should narrow down when that happened (a quick stage-by-stage check is sketched below the quoted message).

-phi

On Tue, Feb 9, 2010 at 9:24 AM, Pavani Y <[email protected]> wrote:
> Hi All,
>
> When we ran the tokenization and clean-corpus Perl scripts across the
> European Parliament corpus, there is a mismatch in the number of lines
> between the tokenized files (euro.tok.en and euro.tok.fr). Clean-corpus
> should drop the lines that have more than 40 words, and the output files
> euro_clean.en and euro_clean.fr do have the same number of lines, but
> their contents do not correspond. Attached are a description of the
> problem and the files I used.
>
> Is it OK to run training even though the line counts don't match, as
> shown below?
>
> [Fig 1: snapshot of the euro.tok.en file]
>
> [Fig 2: snapshot of the euro.tok.fr file]
>
> Fig 1 is a snapshot of the euro.tok.en file and Fig 2 is a snapshot of
> the euro.tok.fr file. They differ in the number of lines.
>
> Regards,
> Pavani Yerra
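A minimal sketch of the kind of stage-by-stage check suggested above: count the lines on each side of the parallel corpus after every processing step and stop at the first step where the two sides disagree. The file names are assumptions taken from the quoted message (euro.tok.en/euro.tok.fr, euro_clean.en/euro_clean.fr) and may differ in your setup.

#!/usr/bin/env python
# Compare line counts of the two sides of the parallel corpus at each
# processing stage and report the first stage where they diverge.
# File names below are assumptions from the quoted message.

import sys

STAGES = [
    ("tokenized", "euro.tok.en", "euro.tok.fr"),
    ("cleaned",   "euro_clean.en", "euro_clean.fr"),
]

def count_lines(path):
    """Return the number of lines in a file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

for stage, en_path, fr_path in STAGES:
    try:
        en_lines = count_lines(en_path)
        fr_lines = count_lines(fr_path)
    except IOError as e:
        sys.exit("cannot read file: %s" % e)
    status = "OK" if en_lines == fr_lines else "MISMATCH"
    print("%-10s %s=%d  %s=%d  %s" % (stage, en_path, en_lines,
                                      fr_path, fr_lines, status))
    if en_lines != fr_lines:
        # The sides went out of sync at (or before) this stage; fix the
        # input to this step before re-running the later steps.
        break

If the counts already differ in the tokenized files, the problem is upstream of clean-corpus, and training should not be run on mismatched files: Moses treats line N of the source file and line N of the target file as a translation pair, so any offset misaligns everything after it.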
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
