Hi, which corpus are you talking about? Where did you get it from, and what kind of processing did you do besides tokenization?
At some point the number of lines got out of sync, and you should narrow down when that happened (a quick stage-by-stage check is sketched below the quoted message).

-phi

On Tue, Feb 9, 2010 at 9:24 AM, Pavani Y <[email protected]> wrote:
> Hi All,
>
> When we ran the tokenization and clean-corpus Perl scripts across the
> European Parliament corpus, there is a mismatch in the number of lines
> between the tokenized files (euro.tok.en and euro.tok.fr). Clean-corpus
> should drop the lines that have more than 40 words, and the output files
> euro_clean.en and euro_clean.fr do have the same number of lines, but
> their contents do not correspond. Attached are a description of the
> problem and the files I used.
>
> Is it OK to run training even though the line counts don't match, as
> shown below?
>
> [Fig 1: snapshot of the euro.tok.en file]
>
> [Fig 2: snapshot of the euro.tok.fr file]
>
> Fig 1 is a snapshot of the euro.tok.en file and Fig 2 is a snapshot of
> the euro.tok.fr file. They differ in the number of lines.
>
> Regards,
> Pavani Yerra
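A minimal sketch of the kind of stage-by-stage check suggested above: count the lines on each side of the parallel corpus after every processing step and stop at the first step where the two sides disagree. The file names are assumptions taken from the quoted message (euro.tok.en/euro.tok.fr, euro_clean.en/euro_clean.fr) and may differ in your setup.

#!/usr/bin/env python
# Compare line counts of the two sides of the parallel corpus at each
# processing stage and report the first stage where they diverge.
# File names below are assumptions from the quoted message.

import sys

STAGES = [
    ("tokenized", "euro.tok.en", "euro.tok.fr"),
    ("cleaned",   "euro_clean.en", "euro_clean.fr"),
]

def count_lines(path):
    """Return the number of lines in a file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

for stage, en_path, fr_path in STAGES:
    try:
        en_lines = count_lines(en_path)
        fr_lines = count_lines(fr_path)
    except IOError as e:
        sys.exit("cannot read file: %s" % e)
    status = "OK" if en_lines == fr_lines else "MISMATCH"
    print("%-10s %s=%d  %s=%d  %s" % (stage, en_path, en_lines,
                                      fr_path, fr_lines, status))
    if en_lines != fr_lines:
        # The sides went out of sync at (or before) this stage; fix the
        # input to this step before re-running the later steps.
        break

If the counts already differ in the tokenized files, the problem is upstream of clean-corpus, and training should not be run on mismatched files: Moses treats line N of the source file and line N of the target file as a translation pair, so any offset misaligns everything after it.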
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
