Re: [Apertium-stuff] Cleaning Parallel Corpus

Kevin Brubeck Unhammer Thu, 29 Apr 2021 03:04:51 -0700

VIVEK VICKY <vivekvicky...@gmail.com>
wrote:

> Hello everyone,
> The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
> in either language because a sentence was split in two, or two sentences
> were merged, during translation. This causes errors during
> lexical-training. Is this common in parallel corpora? Or is there a clean
> parallel corpus out there?
> Right now, I am translating the sentences around (above and below) the
> empty lines and manually merging/splitting them. Is there a better way to
> do this?


Can you give an example? I took a look at that corpus and haven't found
any unmatched lines yet. Make sure you use the es-en.en file when
pairing es with en (that is, don't use cs-en.en with es-en.es).
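
In case it helps with producing an example, here is a rough Python sketch of
how one could list the line numbers where exactly one side is empty, and drop
pairs with an empty side before training. The function names are made up for
illustration, and it assumes the usual layout of one sentence per line with
line N of one file corresponding to line N of the other.

```python
from itertools import zip_longest


def find_unmatched(src_lines, tgt_lines):
    """Return (line_number, src, tgt) where exactly one side is empty or missing."""
    unmatched = []
    # zip_longest pads the shorter input with None, so a length mismatch
    # between the two files also shows up as unmatched lines.
    for n, (s, t) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        s_empty = s is None or not s.strip()
        t_empty = t is None or not t.strip()
        if s_empty != t_empty:
            unmatched.append((n, (s or "").strip(), (t or "").strip()))
    return unmatched


def drop_empty_pairs(src_lines, tgt_lines):
    """Keep only the pairs where both sides contain text."""
    return [
        (s.strip(), t.strip())
        for s, t in zip(src_lines, tgt_lines)
        if s.strip() and t.strip()
    ]
```

Both functions take any iterable of lines, so two open file objects (the .en
and .es files of the same language pair) can be passed in directly. Note that
dropping pairs only removes the empty lines; it does not repair sentences
that were split or merged, which would need a real sentence aligner.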

(It *is* common to find semi-parallel corpora out there, but I suppose
we can leave sentence alignment out of the GSoC task unless
there's extra time, and assume corpora will be fairly clean.)


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
