VIVEK VICKY <vivekvicky...@gmail.com> wrote:
> Hello everyone,
> The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
> in either language, because a sentence was split in two or two sentences
> were merged after translation, which is causing errors during
> lexical training. Is this common in parallel corpora? Or is there any clean
> parallel corpus out there?
> Right now, I am translating the sentences around (above and below) the empty
> lines and manually merging/splitting them. Is there any better way to do
> this?

Can you give an example? I took a look at that corpus and haven't found any unmatched lines yet. Make sure you use the es-en.en file when pairing es with en (that is, don't use cs-en.en with es-en.es).

(It *is* common to find semi-parallel corpora out there, but I suppose we can leave sentence alignment out of the GSoC task unless there's extra time, and assume corpora will be fairly clean.)
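
If a few empty lines do turn up, something like the following might be a quicker stopgap than fixing them by hand. It is only a rough sketch, not a proper sentence aligner: the function name drop_empty_pairs is made up for illustration, and it assumes the two sides have already been read into lists with one sentence per line. It refuses to run if the two sides differ in length (which would suggest the wrong files were paired), and it drops every pair where either side is empty. The pairs next to a dropped one may still be half of a split sentence, so it is worth looking at the reported line numbers.

```python
from itertools import zip_longest


def drop_empty_pairs(src_lines, tgt_lines):
    """Filter a line-aligned parallel corpus (hypothetical helper).

    Returns (kept, dropped): kept is a list of (source, target) pairs
    where both sides are non-empty, dropped is a list of 1-based line
    numbers where at least one side was empty.
    """
    kept = []
    dropped = []
    pairs = zip_longest(src_lines, tgt_lines)
    for number, (src, tgt) in enumerate(pairs, start=1):
        # zip_longest pads the shorter side with None, so a None here
        # means the two files do not have the same number of lines.
        if src is None or tgt is None:
            raise ValueError(f"line counts differ at line {number}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            kept.append((src, tgt))
        else:
            dropped.append(number)
    return kept, dropped


if __name__ == "__main__":
    # Toy data standing in for the en and es sides of a corpus.
    english = ["Hello .", "", "Thank you ."]
    spanish = ["Hola .", "Buenos días .", "Gracias ."]
    kept, dropped = drop_empty_pairs(english, spanish)
    print(kept)
    print("dropped lines:", dropped)
```

The dropped line numbers are returned rather than silently discarded so that you can check by eye whether the neighbouring pairs are still properly aligned.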
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff