Awesome 👍👍 I will try it out. Thanks!!

On Thu, 29 Apr, 2021, 11:31 pm Tanmai Khanna, <khanna.tan...@gmail.com> wrote:
> Since you have only about 5-8 such sentences for every 2000 lines, and it
> seems like empty lines are a reliable marker for these kinds of situations,
> something I would do is to prune the corpus: remove any empty line, along
> with the two lines before and the two lines after it, from both the English
> and Spanish corpora. You'd lose some sentences to train on, but the loss
> would be negligible, and the remaining corpus would be aligned.
>
> Just a thought
>
> *तन्मय खन्ना*
> *Tanmai Khanna*
>
>
> On Thu, Apr 29, 2021 at 6:23 PM VIVEK VICKY <vivekvicky...@gmail.com> wrote:
>
>> On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
>>
>>> VIVEK VICKY <vivekvicky...@gmail.com> wrote:
>>>
>>> > Hello everyone,
>>> > The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
>>> > http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
>>> > in either language, caused by a sentence being split in two or two
>>> > sentences being merged in the translation, and these are causing errors
>>> > during lexical training. Is this common in parallel corpora, or is there
>>> > a clean parallel corpus out there?
>>> > Right now, I am translating the sentences around (above and below) the
>>> > empty lines and manually merging/splitting them. Is there a better way
>>> > to do this?
>>>
>>> Can you give an example? I took a look at that corpus and haven't found
>>> any unmatched lines yet.
>>
>> In Europarl's spa-eng corpus, line 104 of the English text, "now he is
>> doing just the same", is shifted to line 105 in the Spanish text. This is
>> just one example (look for empty lines in both languages). There are
>> around 5-8 such sentences for every 2000.
>>
>>> Make sure you use the es-en.en file when pairing es with en (that is,
>>> don't use cs-en.en with es-en.es).
>>
>> Yes, indeed.
>>
>>> (It *is* common to find semi-parallel corpora out there, but I suppose
>>> we can leave sentence alignment out of the GSoC task unless there's
>>> extra time, and assume corpora will be fairly clean.)
>>
>> We won't get valid rules if we train on semi-parallel corpora, right? Our
>> script assumes the sentences are perfectly aligned.
>> PS: These corpora are perfectly sentence-aligned, except for a FEW
>> sentences which are split or merged in the other language. Hence the
>> blank lines.
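The pruning idea suggested above (drop every empty line, plus the two lines before and after it, from both sides of the parallel corpus so the surviving lines stay aligned) could be sketched roughly like this. This is only an illustration, not part of the thread; the function name and the `window` parameter are hypothetical.

```python
# Sketch of the pruning approach: remove any position where either side of
# the parallel corpus is empty, together with `window` lines of context on
# each side, from BOTH files so alignment is preserved.

def prune_parallel(src_lines, tgt_lines, window=2):
    """Return (src, tgt) line lists with empty-line regions removed.

    A position is dropped if either side is empty there, or if it lies
    within `window` lines of such a position.
    """
    n = min(len(src_lines), len(tgt_lines))
    bad = set()
    for i in range(n):
        if not src_lines[i].strip() or not tgt_lines[i].strip():
            # Mark the empty position plus `window` lines on each side.
            for j in range(max(0, i - window), min(n, i + window + 1)):
                bad.add(j)
    keep = [i for i in range(n) if i not in bad]
    return ([src_lines[i] for i in keep],
            [tgt_lines[i] for i in keep])
```

Applied to the Europarl files, one would read both sides (e.g. the es-en `.en` and `.es` files) with `splitlines()`, run them through `prune_parallel`, and write the two kept lists back out; since the same indices are dropped from both files, the result stays sentence-aligned.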
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff