Awesome 👍👍 I will try it out. Thanks!!

On Thu, 29 Apr, 2021, 11:31 pm Tanmai Khanna, <khanna.tan...@gmail.com> wrote:
> Since you have only about 5-8 such sentences for every 2000 lines, and it
> seems like empty lines are a reliable marker for these kinds of situations,
> something I would do is to prune the corpus: remove any empty line, along
> with the two lines before and the two lines after it, from both the English
> and Spanish corpora. You'd lose some sentences to train on, but the loss
> would be negligible, and the remaining corpus would be aligned.
>
> Just a thought
>
> *तन्मय खन्ना*
> *Tanmai Khanna*
>
>
> On Thu, Apr 29, 2021 at 6:23 PM VIVEK VICKY <vivekvicky...@gmail.com> wrote:
>
>> On Thu, Apr 29, 2021 at 3:35 PM Kevin Brubeck Unhammer <unham...@fsfe.org> wrote:
>>
>>> VIVEK VICKY <vivekvicky...@gmail.com> wrote:
>>>
>>> > Hello everyone,
>>> > The eng-spa parallel corpora I am using (http://www.statmt.org/europarl/,
>>> > http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) have empty lines
>>> > in either language, caused by a sentence being split in two or two
>>> > sentences being merged in the translation, and these are causing errors
>>> > during lexical training. Is this common in parallel corpora, or is there
>>> > a clean parallel corpus out there?
>>> > Right now, I am translating the sentences around (above and below) the
>>> > empty lines and manually merging/splitting them. Is there a better way
>>> > to do this?
>>>
>>> Can you give an example? I took a look at that corpus and haven't found
>>> any unmatched lines yet.
>>
>> In Europarl's spa-eng corpus, line 104 of the English text, "now he is
>> doing just the same", is shifted to line 105 in the Spanish text. This is
>> just one example (look for empty lines in both languages). There are
>> around 5-8 such sentences for every 2000.
>>
>>> Make sure you use the es-en.en file when pairing es with en (that is,
>>> don't use cs-en.en with es-en.es).
>>
>> Yes, indeed.
>>
>>> (It *is* common to find semi-parallel corpora out there, but I suppose
>>> we can leave sentence alignment out of the GSoC task unless there's
>>> extra time, and assume corpora will be fairly clean.)
>>
>> We won't get valid rules if we train on semi-parallel corpora, right? Our
>> script assumes the sentences are perfectly aligned.
>> PS: These corpora are perfectly sentence-aligned, except for a FEW
>> sentences which are split or merged in the other language. Hence the
>> blank lines.
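The pruning idea suggested above (drop every empty line, plus the two lines before and after it, from both sides of the parallel corpus so the surviving lines stay aligned) could be sketched roughly like this. This is only an illustration, not part of the thread; the function name and the `window` parameter are hypothetical.

```python
# Sketch of the pruning approach: remove any position where either side of
# the parallel corpus is empty, together with `window` lines of context on
# each side, from BOTH files so alignment is preserved.

def prune_parallel(src_lines, tgt_lines, window=2):
    """Return (src, tgt) line lists with empty-line regions removed.

    A position is dropped if either side is empty there, or if it lies
    within `window` lines of such a position.
    """
    n = min(len(src_lines), len(tgt_lines))
    bad = set()
    for i in range(n):
        if not src_lines[i].strip() or not tgt_lines[i].strip():
            # Mark the empty position plus `window` lines on each side.
            for j in range(max(0, i - window), min(n, i + window + 1)):
                bad.add(j)
    keep = [i for i in range(n) if i not in bad]
    return ([src_lines[i] for i in keep],
            [tgt_lines[i] for i in keep])
```

Applied to the Europarl files, one would read both sides (e.g. the es-en `.en` and `.es` files) with `splitlines()`, run them through `prune_parallel`, and write the two kept lists back out; since the same indices are dropped from both files, the result stays sentence-aligned.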
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff