Message from Hèctor Alòs i Font <hectora...@gmail.com> on Mon, 19 Oct 2020 at 20:07:
> On Mon, 19 Oct 2020 at 19:58, Xavi Ivars <xavi.iv...@gmail.com> wrote:
>
>> Well, that's only "part" of the corpus... and for the Europarl, that part
>> of the corpus was not left "as is" after Apertium, but also postedited.
>>
>
> Wow! Did you postedit the whole Europarl corpus?! No matter if you used
> Apertium or not, it's clear that you did tons of work. If it is explained
> somewhere how Softcatalà did the work, and with how many resources (time,
> volunteers, money), please let us know. It has to be an excellent test case
> to show whether a (real) under-resourced language can or cannot reach the
> stuff needed for neural translation.
>

No. The corpus was not postedited. It has 2 million sentences. I tried to
get as good a Catalan translation as possible. What I did was:

- Try to cover all relevant vocabulary: all non-capitalized words that
  appear at least 4-5 times in the corpus.
- Fix spelling and grammar errors in the Spanish corpus using LanguageTool
  (for example, missing diacritics or agreement errors). The Spanish text is
  worse than expected.
- Fix many common errors in the spa-cat Apertium translation.

This work is not complete. To finish it, we would probably need 3-4 months
of full-time work or more. Anyway, a neural translator can work even if a
percentage of the corpus is not perfect.

Jaume Ortolà
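PS: The vocabulary-coverage step above (non-capitalized words that appear at least 4-5 times) could be sketched roughly as follows. This is only an illustration, not the scripts actually used: the function names, the simple letter-based tokenizer, and the `known_words` set (standing in for whatever the spa-cat dictionaries already cover) are all assumptions.

```python
import re
from collections import Counter

# Assumption: a token is a run of Unicode letters. Real tokenization
# (clitics, hyphens, apostrophes) would need more care than this.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def frequent_lowercase_words(sentences, min_count=4):
    """Count non-capitalized tokens and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        for token in WORD_RE.findall(sentence):
            # Skip capitalized tokens (mostly proper nouns and sentence starts).
            if token[0].islower():
                counts[token] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


def uncovered_words(sentences, known_words, min_count=4):
    """Return frequent lowercase words that are missing from known_words.

    known_words is a hypothetical stand-in for the vocabulary the
    translator's dictionaries already handle.
    """
    frequent = frequent_lowercase_words(sentences, min_count)
    return sorted(word for word in frequent if word not in known_words)
```

For a 2-million-sentence corpus one would pass a generator that reads the file line by line rather than a list, so the text never has to be held in memory at once.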
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff