Message from Hèctor Alòs i Font <hectora...@gmail.com> on Mon, 19
Oct 2020 at 20:07:

> On Mon, 19 Oct 2020, 19:58, Xavi Ivars <xavi.iv...@gmail.com>
> wrote:
>
>> Well, that's only "part" of the corpus... and for Europarl, that part
>> of the corpus was not left "as is" after Apertium, but was also postedited.
>>
>
> Wow! Did you postedit the whole Europarl corpus?! Whether or not you used
> Apertium, it's clear that you did tons of work. If it is explained
> somewhere how Softcatalà did the work, and with how many resources (time,
> volunteers, money), please let us know. It would be an excellent test case
> to show whether a (real) under-resourced language can or cannot gather the
> material needed for neural translation.
>

No. The corpus was not postedited. It has 2 million sentences. I tried to
get as good a Catalan translation as possible. What I did was:

- Try to cover all relevant vocabulary: all non-capitalized words that
appear at least 4-5 times in the corpus.
- Fix spelling and grammar errors in the Spanish corpus using LanguageTool
(for example, missing diacritics or agreement errors). The Spanish text is
worse than expected.
- Fix many common errors in the spa-cat Apertium translation.
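
The first step above (finding the vocabulary worth covering) can be sketched
roughly as follows. This is only an illustration, not the actual tooling that
was used: it assumes the corpus is available as an iterable of plain-text
sentences, and the function name, the regular expression and the default
threshold of 4 are all hypothetical choices.

```python
import re
from collections import Counter

# Hypothetical helper, not the actual tooling used for the corpus.
# Matches runs of letters only (no digits, underscores or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")

def frequent_lowercase_words(sentences, min_count=4):
    """Count non-capitalized words and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        for word in WORD_RE.findall(sentence):
            # Skip capitalized words (mostly proper nouns and sentence starts).
            if word[0].islower():
                counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_count}
```

The resulting word list could then be compared against the dictionaries to
see which frequent words are still missing.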

This work is not complete. To finish it, we would probably need 3-4 months of
full-time work, or more. Anyway, a neural translator can still work even if
some percentage of the corpus is not perfect.

Jaume Ortolà
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
