Re: [Apertium-stuff] Review for Apertium Hin-Ben

Hèctor Alòs i Font Tue, 30 Mar 2021 12:51:15 -0700

Hi, Gourab.

I don't know if you already got other reviews in the IRC channel. Here are
my five cents:


1) Did you do the coding challenge? This is a must.

2) It would be good to know more about the current state of the hin-ben
pair. Because there isn't any information on this in your proposal, I've
taken a look at the repositories on GitHub. I was surprised that no
hin-ben pair has been created yet in the Apertium repository (although
there is https://github.com/srj31/apertium-ben-hin). The hin monodix has
30,000+ entries and the ben monodix some 8,000. Furthermore, as I
imagined, the morphological disambiguator for Hindi has very few rules (I
guess they are not very necessary for translating to Urdu).

So there is quite a lot of work. It'll be very hard to really create a
translator with a WER below 25% (unless srj31's project already contains
quite a lot of work that can be reused).
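
For readers unfamiliar with the metric: WER (word error rate) is the
word-level edit distance between the machine output and a reference
translation, divided by the number of words in the reference. The sketch
below is only an illustration of that definition, assuming simple
whitespace tokenisation; the function name is hypothetical and this is
not the evaluation tooling actually used for Apertium pairs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER below 25% means that fewer than one word in
four of the reference has to be substituted, inserted or deleted to
correct the output.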

3) Are there any free sources that can be used to fill the bidix (e.g.
Wiktionary)? Or do you plan to translate at least 10,000 Hindi words by
hand? (Much better 12,000-14,000 words to get a WER below 30%.) How many
words will you be able to translate per day? This alone would take most
of your time. And, since there are only 8,000 words in the Bengali
monodix, you'll need to add many of them to the Bengali monodix, which
also takes quite a lot of time. Again the same question: will you need to
create these words (and maybe the paradigms) in the monodix yourself, or
will you be able to get many new words (and their association to Apertium
paradigms) from free electronic sources?
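
The arithmetic behind the "words per day" question can be sketched as
follows. The targets (10,000 bidix entries, 8,000 existing Bengali
entries) come from the figures above; the daily rates are purely
hypothetical placeholders to be replaced with your own estimates, and the
function name is made up for illustration.

```python
import math


def days_needed(target_entries: int, existing_entries: int,
                entries_per_day: int) -> int:
    """Working days required to grow a dictionary to the target size."""
    if entries_per_day <= 0:
        raise ValueError("entries_per_day must be positive")
    remaining = max(target_entries - existing_entries, 0)
    return math.ceil(remaining / entries_per_day)


# Bidix: assumed to start from zero; 200 entries/day is a hypothetical rate.
bidix_days = days_needed(10_000, 0, 200)

# Bengali monodix: optimistic lower bound that assumes all 8,000 existing
# entries are among the 10,000 needed; 100 entries/day is hypothetical.
monodix_days = days_needed(10_000, 8_000, 100)

print(f"bidix: {bidix_days} days, ben monodix: {monodix_days} days")
```

Comparing the resulting number of days with the time available in the
coding period shows quickly whether hand translation is realistic or
whether free electronic sources are indispensable.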

4) In fact, your targets seem to be more a wish than something
achievable. I recommend that you try to create a week-by-week calendar,
in order to better understand how much time you'll have to add words and
to create transfer rules, morphological disambiguation rules and lexical
selection rules. I don't know anything about Indo-Iranian languages, but
all the Indo-European languages I know need quite a lot of work on
morphological disambiguation and, despite this, it remains one of the
main sources of errors in the Apertium translators.

You can take a look at these work plans:
https://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd#Workplan

https://wiki.apertium.org/wiki/User:Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan
(but take into account that in previous years the number of hours
devoted to a GSoC project was twice as high as this year's)

5) Why do you have to improve the Bengali morphological analyser? Why
add inflections for both Bangladeshi Bengali and Indian Bengali? The
project is already too complex and overloaded to add the possibility of
generating two flavours of Bengali (because it would be a matter of
generating Bengali, not of parsing it for translating into Hindi). I would
generate the Bengali that is currently in the Bengali monodix (the Indian
one, I guess).

Best,
Hèctor

Message from Gourab Chakraborty IIIT Dharwad <[email protected]> on
Mon, 29 March 2021 at 20:20:

> Hi all,
> I am planning to create the Apertium Hindi-Bengali language pair as per
> the suggestions I was given by the developers. The GSoC application window
> would begin soon, so I request the mentors to kindly give a review of my
> final proposal, for any last minute changes that are required.
>
> Thanks a lot!
> --
> Gourab Chakraborty
> IRC: gourab337
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>

