Hi, Gourab. I don't know if you already got other reviews in the IRC channel. Here are my five cents:
1) Did you do the coding challenge? This is a must.

2) It would be good to know more about the current state of the hin-ben pair. Because there isn't any information on this in your proposal, I've taken a look at the repositories on GitHub. I was surprised that no hin-ben pair has been created yet in the Apertium repository (although there is https://github.com/srj31/apertium-ben-hin). The hin monodix has 30,000+ entries and the ben monodix some 8,000. Furthermore, as I imagined, the morphological disambiguator for Hindi has very few rules (I guess they are not very necessary for translating to Urdu). So there is quite a lot of work. It'll be very hard to really create a translator with a WER below 25% (unless srj31's project already contains quite a lot of work that can be reused).

3) Are there any free sources that can be used to fill the bidix (e.g. Wiktionary)? Or do you plan to translate at least 10,000 Hindi words by hand? (12,000-14,000 words would be much better for getting a WER below 30%.) How many words will you be able to translate per day? This alone would take most of your time. And, since there are only 8,000 words in the Bengali monodix, you'll need to add many words to it, which also takes quite a lot of time. Again the same question: will you need to create these words (and maybe the paradigms) in the monodix yourself, or will you be able to get many new words (and their association to Apertium paradigms) from free electronic sources?

4) In fact, your targets seem to be more a wish than something achievable. I recommend that you try to create a week-by-week calendar, in order to better understand how much time you'll have to add words and to create transfer rules, morphological disambiguation rules and lexical selection rules. I don't know anything about Indo-Iranian languages, but all Indo-European languages I know need quite a lot of work on morphological disambiguation and, despite this, it is one of the main sources of errors in the Apertium translators.
You can take a look at these work plans:
https://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd#Workplan
https://wiki.apertium.org/wiki/User:Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan
(but take into account that in previous years the number of hours devoted to a GSoC project was twice as high as this year's)

5) Why do you have to improve the Bengali morphological analyser? Why add inflections for both Bangladeshi Bengali and Indian Bengali? The project is already too complex and overloaded to add the possibility of generating two flavours of Bengali (because it would be a matter of generating Bengali, not of parsing it for translating into Hindi). I would generate the Bengali that is currently in the Bengali monodix (the Indian one, I guess).

Best,
Hèctor

Message from Gourab Chakraborty IIIT Dharwad <19bcs...@iiitdwd.ac.in> on Mon, 29 March 2021 at 20:20:
> Hi all,
> I am planning to create the Apertium Hindi-Bengali language pair as per
> the suggestions I was given by the developers. The GSoC application window
> would begin soon, so I request the mentors to kindly give a review of my
> final proposal, for any last minute changes that are required.
>
> Thanks a lot!
> --
> Gourab Chakraborty
> IRC: gourab337
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff