Regarding the coverage of the Apertium English-Bengali pair: the naïve coverage seems high enough (>70%), with precision above 99.6% and a negligible word error rate. One option is to increase the coverage further; the other, as Hèctor suggested, is to create a new language pair (Bengali-Hindi) that would be ready for publication. I would personally like to try the second, since creating a new pair for Apertium seems more interesting. Which direction should I take for GSoC? Also, if I proceed with the second option, should I create an initial PR for apertium-hin-ben?
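For reference, here is a minimal sketch of how the two figures above can be computed. It assumes the analyser output is in the Apertium stream format, where an unknown token appears as `^word/*word$`, and it ignores escaped characters inside lexical units. The function names and the sample strings are only illustrative, not part of any Apertium tool.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: no escaped '/', '^' or '$' inside a unit (kept simple on purpose).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Fraction of tokens that received at least one analysis.

    A token counts as unknown when its analysis starts with '*'.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical analyser output: two known tokens, two unknown ones.
sample = "^ami/ami<prn>$ ^xyz/*xyz$ ^boi/boi<n>$ ^abc/*abc$"
coverage = naive_coverage(sample)

# Hypothetical reference translation and system output.
wer = word_error_rate("a b c d", "a x c")
```

On a real corpus, the analysed text would be the analyser's output for the whole file, and the WER reference would be a post-edited translation of the test text.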
Thanks!

On Tue, Mar 23, 2021 at 10:04 AM Gourab Chakraborty IIIT Dharwad
<19bcs...@iiitdwd.ac.in> wrote:

> Thanks a lot, Hèctor, for the feedback. I will change my proposal to
> "Creation of a language pair (Hindi-Bengali) that is ready for
> publication". I'm also working on the corpus coverage of -ben, as Daniel
> suggested. I'm focusing on apertium-ben for now, for the Hindi-Bengali
> language pair. Once again, thanks a lot for the feedback!
>
> On Tue, Mar 23, 2021 at 9:41 AM Hèctor Alòs i Font <hectora...@gmail.com>
> wrote:
>
>> Hi Gourab,
>>
>> There was, a long time ago, some work on Bengali:
>> Faridee AZM, Tyers FM (2009) Development of a morphological analyser for
>> Bengali. In: Pérez-Ortiz J, Sánchez-Martínez F, Tyers F (eds) Proceedings
>> of the First International Workshop on Free/Open-Source Rule-Based
>> Machine Translation, Universidad de Alicante. Departamento de Lenguajes y
>> Sistemas Informáticos, Alicante, Spain, pp 43–50.
>>
>> You should see how much it covers, as Daniel said. If the basis is done,
>> as I imagine, it would be more interesting to orient the proposal towards
>> the creation of a pair that is ready for publication. We have quite a few
>> parsers in different states of evolution, in particular for Indian
>> languages, but relatively few released pairs. It would be very interesting
>> to have a "Bengali - another Indo-Iranian language" pair. Hindi-Bengali
>> would probably be the best option, as Hindi and Urdu are, to date, the only
>> languages that have been released in Apertium. Given that there is much
>> less time available in GSoC this year, one option would be to work mainly
>> in one direction. Hindi to Bengali would be the easiest option because
>> it would also avoid having to work a lot on morphological disambiguation
>> (which should be more or less satisfactorily solved for Hindi). This would
>> make the project concentrate on 1) finishing the morphological analysis of
>> Bengali, 2) creating/expanding the transfer rules, 3) creating the lexical
>> selection rules, 4) adding several thousand words to the bidix, 5) testing
>> on real texts to fine-tune the translator and presenting a finished
>> translator with a WER of less than 25%, ready for publication, at the end
>> of the project. Last but not least, a Hindi-to-Bengali translator should,
>> as a rule, be easier for a Bengali speaker to create than the opposite
>> direction.
>>
>> Hèctor
>>
>> Message from Daniel Swanson <awesomeevildu...@gmail.com> on Tue, 23
>> March 2021 at 0:11:
>>
>>> Hi Gourab,
>>>
>>> My recommendation would be to evaluate the current status of -ben and
>>> -bn-en in terms of corpus coverage and WER and then incorporate into
>>> your proposal what those numbers are now and how much you think you
>>> can improve them.
>>>
>>> A pull request to one of the repositories involved would also be
>>> worthwhile, both in terms of your understanding of how to accomplish
>>> the tasks in your proposal and for the mentors to be able to evaluate
>>> your proposal.
>>>
>>> Daniel
>>>
>>> On Mon, Mar 22, 2021 at 3:06 PM Gourab Chakraborty IIIT Dharwad
>>> <19bcs...@iiitdwd.ac.in> wrote:
>>> >
>>> > Hi,
>>> > I would like to participate in GSoC and am interested in contributing
>>> > to improving the transfer system for apertium-bn-en. My work would
>>> > fall in the "Develop a morphological analyser" category of the ideas
>>> > list. I'm a native speaker of Bengali and am really excited about the
>>> > project.
>>> >
>>> > I have gone through the official documentation and have already set
>>> > up Apertium on my Ubuntu system.
>>> >
>>> > I have prepared a draft of my GSoC proposal (
>>> > https://docs.google.com/document/d/1S5EY6Eddu4v1ZMqgkM0Kjl_27kBhZkDkEz0Ddmnrotk/edit?usp=sharing).
>>> > Since this is my first proposal for GSoC, I would really appreciate
>>> > any feedback. Also, what should I do next?
>>> >
>>> > Thank you
>>> > --
>>> > Gourab Chakraborty (IRC: gourab337)
>
> --
> Gourab Chakraborty
> 2nd year, CSE @ IIIT Dharwad

--
Gourab Chakraborty
2nd year, CSE @ IIIT Dharwad
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff