Yes, in my experience, updating the dictionaries more or less simultaneously is the quickest option. I usually work from a spreadsheet with the words in decreasing order of frequency, and I write a script that reads it and generates the XML code to insert into the dictionaries. It's quick and it avoids lots of silly errors.

Hèctor
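A minimal sketch of the kind of script described above, not the actual script used. It assumes a CSV export of the spreadsheet with hypothetical column names (`frequency`, `lemma`, `paradigm`, `tag`, `translation`), and it assumes the common convention that a paradigm name such as `cas/a__n` marks the ending after the slash. The sample rows are illustrative placeholders, not real Bhojpuri-Hindi data, and multiword lemmas are not handled.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr

# Illustrative placeholder rows; the column names are assumptions.
SAMPLE = """frequency,lemma,paradigm,tag,translation
10,casa,cas/a__n,n,house
50,gat,gat__n,n,cat
"""


def stem_for(lemma, paradigm):
    """Return the lemma without the ending named in the paradigm (after '/')."""
    head = paradigm.split("__", 1)[0]
    if "/" not in head:
        return lemma
    ending = head.split("/", 1)[1]
    if not ending:
        return lemma
    if not lemma.endswith(ending):
        # Fail loudly: a lemma assigned to the wrong paradigm is a typical slip.
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm!r}")
    return lemma[: -len(ending)]


def monodix_entry(lemma, paradigm):
    """Build one monodix <e> element for a lemma and its paradigm."""
    stem = escape(stem_for(lemma, paradigm))
    return f"<e lm={quoteattr(lemma)}><i>{stem}</i><par n={quoteattr(paradigm)}/></e>"


def bidix_entry(lemma, translation, tag):
    """Build one bidix <e> element pairing a lemma with its translation."""
    s = f"<s n={quoteattr(tag)}/>"
    return f"<e><p><l>{escape(lemma)}{s}</l><r>{escape(translation)}{s}</r></p></e>"


def generate(csv_text):
    """Return (monodix entries, bidix entries), most frequent words first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r["frequency"]), reverse=True)
    mono = [monodix_entry(r["lemma"], r["paradigm"]) for r in rows]
    bi = [bidix_entry(r["lemma"], r["translation"], r["tag"]) for r in rows]
    return mono, bi


if __name__ == "__main__":
    mono, bi = generate(SAMPLE)
    print("\n".join(mono))
    print("\n".join(bi))
```

Generating both entry types from the same row keeps the monodix and the bidix in step, and raising an error on a mismatched paradigm catches the slip before it reaches the dictionary files.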
Message from Sevilay Bayatlı <sevilaybaya...@gmail.com> on Fri, 9 Apr 2021 at 11:58:

> Hi Anuradha,
>
> You need to update your proposal based on what Hèctor suggested. Yes, it
> is better to work on both the monodix and the bidix simultaneously, but for
> a good lexicon you need to take a small corpus, analyse the sentences, and
> add the words.
>
> Sevilay
>
> On Thu, Apr 8, 2021 at 9:24 AM Anuradha Pandey <anuradha200...@gmail.com>
> wrote:
>
>> Thank you for your response, Hèctor. I read the proposal for the
>> Hindi-Bengali translator. There aren't open-source dictionaries for the
>> Bhojpuri language (though there are resources for getting a Bhojpuri
>> corpus), so I was using a hardcopy of a BHO-HIN dictionary for manually
>> adding the pairs. I did some rough calculations, and I shall be able to add
>> at least 8,000 words to the monodix. And, based on my experience with
>> Apertium, I think simultaneously adding words to the bidix makes the work
>> easier, so I expect roughly the same number of words in the bidix too. But
>> I don't think I will be able to achieve a WER below 20% with 8,000 words.
>> Should I aim for a WER of nearly 30% then?
>>
>> Since the time for GSoC has been reduced, I am planning to modify my
>> proposal, and input from mentors would be extremely helpful.
>>
>> On Wed, 7 Apr 2021 at 20:24, Hèctor Alòs i Font <hectora...@gmail.com>
>> wrote:
>>
>>> Hi, Anuradha.
>>>
>>> Thanks for your proposal draft. First, I would like to tell you that if
>>> Apertium is a rule-based translation system, it is because this paradigm
>>> still makes sense for many languages (indeed, for the vast majority of
>>> them). If Bhojpuri has extensive electronic language resources and,
>>> particularly, bilingual linguistic corpora, then Apertium is probably not
>>> the best approach. But this is probably not the case. If it were, it would
>>> probably already be on Google Translate.
>>>
>>> As for the project, I would advise you to look at Gourab Chakraborty's
>>> proposal for a Hindi-Bengali translator and the comments on it. Most of the
>>> comments apply to your proposal as well. The following message would be
>>> useful to you, for instance:
>>> https://sourceforge.net/p/apertium/mailman/message/37251899/
>>>
>>> Your proposal seems unrealistic to me. 10,000 words in the monodix (and
>>> how many in the bidix?) are not enough for a WER below 20%, I think (maybe
>>> they would be for two extremely closely related languages).
>>>
>>> To better evaluate your proposal, I'd like answers to some basic
>>> questions:
>>>
>>> * What is the current state of the Bhojpuri language and, possibly, of
>>> the Bhojpuri-Hindi language pair in Apertium?
>>> * Would you have to write a whole Bhojpuri morphological analyser from
>>> scratch and, afterwards, add some 10,000 words, manually assigning them
>>> to a given paradigm? How much time will you need for this?
>>> * Where would you get the bilingual dictionary from? Would you have to
>>> create it yourself? Are there freely available bilingual electronic
>>> dictionaries (like e.g. Wiktionary)?
>>> * Would you work on a Bhojpuri-to-Hindi translator or on a
>>> Hindi-to-Bhojpuri one? In any case there will be quite a lot of work in
>>> the morphological disambiguation. But for one side you'll have it only
>>> once. If both Hindi-to-Bhojpuri and Hindi-to-Bengali are chosen (which is
>>> entirely possible), this work can be divided between the two projects.
>>>
>>> There is nothing wrong with doing all this work by hand, if needed. It
>>> depends on the state of the language resources for the given language. But
>>> it is necessary to know to what extent you will have to do this
>>> time-consuming work.
>>>
>>> Even when we had twice the time, in most cases the projects did not
>>> manage to create a working translator for a new language pair. Under the
>>> current conditions, it is even more difficult.
>>>
>>> Hèctor
>>>
>>> Message from Anuradha Pandey <anuradha200...@gmail.com> on Wed, 7 Apr 2021
>>> at 16:28:
>>>
>>>> Hello everyone,
>>>> I am Anuradha Pandey, a sophomore student at BITS Pilani. I am
>>>> interested in participating in GSoC 2021, on the project - "*Develop a
>>>> prototype MT system for a strategic language pair*".
>>>>
>>>> I have prepared a rough draft for the same and I am planning to build a
>>>> Bhojpuri (BHO)-Hindi (HIN) MT pair. I am improving my translation system
>>>> for the coding challenge and I will update my work on the GitHub
>>>> repository mentioned in the draft. It would be really helpful if I could
>>>> get some feedback before I make the final submission.
>>>>
>>>> Link to the draft -
>>>>
>>>> https://docs.google.com/document/d/1U19gJ3TMKYkYsp-FRthrvXkCRJUnNYSYKi46XhvZGOE/edit?usp=sharing
>>>>
>>>> Thanks & Regards,
>>>> Anuradha Pandey
>>>> IRC: Anuradha_Pandey
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff