Yes, in my experience, updating the dictionaries more or less simultaneously
is the quickest option.
I usually work on a spreadsheet with the words in decreasing order of
frequency, and I write a script that reads it and generates the XML entries
to insert into the dictionaries. It's quick and it avoids lots of silly
errors.
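
A minimal sketch of such a script, in Python. It is only an illustration
under stated assumptions, not the actual script: the column names (lemma,
stem, paradigm, pos, translation), the paradigm names and the sample words
are hypothetical placeholders, and the spreadsheet is assumed to be exported
as CSV. The entry shapes follow the usual monodix/bidix XML layout.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr

# Hypothetical spreadsheet export (CSV), already sorted by decreasing
# frequency. Words, paradigm names and column names are placeholders.
SAMPLE_CSV = """lemma,stem,paradigm,pos,translation
bho_noun_1,bho_noun_1,example__n,n,hin_noun_1
bho word 2,bho word 2,example__adj,adj,hin_word_2
bho_noun_1,bho_noun_1,example__n,n,hin_noun_1
"""


def dix_text(text):
    """Escape XML special characters and encode spaces as <b/>."""
    return escape(text).replace(" ", "<b/>")


def monodix_entry(lemma, stem, paradigm):
    """Build one monolingual dictionary entry."""
    return (f"<e lm={quoteattr(lemma)}><i>{dix_text(stem)}</i>"
            f"<par n={quoteattr(paradigm)}/></e>")


def bidix_entry(source, target, pos):
    """Build one bilingual dictionary entry with the same tag on both sides."""
    tag = f"<s n={quoteattr(pos)}/>"
    return (f"<e><p><l>{dix_text(source)}{tag}</l>"
            f"<r>{dix_text(target)}{tag}</r></p></e>")


def build_entries(csv_text):
    """Return (monodix_entries, bidix_entries), skipping duplicate rows."""
    mono, bi, seen = [], [], set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["lemma"], row["pos"])
        if key in seen:
            continue  # the same lemma/pos pair was already generated
        seen.add(key)
        mono.append(monodix_entry(row["lemma"], row["stem"], row["paradigm"]))
        bi.append(bidix_entry(row["lemma"], row["translation"], row["pos"]))
    return mono, bi


if __name__ == "__main__":
    mono_entries, bi_entries = build_entries(SAMPLE_CSV)
    print("\n".join(mono_entries))
    print("\n".join(bi_entries))
```

Generating both lists from the same row is what keeps the monodix and the
bidix in step, and skipping repeated lemma/pos pairs avoids one common kind
of silly error.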
Hèctor

Message from Sevilay Bayatlı <sevilaybaya...@gmail.com> on Fri, 9 Apr 2021
at 11:58:

> Hi Anuradha,
> You need to update your proposal based on what Hèctor suggested. Yes, it
> is better to work on both the monodix and the bidix simultaneously, but for
> a good lexicon you need to take a small corpus, analyse its sentences, and
> add words.
>
> Sevilay
>
> On Thu, Apr 8, 2021 at 9:24 AM Anuradha Pandey <anuradha200...@gmail.com>
> wrote:
>
>> Thank you for your response, Hèctor. I read the proposal for the
>> Hindi-Bengali translator. There aren't open-source dictionaries for the
>> Bhojpuri language (though there are resources for getting a Bhojpuri
>> corpus), so I was using a hard copy of a BHO-HIN dictionary to add the
>> pairs manually. I did some rough calculations, and I should be able to add
>> at least 8,000 words to the monodix. Based on my experience with Apertium,
>> I think adding words to the bidix at the same time makes the work easier,
>> so I expect roughly the same number of words in the bidix too. But I don't
>> think I will be able to achieve a WER below 20% with 8,000 words.
>> Should I aim for a WER of nearly 30% then?
>>
>> Since the time for GSoC has been reduced, I am planning to modify my
>> proposal, and input from the mentors would be extremely helpful.
>>
>> On Wed, 7 Apr 2021 at 20:24, Hèctor Alòs i Font <hectora...@gmail.com>
>> wrote:
>>
>>> Hi, Anuradha.
>>>
>>> Thanks for your proposal draft. First, I would like to tell you that if
>>> Apertium is a rule-based translation system, it is because this paradigm
>>> still makes sense for many languages (indeed, for the vast majority of
>>> them). If Bhojpuri has extensive electronic language resources and,
>>> particularly, bilingual linguistic corpora, then Apertium is probably not
>>> the best approach. But this is probably not the case. If it were, Bhojpuri
>>> would probably already be on Google Translate.
>>>
>>> As for the project. I would advise you to look at Gourab Chakraborty's
>>> proposal for a Hindi-Bengali translator and the comments on it. Most of the
>>> comments apply to your proposal as well. The following message would be
>>> useful to you, for instance:
>>> https://sourceforge.net/p/apertium/mailman/message/37251899/
>>>
>>> Your proposal seems unrealistic to me. I think 10,000 words in the monodix
>>> (and how many in the bidix?) are not enough for a WER below 20%, except
>>> maybe for two extremely closely related languages.
>>>
>>> To evaluate your proposal better, I'd like to know the answers to some
>>> basic questions:
>>>
>>> * What is the current state of the Bhojpuri language and, possibly, of
>>> the Bhojpuri-Hindi language pair in Apertium?
>>> * Would you have to write a whole Bhojpuri morphological analyser from
>>> scratch and then add some 10,000 words, manually assigning each of them
>>> to a given paradigm? How much time will you need for this?
>>> * Where would you get the bilingual dictionary from? Would you have to
>>> create it yourself? Are there freely available bilingual electronic
>>> dictionaries (e.g. Wiktionary)?
>>> * Would you work on a Bhojpuri-to-Hindi translator or on a
>>> Hindi-to-Bhojpuri one? In either case there will be quite a lot of work
>>> on morphological disambiguation, but with a single direction you'll have
>>> to do it only once. If both Hindi-to-Bhojpuri and Hindi-to-Bengali are
>>> chosen (which is entirely possible), this work can be shared between the
>>> two projects.
>>>
>>> There is nothing wrong with doing all this work by hand, if needed. It
>>> depends on the state of the language resources for the given language. But
>>> it is necessary to know to what extent you will have to do this
>>> time-consuming work.
>>>
>>> Even when we had twice the time, in most cases the projects did not
>>> manage to create a working translator for a new language pair. Under the
>>> current conditions, it is even more difficult.
>>>
>>> Hèctor
>>>
>>>
>>>
>>>
>>> Message from Anuradha Pandey <anuradha200...@gmail.com> on Wed, 7 Apr
>>> 2021 at 16:28:
>>>
>>>> Hello everyone,
>>>> I am Anuradha Pandey, a sophomore student at BITS Pilani. I am
>>>> interested in participating in GSoC 2021, on the project "*Develop a
>>>> prototype MT system for a strategic language pair*".
>>>>
>>>> I have prepared a rough draft for the same and I am planning to build a
>>>> Bhojpuri (BHO)-Hindi (HIN) MT pair. I am improving my translation system for
>>>> the coding challenge and I will update my work on the GitHub repository
>>>> mentioned in the draft. It would be really helpful if I could get some
>>>> feedback before I make the final submission.
>>>>
>>>> Link to the draft -
>>>>
>>>> https://docs.google.com/document/d/1U19gJ3TMKYkYsp-FRthrvXkCRJUnNYSYKi46XhvZGOE/edit?usp=sharing
>>>>
>>>> Thanks & Regards,
>>>> Anuradha Pandey
>>>> IRC: Anuradha_Pandey
>>>> _______________________________________________
>>>> Apertium-stuff mailing list
>>>> Apertium-stuff@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff