Hi all, I've completed the preliminary draft of my proposal and would really appreciate your comments/suggestions on it: http://wiki.apertium.org/wiki/Pmodi/GSOC_2020_proposal:_Hindi-Punjabi
Francis (firstly, sorry for cc'ing you personally): since you have been managing the repo, could you review my coding challenge? (I believe you know the script.)

Warm regards,
Priyank Modi

On Sat, Mar 21, 2020 at 1:11 PM Hèctor Alòs i Font <hectora...@gmail.com> wrote:

> Hi Priyank,
>
> Yes, I now see that the Hindi गलत__adj paradigm is like this, and the
> Punjabi ਗਲਤ__adj seems to be a copy of it.
>
> I can only say that we do it differently in the Romance languages I work
> with. I can't say that the "Hindi method" is bad. It works for Hindi-Urdu,
> doesn't it? This makes morphological disambiguation harder, but transfer
> is probably easier.
>
> I agree with you that, since apertium-urd-hin is released, apertium-hin
> should be quite reliable, so you should concentrate on Punjabi.
> Nevertheless, in my experience it is not unusual for a language package
> with just one released pair to need some improvement too. This happens
> especially in cases like Urdu-Hindi, where the paired language is
> extremely closely related. For instance, if morphological disambiguation
> is only superficially done, there won't be any problem for a translation
> into Urdu, because almost all the time the same ambiguity will exist in
> Urdu too. But when translating into a less closely related language,
> problems arise, and more work on disambiguation has to be done.
>
> Best,
> Hèctor
>
> Message from Priyank Modi <priyankmod...@gmail.com> on Sat, 21 Mar 2020
> at 9:22:
>
>> By the way, it seems strange that you have 9 analyses for this adjective.
>>> Usually in these cases we put only the first analysis in the dictionary.
>>> The others, if really needed, can be added as <e r="RL">.
>>
>> Regarding this, I found a number of such anomalies in the Hindi monodix,
>> and tried to resolve some of them by asking mentors on IRC.
>> But since urdu-hindi is a released pair (and hence the Hindi monodix
>> should have been reviewed), I have tried to add similar rules in the
>> Punjabi monodix as well. This will have to be fixed in the final version.
>> Following your suggestion, I'll add the (possible) errors I find in the
>> current hin and hin-pan dictionaries to my list and report them in the
>> proposal. This will also help me get quick feedback on most of them, so
>> that I can at least bring the Hindi monodix up to a reviewed and correct
>> state in the period between submitting the proposal and acceptance. :D
>>
>> Does this look good?
>> Thanks.
>>
>> On Sat, Mar 21, 2020 at 11:37 AM Priyank Modi <priyankmod...@gmail.com>
>> wrote:
>>
>>> Hi Hector,
>>> Thank you so much for taking the time to look at my challenge in detail
>>> and providing the feedback. I already understand this error and will
>>> work on removing all '#' symbols in the final submission of my coding
>>> challenge. To start with, the number of '#'s was at least 3-4 times what
>>> I have currently. Quite a few of these still exist because the words
>>> were already added to the bidix, but the monodix for Punjabi was almost
>>> empty when I started off (you can check the original repo in the
>>> incubator).
>>> Anyway, this has been really helpful and I'll make sure to improve on
>>> this. Since you couldn't read the script, I should tell you that I'm
>>> able to achieve close-to-human translation for most of these test
>>> sentences (as said earlier, I'll be including an analysis in my proposal
>>> explaining the translations in IPA, with which I'll need your help in
>>> reviewing as well 😬)
>>>
>>> I was able to find some dictionaries and parallel texts for both
>>> languages. Is there anything else I can do right now? Could you also
>>> help me with some references on the use of case markers during
>>> translation? :)
>>>
>>> Thank you again.
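To make the quoted suggestion about restricted entries concrete: in dix files, an entry without a restriction works in both directions, while r="RL" restricts it to generation only, so the analyser returns a single analysis for the surface form. A minimal monodix sketch (the paradigm names here are made up for illustration, not taken from apertium-hin or apertium-pan):

```xml
<!-- Monodix sketch: the unrestricted entry is used for both analysis
     and generation; the r="RL" entry can still be generated from the
     lexical form but is never returned by the analyser, which keeps
     morphological disambiguation simpler. -->
<section id="main" type="standard">
  <!-- both directions: analysis and generation -->
  <e lm="सब"><i>सब</i><par n="sab__adj"/></e>
  <!-- generation only: an extra analysis, if really needed -->
  <e r="RL" lm="सब"><i>सब</i><par n="sab__adj_extra"/></e>
</section>
```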
>>>
>>> Warm regards,
>>> Priyank
>>>
>>> On Sat, Mar 21, 2020 at 10:49 AM Hèctor Alòs i Font
>>> <hectora...@gmail.com> wrote:
>>>
>>>> Hi Priyank,
>>>>
>>>> I've been looking at your coding challenge. I can't understand
>>>> anything, but I see the symbol # relatively often. That is annoying.
>>>> See: http://wiki.apertium.org/wiki/Apertium_stream_format#Special
>>>>
>>>> This happens, for instance, when in the bidix the target word has a
>>>> given gender and/or case, but in the monodix it has another. The lemma
>>>> is recognized, but there isn't any information for generating the
>>>> surface form as received from the bidix + transfer.
>>>>
>>>> Using apertium-viewer, I analysed this case:
>>>>
>>>> सब
>>>> ^सब/सब<adj><mfn><sp>/सब<adj><m><sg><nom>/सब<adj><m><sg><obl>/सब<adj><m><pl><nom>/सब<adj><m><pl><obl>/सब<adj><f><sg><nom>/सब<adj><f><sg><obl>/सब<adj><f><pl><nom>/सब<adj><f><pl><obl>/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>>>
>>>> ^सब/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>>> ^सब<prn><pers><p3><mf><pl><nom>$
>>>> ^सब<prn><pers><p3><mf><pl><nom>/ਸਭ<prn><pers><p3><mf><pl><nom>$
>>>> ^default<default>{^ਸਭ<prn><pers><p3><mf><pl><nom>$}$
>>>> ^ਸਭ<prn><pers><p3><mf><pl><nom>$
>>>> #ਸਭ
>>>>
>>>> As expected, the problem is that ^ਸਭ<prn><pers><p3><mf><pl><nom>$
>>>> cannot be generated.
>>>>
>>>> Then I do:
>>>> apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
>>>> "<ਸਭ>"
>>>> "ਸਭ" adj mfn sp
>>>> "ਸਭ" adj m sg nom
>>>> "ਸਭ" adj m sg obl
>>>> "ਸਭ" adj m pl nom
>>>> "ਸਭ" adj m pl obl
>>>> "ਸਭ" adj f sg nom
>>>> "ਸਭ" adj f sg obl
>>>> "ਸਭ" adj f pl nom
>>>> "ਸਭ" adj f pl obl
>>>> "<.>"
>>>> "." sent
>>>>
>>>> So that's the problem: in the bidix it is said that ਸਭ is a pronoun,
>>>> but in the monodix it is defined as an adjective.
>>>>
>>>> By the way, it seems strange that you have 9 analyses for this
>>>> adjective. Usually in these cases we put only the first analysis in the
>>>> dictionary.
>>>> The others, if really needed, can be added as <e r="RL">.
>>>>
>>>> Best,
>>>> Hèctor
>>>>
>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 Mar 2020
>>>> at 0:29:
>>>>
>>>>> Hi Hector, Francis;
>>>>> I've made progress on the coding challenge and wanted your *feedback*
>>>>> on it: https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
>>>>> *(The bin files remained after a `make clean`, so I didn't remove them
>>>>> from the repo; let me know if this is incorrect.)*
>>>>>
>>>>> > I've attempted to translate the file already added in the original
>>>>> repository
>>>>> <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>.
>>>>> > Output file:
>>>>> <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
>>>>> > Right now, I'm fixing the few missing, untranslated or incorrectly
>>>>> translated words and focusing more on translating a full article which
>>>>> can be compared against a benchmark (parallel text), using the
>>>>> techniques mentioned in the section on Building dictionaries
>>>>> <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll be
>>>>> mentioning the WER and coverage details in my proposal.
>>>>> > As Hector mentioned last time, I've been able to find some parallel
>>>>> texts and am asking others to free their resources. I was able to
>>>>> retrieve a good corpus available on request (owned by the tourism
>>>>> department of the state). Could someone *send me the terms for safely
>>>>> using a corpus*?
>>>>> > Given that both Hindi and Punjabi have phonemic orthographies, could
>>>>> we use *fuzzy string matching* (simple string mapping in this case) to
>>>>> translate proper nouns/borrowed words (at least single-word NEs)?
>>>>> > Finally, could you point me to some *resources about the way case
>>>>> markers and dependencies* are used in the Apertium model?
>>>>> This could be crucial for this language pair, because most of the POS
>>>>> tagging and chunking revolves around case markers and dependency
>>>>> relations.
>>>>>
>>>>> Thank you so much for the support. Have a great day!
>>>>>
>>>>> Warm regards,
>>>>> PM
>>>>>
>>>>> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font
>>>>> <hectora...@gmail.com> wrote:
>>>>>
>>>>>> Hi Priyank,
>>>>>>
>>>>>> I calculated the coverage on the Wikipedia dumps I got, which I used
>>>>>> for getting the frequency lists. I think this is fair, since these
>>>>>> corpora are enormous. But I calculated WER on the basis of other
>>>>>> texts. I calculated it only a few times, at fixed project benchmarks,
>>>>>> since I needed 2-3 hours for it (maybe because I work too slowly).
>>>>>> Every time I took 3 pseudo-random "good" Wikipedia articles (the
>>>>>> feature of the day and two more). I just took the introduction at the
>>>>>> beginning. This adds up to c. 1000 words. Sometimes I took random
>>>>>> front-page news from top newspapers (typically sociopolitical). In
>>>>>> the final calculation, I took 4-5 short texts from both Wikipedia and
>>>>>> newspapers (c. 1500 words). This shows the type of language I aimed
>>>>>> at. The idea has been to develop a tool for a more or less
>>>>>> under-resourced language, especially for helping the creation of
>>>>>> Wikipedia articles.
>>>>>>
>>>>>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another
>>>>>> strategy for following the evolution of WER/PER (see
>>>>>> http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He got
>>>>>> a reference text for the whole project and automatically tested
>>>>>> against it at the end of every week.
>>>>>> If you use this strategy, you have to be very disciplined and not
>>>>>> let yourself be influenced by the mistakes you see in these tests
>>>>>> (this means not adding certain words to the dictionaries, or
>>>>>> morphological disambiguation rules, lexical selection rules, or
>>>>>> transfer rules, because of errors detected during these weekly
>>>>>> tests). I am not really a good example of discipline at work, so I
>>>>>> prefer to use the more manual, and more time-consuming, method I
>>>>>> described above.
>>>>>>
>>>>>> Currently, I'm preparing my own proposal, and I'm doing the same as
>>>>>> you. Like yours, my proposal includes a widely used language, which
>>>>>> is released in Apertium, and a (very) under-resourced language,
>>>>>> unreleased in Apertium, which needs a lot of work. I have got a test
>>>>>> text for both languages and I've added the needed words to the
>>>>>> dictionaries, so that most of the text is translated. It is just a
>>>>>> test, because there are still big errors due to the lack of transfer
>>>>>> rules (although I've copied some useful transfer rules from another
>>>>>> closely related language pair). I'm currently collecting resources:
>>>>>> dictionaries, texts in the under-resourced language, and bilingual
>>>>>> texts (in my case it is not so easy, because the under-resourced
>>>>>> language is really very under-resourced, there are several competing
>>>>>> orthographies, and there is very great dialectal variety). I'm also
>>>>>> seeing which major transfer rules have to be included. In your case,
>>>>>> I suppose you'll use a 3-stage transfer, so you should plan what will
>>>>>> have to be done in each of stages 1, 2 and 3. This includes planning
>>>>>> which information the chunk headers created at stage 1 should carry.
>>>>>> I guess the Hindi-Urdu language pair can be a good starting point,
>>>>>> but maybe something else would need to be added in the headers, since
>>>>>> Hindi and Urdu are extremely close languages, and Punjabi, as far as
>>>>>> I know, is not so close to Hindi.
>>>>>>
>>>>>> Best,
>>>>>> Hèctor
>>>>>>
>>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 Mar
>>>>>> 2020 at 2:44:
>>>>>>
>>>>>>> Hi Hector,
>>>>>>> Thank you so much for the reply. The proposals were really helpful.
>>>>>>> I've completed the coding challenge for a small set of 10 sentences
>>>>>>> (for now), which I believe Francis has added to the repo as a test
>>>>>>> set. I'll include it in the proposal. For now, I'm working on
>>>>>>> building the dictionaries using the wiki dumps as mentioned in the
>>>>>>> documentation, adding the most frequent words systematically.
>>>>>>> Looking through your proposal, I noticed that you included metrics
>>>>>>> like WER and coverage to track progress. I just wanted to confirm:
>>>>>>> are these computed against the dumps one downloads for the
>>>>>>> respective languages (which seems to be the case, given the way you
>>>>>>> mentioned them in your own proposal), or is there some separate
>>>>>>> benchmark? This will be helpful, as I can then go ahead and describe
>>>>>>> the current state of the dictionaries in more statistical terms.
>>>>>>>
>>>>>>> Finally, is there something else I can do to make my proposal
>>>>>>> better? Or is it advisable to start working on my proposal/some
>>>>>>> other non-entry-level project?
>>>>>>>
>>>>>>> Thank you for sharing the proposals and the guidance once again.
>>>>>>> Have a great day!
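For concreteness, the WER being discussed here is the standard word error rate: the word-level edit distance between a reference translation and the raw MT output, divided by the reference length. A minimal Python sketch of the metric (an illustration only, not Apertium's own evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

For example, `wer("a b c d", "a x c d")` gives 0.25 (one substitution out of four reference words).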
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> PM
>>>>>>>
>>>>>>> --
>>>>>>> Priyank Modi ● Undergrad Research Student
>>>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>>>> Mobile: +91 83281 45692
>>>>>>> Website <https://priyankmodipm.github.io/> ● LinkedIn
>>>>>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font
>>>>>>> <hectora...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Priyank,
>>>>>>>>
>>>>>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. It is
>>>>>>>> usual for closely related pairs to give not very satisfactory
>>>>>>>> results with Google, because most of the time there is an
>>>>>>>> intermediate translation into English. In any case, if you can
>>>>>>>> give some data about the quality of the Google translator (as I
>>>>>>>> did in my 2019 GSoC application
>>>>>>>> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>),
>>>>>>>> it may be useful, I think.
>>>>>>>>
>>>>>>>> In order to present an application for language-pair development,
>>>>>>>> it is required to pass the so-called "coding challenge"
>>>>>>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>.
>>>>>>>> Basically, this will show that you understand the basics of the
>>>>>>>> architecture and know how to add new words to the dictionaries.
>>>>>>>>
>>>>>>>> For the project itself, you'll need to add many words to the
>>>>>>>> Punjabi and Punjabi-Hindi dictionaries, plus transfer rules and
>>>>>>>> lexical selection rules. If you intend to translate from Punjabi,
>>>>>>>> you'll need to work on morphological disambiguation, which needs
>>>>>>>> at least a couple of weeks of work. This is basic, since plenty of
>>>>>>>> errors in Indo-European languages (and, I guess, not only in them)
>>>>>>>> come from bad morphological disambiguation.
>>>>>>>> Usually, closed categories are added to the dictionaries first,
>>>>>>>> and afterwards words are mostly added using frequency lists. If
>>>>>>>> there are free resources you may use, this would be great, but it
>>>>>>>> is absolutely necessary not to automatically copy from copyrighted
>>>>>>>> materials. For my own application this year, I'm asking people to
>>>>>>>> free their resources so that I am able to use them.
>>>>>>>>
>>>>>>>> You may be interested in previous applications for developing
>>>>>>>> language pairs, for instance this one
>>>>>>>> <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>,
>>>>>>>> in addition to mine from last year.
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>> Hèctor
>>>>>>>>
>>>>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 Mar
>>>>>>>> 2020 at 23:49:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am trying to work towards developing the Hindi-Punjabi pair and
>>>>>>>>> needed some guidance on how to go about it. I ran the test files
>>>>>>>>> and noticed that the dictionary file for Punjabi needs work (even
>>>>>>>>> a lot of function words could not be found by the translator).
>>>>>>>>> Should I start with that? Are there tests each stage needs to
>>>>>>>>> pass? Also, finally, what sort of work is expected to make a
>>>>>>>>> decent GSoC proposal? Of course, I'll be interested in developing
>>>>>>>>> this pair regardless, since even Google Translate doesn't seem to
>>>>>>>>> work well for this pair (for the test set specifically, the
>>>>>>>>> Apertium translator worked significantly better).
>>>>>>>>> Any help would be appreciated.
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Warm regards,
>>>>>>>>> PM
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Apertium-stuff mailing list
>>>>>>>>> Apertium-stuff@lists.sourceforge.net
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
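A note on the coverage figures mentioned in the thread: naive coverage is simply the share of tokens for which the morphological analyser returns at least one analysis. In the Apertium stream format, lt-proc marks unknown words with an asterisk (e.g. ^foo/*foo$), so coverage can be estimated from analyser output with a short script. This is a deliberately simplified sketch; real stream output can contain escapes and superblanks that it does not handle:

```python
import re

def naive_coverage(analysed: str) -> float:
    """Share of tokens with at least one analysis.

    `analysed` is lt-proc output in the Apertium stream format, e.g.
    '^घर/घर<n><m><sg><nom>$ ^foo/*foo$'. An unknown word has a single
    pseudo-analysis starting with '*'.
    """
    # Pull out the contents of each ^...$ token.
    tokens = re.findall(r'\^([^$]*)\$', analysed)
    if not tokens:
        return 0.0
    known = 0
    for tok in tokens:
        # Everything after the first '/' is the list of analyses.
        _surface, _, analyses = tok.partition('/')
        if analyses and not analyses.startswith('*'):
            known += 1
    return known / len(tokens)
```

For example, `naive_coverage('^a/a<n>$ ^b/*b$')` gives 0.5, since one of the two tokens is unknown.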