Hi Priyank,

Yes, I now see that the Hindi गलत__adj paradigm is like this, and the Punjabi ਗਲਤ__adj seems to be a copy of it.
I can only say that we do it differently in the Romance languages I work with. I can't say that the "Hindi method" is bad. It works for Hindi-Urdu, doesn't it? It makes morphological disambiguation harder, but transfer is probably easier. I agree with you that, since apertium-urd-hin is released, apertium-hin should be quite reliable, so you should concentrate on Punjabi. Nevertheless, in my experience it is not unusual for a language package with just one released pair to need some improvement too. This happens especially in cases like Urdu-Hindi, where the paired language is extremely closely related. For instance, if morphological disambiguation is only done superficially, there won't be any problem for a translation into Urdu, because almost all the time the same ambiguity will exist in Urdu too. But when translating into a less closely related language, problems arise, and more work on disambiguation has to be done.

Best,
Hèctor

Message from Priyank Modi <priyankmod...@gmail.com> on Sat, 21 Mar 2020 at 9:22:

>> By the way, it seems strange that you have 9 analyses for this adjective. Usually in these cases we put only the first analysis in the dictionary. The others, if really needed, can be added as <e r="RL">.
>
> Regarding this, I found a number of such anomalies in the Hindi monodix, and tried to resolve some of them by asking mentors on IRC. But since Urdu-Hindi is a released pair (and hence the Hindi monodix should have been reviewed), I have tried to add similar rules in the Punjabi monodix as well. This will have to be fixed in the final version. I guess, following your suggestion, I'll add to my list the (possible) errors I find in the current hin and hin-pan dictionaries and report them in the proposal. This will also help me in getting quick feedback on most of these, so that I can at least bring the Hindi monodix up to a reviewed and correct state in the period between submitting the proposal and acceptance. :D
>
> Does this look good?
> Thanks.
>
> On Sat, Mar 21, 2020 at 11:37 AM Priyank Modi <priyankmod...@gmail.com> wrote:
>
>> Hi Hector,
>> Thank you so much for taking the time to look at my challenge in detail and provide feedback. I already understand this error and will work on removing all '#' symbols in the final submission of my coding challenge. To start with, the number of '#'s was at least 3-4 times what I have currently. Quite a few of these still exist because the words were already added to the bidix, but the monodix for Punjabi was almost empty when I started off (you can check the original repo in the incubator).
>> Anyway, this has been really helpful and I'll make sure to improve on this. Since you couldn't read the script, I should tell you that I'm able to achieve close-to-human translation for most of these test sentences (as said earlier, I'll be including an analysis in my proposal explaining the translations in IPA, with which I'll need your help in reviewing as well 😬).
>>
>> I was able to find some dictionaries and parallel texts for both languages. Is there anything else I can do right now? Could you help me with some references on the use of case markers during translation as well? :)
>>
>> Thank you again.
>>
>> Warm regards,
>> Priyank
>>
>> On Sat 21 Mar, 2020, 10:49 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>
>>> Hi Priyank,
>>>
>>> I've been looking at your coding challenge. I can't understand anything, but I see the symbol # relatively often.
>>> That is annoying. See: http://wiki.apertium.org/wiki/Apertium_stream_format#Special
>>>
>>> This happens, for instance, when in the bidix the target word has a given gender and/or case, but in the monodix it has another. The lemma is recognized, but there isn't any information for generating the surface form as received from the bidix + transfer.
>>>
>>> Using apertium-viewer, I analysed this case:
>>>
>>> सब
>>> ^सब/सब<adj><mfn><sp>/सब<adj><m><sg><nom>/सब<adj><m><sg><obl>/सब<adj><m><pl><nom>/सब<adj><m><pl><obl>/सब<adj><f><sg><nom>/सब<adj><f><sg><obl>/सब<adj><f><pl><nom>/सब<adj><f><pl><obl>/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>>
>>> ^सब/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>> ^सब<prn><pers><p3><mf><pl><nom>$
>>> ^सब<prn><pers><p3><mf><pl><nom>/ਸਭ<prn><pers><p3><mf><pl><nom>$
>>> ^default<default>{^ਸਭ<prn><pers><p3><mf><pl><nom>$}$
>>> ^ਸਭ<prn><pers><p3><mf><pl><nom>$
>>> #ਸਭ
>>>
>>> As expected, the problem is that ^ਸਭ<prn><pers><p3><mf><pl><nom>$ cannot be generated.
>>>
>>> Then I do:
>>>
>>> apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
>>> "<ਸਭ>"
>>> "ਸਭ" adj mfn sp
>>> "ਸਭ" adj m sg nom
>>> "ਸਭ" adj m sg obl
>>> "ਸਭ" adj m pl nom
>>> "ਸਭ" adj m pl obl
>>> "ਸਭ" adj f sg nom
>>> "ਸਭ" adj f sg obl
>>> "ਸਭ" adj f pl nom
>>> "ਸਭ" adj f pl obl
>>> "<.>"
>>> "." sent
>>>
>>> So that's the problem: in the bidix it is said that ਸਭ is a pronoun, but in the monodix it is defined as an adjective.
>>>
>>> By the way, it seems strange that you have 9 analyses for this adjective. Usually in these cases we put only the first analysis in the dictionary. The others, if really needed, can be added as <e r="RL">.
>>>
>>> Best,
>>> Hèctor
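To make that last suggestion concrete, the pattern would look roughly like the sketch below. The tag names come from the analyser output above, but the real गलत__adj pardef in apertium-hin may be organised differently, and the ਸਭ__prn paradigm name is purely hypothetical; the point is that only the first analysis is visible to the analyser, while the generator still accepts all the forms, and that ਸਭ also needs an entry matching the tags the bidix sends.

  <pardef n="गलत__adj">
    <e>       <p><l/><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
    <e r="RL"><p><l/><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
    <e r="RL"><p><l/><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
    <!-- ... the remaining gender/number/case combinations, all marked r="RL",
         so the analyser only returns adj.mfn.sp but the generator can still
         produce every inflected reading coming from transfer ... -->
  </pardef>

  <!-- In apertium-pan, ਸਭ additionally needs an entry whose tags match what the
       bidix sends (prn.pers.p3.mf.pl); the paradigm name here is hypothetical: -->
  <e lm="ਸਭ"><i>ਸਭ</i><par n="ਸਭ__prn"/></e>

Once the generator can produce ^ਸਭ<prn><pers><p3><mf><pl><nom>$, the # disappears.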
>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 Mar 2020 at 0:29:
>>>
>>>> Hi Hector, Francis;
>>>> I've made progress on the coding challenge and wanted your *feedback* on it: https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
>>>> *(The bin files remained after a `make clean`, so I didn't remove them from the repo; let me know if this is incorrect.)*
>>>>
>>>> > I've attempted to translate the file already added in the original repository <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>.
>>>> > Output file <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
>>>> > Right now, I'm fixing the few missing/untranslated/incorrectly translated words and focusing more on translating a full article which can be compared against a benchmark (parallel text), using the techniques mentioned in the section on Building dictionaries <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll be mentioning the WER and coverage details in my proposal.
>>>> > As Hector mentioned last time, I've been able to find some parallel texts and am asking others to free their resources. I was able to retrieve a good corpus available on request (owned by the tourism department of the state). Could someone *send me the terms for safely using a corpus*?
>>>> > Given that both Hindi and Punjabi have phonemic orthography, could we use *fuzzy string matching* (simple string mapping in this case) to translate proper nouns/borrowed words (at least single-word NEs)?
>>>> > Finally, could you point me to some *resources about the way case markers and dependencies* are being used in the Apertium model? This could be crucial for this language pair, because most of the POS tagging and chunking revolves around the case markers and dependency relations.
>>>>
>>>> Thank you so much for the support. Have a great day!
>>>>
>>>> Warm regards,
>>>> PM
>>>>
>>>> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>>
>>>>> Hi Priyank,
>>>>>
>>>>> I calculated the coverage on the Wikipedia dumps I got, which I also used for the frequency lists. I think this is fair, since these corpora are enormous. But I calculated WER on the basis of other texts. I calculated it only a few times, at fixed project benchmarks, since I needed 2-3 hours for it (maybe because I work too slowly). Each time I took 3 pseudo-random "good" Wikipedia articles (the featured article of the day and two more), using just the introduction at the beginning of each. This adds up to c. 1000 words. Sometimes I took random front-page news from top newspapers (typically sociopolitical). For the final calculation, I took 4-5 short texts from both Wikipedia and newspapers (c. 1500 words). This reflects the type of language I was aiming at: the idea has been to develop a tool for a more or less under-resourced language, especially to help with the creation of Wikipedia articles.
>>>>>
>>>>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another strategy for following the evolution of WER/PER (see http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He got a reference text for the whole project and automatically tested against it at the end of every week. If you use this strategy, you have to be very disciplined and not be influenced by the mistakes you see in these tests (this means not adding certain words to the dictionaries, or morphological disambiguation rules, lexical selection rules, or transfer rules, just because of errors detected during these weekly tests). I am not really a good example of discipline at work, so I prefer the more manual, and more time-consuming, method I have described above.
>>>>>
>>>>> Currently, I'm preparing my own proposal, and I'm doing the same as you. Like yours, my proposal includes a widely used language, which is released in Apertium, and a (very) under-resourced language, unreleased in Apertium, which needs a lot of work. I have got a test text for both languages and I've added the needed words to the dictionaries, so that most of the text is translated. It is just a test, because there are still big errors due to the lack of transfer rules (although I've copied some useful transfer rules from another closely related language pair). I'm currently collecting resources: dictionaries, texts in the under-resourced language and bilingual texts (in my case this is not so easy, because the under-resourced language is really very under-resourced, there are several competing orthographies, and there is a lot of dialectal variety). I'm also seeing which major transfer rules have to be included. In your case, I suppose you'll use a 3-stage transfer, so you should plan what will have to be done in each of stages 1, 2 and 3. This includes planning which information the chunk headers created at stage 1 should carry. I guess the Hindi-Urdu language pair can be a good starting point, but maybe something else will need to be added to the headers, since Hindi and Urdu are extremely close languages, and Punjabi, as far as I know, is not so close to Hindi.
>>>>>
>>>>> Best,
>>>>> Hèctor
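For readers wondering what "information in the chunk headers" means in practice: at stage 1 (.t1x) the transfer groups words into chunks whose pseudo-lemma carries a short list of tags (the header); stage 2 (interchunk) only sees those header tags, not the words inside, and stage 3 (postchunk) expands the chunk again. So if the interchunk rules will need, say, case (nom/obl, relevant to the case markers asked about above), it has to be propagated into the header. A purely illustrative noun-phrase chunk in the stream format, where the chunk name, words and tags are made up for the example:

  ^det_nom<SN><m><pl><nom>{^ਸਭ<adj><2><3><4>$ ^ਲੋਕ<n><2><3><4>$}$

Here <SN><m><pl><nom> is the header the stage-2 rules operate on, and the <2>/<3>/<4> inside the chunk are positional references that postchunk replaces with the header's gender, number and case.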
>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 Mar 2020 at 2:44:
>>>>>
>>>>>> Hi Hector,
>>>>>> Thank you so much for the reply. The proposals were really helpful. I've completed the coding challenge for a small set of 10 sentences (for now), which I believe Francis has added to the repo as a test set. I'll include the same in the proposal. For now, I'm working on building the dictionaries using the wiki dumps as mentioned in the documentation, adding the most frequent words systematically.
>>>>>> Looking through your proposal, I noticed that you included metrics like WER and coverage to determine progress. I just wanted to confirm whether these are computed against the dumps one downloads for the respective languages (which seems to be the case, from the way you mention them in your own proposal), or whether there is some separate benchmark. This will be helpful, as I can then go ahead and describe the current state of the dictionaries in a more statistical manner.
>>>>>>
>>>>>> Finally, is there something else I can do to make my proposal better? Or is it advisable to start working on my proposal/some other non-entry-level project?
>>>>>>
>>>>>> Thank you for sharing the proposals and the guidance once again. Have a great day!
>>>>>>
>>>>>> Warm regards,
>>>>>> PM
>>>>>>
>>>>>> --
>>>>>> Priyank Modi ● Undergrad Research Student
>>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>>> Mobile: +91 83281 45692
>>>>>> Website <https://priyankmodipm.github.io/> ● Linkedin <https://www.linkedin.com/in/priyank-modi-81584b175/>
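To make the mechanics of those two figures concrete, the measurements can be scripted roughly as follows. This is only a sketch: the file names are placeholders, the coverage pipeline is the naive "share of tokens with at least one analysis" measure, and the exact option names of apertium-eval-translator should be checked against its usage message.

  # naive coverage over a raw corpus: tokens with no analysis come out as ^word/*word$
  cat corpus.hin.txt | apertium-destxt | lt-proc hin.automorf.bin > analysed.txt
  total=$(grep -o '\^' analysed.txt | wc -l)
  unknown=$(grep -o '/\*' analysed.txt | wc -l)
  echo "coverage: $(echo "scale=4; 1 - $unknown / $total" | bc)"

  # WER/PER against a post-edited reference of the machine translation
  apertium -d . hin-pan < sample.hin.txt > sample.mt.pan.txt
  apertium-eval-translator -test sample.mt.pan.txt -ref sample.postedited.pan.txt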
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Priyank,
>>>>>>>
>>>>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. It is usual that closely related pairs give not very satisfactory results with Google, because most of the time there is an intermediate translation into English. In any case, if you can give some data about the quality of the Google translator (as I did in my 2019 GSoC application <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>), it may be useful, I think.
>>>>>>>
>>>>>>> In order to present an application for language-pair development, you are required to pass the so-called "coding challenge" <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>. Basically, this will show that you understand the basics of the architecture and know how to add new words to the dictionaries.
>>>>>>>
>>>>>>> For the project itself, you'll need to add many words to the Punjabi and Punjabi-Hindi dictionaries, plus transfer rules and lexical selection rules. If you intend to translate from Punjabi, you'll also need to work on morphological disambiguation, which needs at least a couple of weeks of work. This is basic, since plenty of errors in Indo-European languages (and, I guess, not only in them) come from bad morphological disambiguation. Usually, closed categories are added to the dictionaries first, and afterwards words are mostly added using frequency lists. If there are free resources you may use, that would be great, but it is absolutely necessary not to copy automatically from copyrighted materials. For my own application this year, I'm asking people to free their resources so that I can use them.
>>>>>>>
>>>>>>> You may be interested in previous applications for developing language pairs, for instance this one <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>, in addition to mine from last year.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Hèctor
>>>>>>>
>>>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 Mar 2020 at 23:49:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work towards developing the Hindi-Punjabi pair and needed some guidance on how to go about it. I ran the test files and noticed that the dictionary file for Punjabi needs work (even a lot of function words could not be found by the translator). Should I start with that? Are there some tests each stage needs to pass? Also, what sort of work is expected to make a decent GSoC proposal? Of course, I'll be interested in developing this pair regardless, since even Google Translate doesn't seem to work well for it (for the test set specifically, the Apertium translator worked significantly better).
>>>>>>>> Any help would be appreciated.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Warm regards,
>>>>>>>> PM
>>>>>>>>
>>>>>>>> --
>>>>>>>> Priyank Modi ● Undergrad Research Student
>>>>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>>>>> Mobile: +91 83281 45692
>>>>>>>> Website <https://priyankmodipm.github.io/> ● Linkedin <https://www.linkedin.com/in/priyank-modi-81584b175/>
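Coming back to the point about morphological disambiguation above, and to the ਸਭ example from the coding challenge: once ਸਭ also has a pronoun analysis in the Punjabi monodix, this is exactly the kind of ambiguity the constraint grammar (.rlx file) has to resolve. A very rough sketch in CG-3 syntax follows; the context conditions are only a first approximation for illustration, not a claim about Punjabi grammar.

  # Fragment only; a real grammar also needs DELIMITERS and many more rules.
  LIST N = n ;
  LIST Adj = adj ;
  LIST Prn = prn ;

  # Prefer the adjective reading of ਸਭ when a noun follows ...
  SELECT Adj IF (0 ("ਸਭ")) (1 N) ;
  # ... and the pronoun reading otherwise.
  SELECT Prn IF (0 ("ਸਭ")) (NOT 1 N) ;

A similar rule for सब would be needed in apertium-hin when translating out of Hindi.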
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff