Hi Priyank,

Yes, I now see that the Hindi गलत__adj paradigm is like this, and the Punjabi ਗਲਤ__adj seems to be a copy of it.
I can only say that we do it differently in the Romance languages I work with. I can't say that the "Hindi method" is bad. It works for Hindi-Urdu, doesn't it? It makes morphological disambiguation harder, but transfer is probably easier. I agree with you that, since apertium-urd-hin is released, apertium-hin should be quite reliable, so you should concentrate on Punjabi. Nevertheless, in my experience it is not unusual for a language package with just one released pair to need some improvement too. This happens especially in cases like Urdu-Hindi, where the paired language is extremely closely related. For instance, if morphological disambiguation is only done superficially, there won't be any problem for a translation into Urdu, because almost all the time the same ambiguity will exist in Urdu too. But when translating into a less closely related language, problems arise, and more work on disambiguation has to be done.

Best,
Hèctor

Message from Priyank Modi <priyankmod...@gmail.com> on Sat, 21 Mar 2020 at 9:22:

>> By the way, it seems strange that you have 9 analyses for this adjective. Usually in these cases we put only the first analysis in the dictionary. The others, if really needed, can be added as <e r="RL">.
>
> Regarding this, I found a number of such anomalies in the Hindi monodix, and tried to resolve some of them by asking mentors on IRC. But since Urdu-Hindi is a released pair (and hence the Hindi monodix should have been reviewed), I have tried to add similar rules in the Punjabi monodix as well. This will have to be fixed in the final version. I guess, following your suggestion, I'll add to my list the (possible) errors I find in the current hin and hin-pan dictionaries and report them in the proposal. This will also help me in getting quick feedback on most of these, so that I can at least bring the Hindi monodix up to a reviewed and correct state in the period between submitting the proposal and acceptance. :D
>
> Does this look good?
> Thanks.
>
> On Sat, Mar 21, 2020 at 11:37 AM Priyank Modi <priyankmod...@gmail.com> wrote:
>
>> Hi Hector,
>> Thank you so much for taking the time to look at my challenge in detail and provide feedback. I already understand this error and will work on removing all '#' symbols in the final submission of my coding challenge. To start with, the number of '#'s was at least 3-4 times what I have currently. Quite a few of these still exist because the words were already added to the bidix, but the monodix for Punjabi was almost empty when I started off (you can check the original repo in the incubator).
>> Anyway, this has been really helpful and I'll make sure to improve on this. Since you couldn't read the script, I should tell you that I'm able to achieve close-to-human translation for most of these test sentences (as said earlier, I'll be including an analysis in my proposal explaining the translations in IPA, with which I'll need your help in reviewing as well 😬).
>>
>> I was able to find some dictionaries and parallel texts for both languages. Is there anything else I can do right now? Could you help me with some references on the use of case markers during translation as well? :)
>>
>> Thank you again.
>>
>> Warm regards,
>> Priyank
>>
>> On Sat 21 Mar, 2020, 10:49 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>
>>> Hi Priyank,
>>>
>>> I've been looking at your coding challenge. I can't understand anything, but I see the symbol # relatively often.
>>> That is annoying. See: http://wiki.apertium.org/wiki/Apertium_stream_format#Special
>>>
>>> This happens, for instance, when in the bidix the target word has a given gender and/or case, but in the monodix it has another. The lemma is recognized, but there isn't any information for generating the surface form as received from the bidix + transfer.
>>>
>>> Using apertium-viewer, I analysed this case:
>>>
>>> सब
>>> ^सब/सब<adj><mfn><sp>/सब<adj><m><sg><nom>/सब<adj><m><sg><obl>/सब<adj><m><pl><nom>/सब<adj><m><pl><obl>/सब<adj><f><sg><nom>/सब<adj><f><sg><obl>/सब<adj><f><pl><nom>/सब<adj><f><pl><obl>/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>>
>>> ^सब/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>> ^सब<prn><pers><p3><mf><pl><nom>$
>>> ^सब<prn><pers><p3><mf><pl><nom>/ਸਭ<prn><pers><p3><mf><pl><nom>$
>>> ^default<default>{^ਸਭ<prn><pers><p3><mf><pl><nom>$}$
>>> ^ਸਭ<prn><pers><p3><mf><pl><nom>$
>>> #ਸਭ
>>>
>>> As expected, the problem is that ^ਸਭ<prn><pers><p3><mf><pl><nom>$ cannot be generated.
>>>
>>> Then I do:
>>>
>>> apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
>>> "<ਸਭ>"
>>> "ਸਭ" adj mfn sp
>>> "ਸਭ" adj m sg nom
>>> "ਸਭ" adj m sg obl
>>> "ਸਭ" adj m pl nom
>>> "ਸਭ" adj m pl obl
>>> "ਸਭ" adj f sg nom
>>> "ਸਭ" adj f sg obl
>>> "ਸਭ" adj f pl nom
>>> "ਸਭ" adj f pl obl
>>> "<.>"
>>> "." sent
>>>
>>> So that's the problem: in the bidix it is said that ਸਭ is a pronoun, but in the monodix it is defined as an adjective.
>>>
>>> By the way, it seems strange that you have 9 analyses for this adjective. Usually in these cases we put only the first analysis in the dictionary. The others, if really needed, can be added as <e r="RL">.
>>>
>>> Best,
>>> Hèctor
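To make that last suggestion concrete, the pattern would look roughly like the sketch below. The tag names come from the analyser output above, but the real गलत__adj pardef in apertium-hin may be organised differently, and the ਸਭ__prn paradigm name is purely hypothetical; the point is that only the first analysis is visible to the analyser, while the generator still accepts all the forms, and that ਸਭ also needs an entry matching the tags the bidix sends.

  <pardef n="गलत__adj">
    <e>       <p><l/><r><s n="adj"/><s n="mfn"/><s n="sp"/></r></p></e>
    <e r="RL"><p><l/><r><s n="adj"/><s n="m"/><s n="sg"/><s n="nom"/></r></p></e>
    <e r="RL"><p><l/><r><s n="adj"/><s n="m"/><s n="sg"/><s n="obl"/></r></p></e>
    <!-- ... the remaining gender/number/case combinations, all marked r="RL",
         so the analyser only returns adj.mfn.sp but the generator can still
         produce every inflected reading coming from transfer ... -->
  </pardef>

  <!-- In apertium-pan, ਸਭ additionally needs an entry whose tags match what the
       bidix sends (prn.pers.p3.mf.pl); the paradigm name here is hypothetical: -->
  <e lm="ਸਭ"><i>ਸਭ</i><par n="ਸਭ__prn"/></e>

Once the generator can produce ^ਸਭ<prn><pers><p3><mf><pl><nom>$, the # disappears.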
>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 Mar 2020 at 0:29:
>>>
>>>> Hi Hector, Francis;
>>>> I've made progress on the coding challenge and wanted your *feedback* on it: https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
>>>> *(The bin files remained after a `make clean`, so I didn't remove them from the repo; let me know if this is incorrect.)*
>>>>
>>>> > I've attempted to translate the file already added in the original repository <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>.
>>>> > Output file <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
>>>> > Right now, I'm fixing the few missing/untranslated/incorrectly translated words and focusing more on translating a full article which can be compared against a benchmark (parallel text), using the techniques mentioned in the section on Building dictionaries <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll be mentioning the WER and coverage details in my proposal.
>>>> > As Hector mentioned last time, I've been able to find some parallel texts and am asking others to free their resources. I was able to retrieve a good corpus available on request (owned by the tourism department of the state). Could someone *send me the terms for safely using a corpus*?
>>>> > Given that both Hindi and Punjabi have phonemic orthography, could we use *fuzzy string matching* (simple string mapping in this case) to translate proper nouns/borrowed words (at least single-word NEs)?
>>>> > Finally, could you point me to some *resources about the way case markers and dependencies* are being used in the Apertium model? This could be crucial for this language pair, because most of the POS tagging and chunking revolves around the case markers and dependency relations.
>>>>
>>>> Thank you so much for the support. Have a great day!
>>>>
>>>> Warm regards,
>>>> PM
>>>>
>>>> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>>
>>>>> Hi Priyank,
>>>>>
>>>>> I calculated the coverage on the Wikipedia dumps I got, which I also used for the frequency lists. I think this is fair, since these corpora are enormous. But I calculated WER on the basis of other texts. I calculated it only a few times, at fixed project benchmarks, since I needed 2-3 hours for it (maybe because I work too slowly). Each time I took 3 pseudo-random "good" Wikipedia articles (the featured article of the day and two more), using just the introduction at the beginning of each. This adds up to c. 1000 words. Sometimes I took random front-page news from top newspapers (typically sociopolitical). For the final calculation, I took 4-5 short texts from both Wikipedia and newspapers (c. 1500 words). This reflects the type of language I was aiming at: the idea has been to develop a tool for a more or less under-resourced language, especially to help with the creation of Wikipedia articles.
>>>>>
>>>>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another strategy for following the evolution of WER/PER (see http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He got a reference text for the whole project and automatically tested against it at the end of every week. If you use this strategy, you have to be very disciplined and not be influenced by the mistakes you see in these tests (this means not adding certain words to the dictionaries, or morphological disambiguation rules, lexical selection rules, or transfer rules, just because of errors detected during these weekly tests). I am not really a good example of discipline at work, so I prefer the more manual, and more time-consuming, method I have described above.
>>>>>
>>>>> Currently, I'm preparing my own proposal, and I'm doing the same as you. Like yours, my proposal includes a widely used language, which is released in Apertium, and a (very) under-resourced language, unreleased in Apertium, which needs a lot of work. I have got a test text for both languages and I've added the needed words to the dictionaries, so that most of the text is translated. It is just a test, because there are still big errors due to the lack of transfer rules (although I've copied some useful transfer rules from another closely related language pair). I'm currently collecting resources: dictionaries, texts in the under-resourced language and bilingual texts (in my case this is not so easy, because the under-resourced language is really very under-resourced, there are several competing orthographies, and there is a lot of dialectal variety). I'm also seeing which major transfer rules have to be included. In your case, I suppose you'll use a 3-stage transfer, so you should plan what will have to be done in each of stages 1, 2 and 3. This includes planning which information the chunk headers created at stage 1 should carry. I guess the Hindi-Urdu language pair can be a good starting point, but maybe something else will need to be added to the headers, since Hindi and Urdu are extremely close languages, and Punjabi, as far as I know, is not so close to Hindi.
>>>>>
>>>>> Best,
>>>>> Hèctor
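For readers wondering what "information in the chunk headers" means in practice: at stage 1 (.t1x) the transfer groups words into chunks whose pseudo-lemma carries a short list of tags (the header); stage 2 (interchunk) only sees those header tags, not the words inside, and stage 3 (postchunk) expands the chunk again. So if the interchunk rules will need, say, case (nom/obl, relevant to the case markers asked about above), it has to be propagated into the header. A purely illustrative noun-phrase chunk in the stream format, where the chunk name, words and tags are made up for the example:

  ^det_nom<SN><m><pl><nom>{^ਸਭ<adj><2><3><4>$ ^ਲੋਕ<n><2><3><4>$}$

Here <SN><m><pl><nom> is the header the stage-2 rules operate on, and the <2>/<3>/<4> inside the chunk are positional references that postchunk replaces with the header's gender, number and case.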
>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 Mar 2020 at 2:44:
>>>>>
>>>>>> Hi Hector,
>>>>>> Thank you so much for the reply. The proposals were really helpful. I've completed the coding challenge for a small set of 10 sentences (for now), which I believe Francis has added to the repo as a test set. I'll include the same in the proposal. For now, I'm working on building the dictionaries using the wiki dumps as mentioned in the documentation, adding the most frequent words systematically.
>>>>>> Looking through your proposal, I noticed that you included metrics like WER and coverage to determine progress. I just wanted to confirm whether these are computed against the dumps one downloads for the respective languages (which seems to be the case, from the way you mention them in your own proposal), or whether there is some separate benchmark. This will be helpful, as I can then go ahead and describe the current state of the dictionaries in a more statistical manner.
>>>>>>
>>>>>> Finally, is there something else I can do to make my proposal better? Or is it advisable to start working on my proposal/some other non-entry-level project?
>>>>>>
>>>>>> Thank you for sharing the proposals and the guidance once again. Have a great day!
>>>>>>
>>>>>> Warm regards,
>>>>>> PM
>>>>>>
>>>>>> --
>>>>>> Priyank Modi ● Undergrad Research Student
>>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>>> Mobile: +91 83281 45692
>>>>>> Website <https://priyankmodipm.github.io/> ● Linkedin <https://www.linkedin.com/in/priyank-modi-81584b175/>
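To make the mechanics of those two figures concrete, the measurements can be scripted roughly as follows. This is only a sketch: the file names are placeholders, the coverage pipeline is the naive "share of tokens with at least one analysis" measure, and the exact option names of apertium-eval-translator should be checked against its usage message.

  # naive coverage over a raw corpus: tokens with no analysis come out as ^word/*word$
  cat corpus.hin.txt | apertium-destxt | lt-proc hin.automorf.bin > analysed.txt
  total=$(grep -o '\^' analysed.txt | wc -l)
  unknown=$(grep -o '/\*' analysed.txt | wc -l)
  echo "coverage: $(echo "scale=4; 1 - $unknown / $total" | bc)"

  # WER/PER against a post-edited reference of the machine translation
  apertium -d . hin-pan < sample.hin.txt > sample.mt.pan.txt
  apertium-eval-translator -test sample.mt.pan.txt -ref sample.postedited.pan.txt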
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Priyank,
>>>>>>>
>>>>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. It is usual that closely related pairs give not very satisfactory results with Google, because most of the time there is an intermediate translation into English. In any case, if you can give some data about the quality of the Google translator (as I did in my 2019 GSoC application <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>), it may be useful, I think.
>>>>>>>
>>>>>>> In order to present an application for language-pair development, you are required to pass the so-called "coding challenge" <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>. Basically, this will show that you understand the basics of the architecture and know how to add new words to the dictionaries.
>>>>>>>
>>>>>>> For the project itself, you'll need to add many words to the Punjabi and Punjabi-Hindi dictionaries, plus transfer rules and lexical selection rules. If you intend to translate from Punjabi, you'll also need to work on morphological disambiguation, which needs at least a couple of weeks of work. This is basic, since plenty of errors in Indo-European languages (and, I guess, not only in them) come from bad morphological disambiguation. Usually, closed categories are added to the dictionaries first, and afterwards words are mostly added using frequency lists. If there are free resources you may use, that would be great, but it is absolutely necessary not to copy automatically from copyrighted materials. For my own application this year, I'm asking people to free their resources so that I can use them.
>>>>>>>
>>>>>>> You may be interested in previous applications for developing language pairs, for instance this one <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>, in addition to mine from last year.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>> Hèctor
>>>>>>>
>>>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 Mar 2020 at 23:49:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work towards developing the Hindi-Punjabi pair and needed some guidance on how to go about it. I ran the test files and noticed that the dictionary file for Punjabi needs work (even a lot of function words could not be found by the translator). Should I start with that? Are there some tests each stage needs to pass? Also, what sort of work is expected to make a decent GSoC proposal? Of course, I'll be interested in developing this pair regardless, since even Google Translate doesn't seem to work well for it (for the test set specifically, the Apertium translator worked significantly better).
>>>>>>>> Any help would be appreciated.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Warm regards,
>>>>>>>> PM
>>>>>>>>
>>>>>>>> --
>>>>>>>> Priyank Modi ● Undergrad Research Student
>>>>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>>>>> Mobile: +91 83281 45692
>>>>>>>> Website <https://priyankmodipm.github.io/> ● Linkedin <https://www.linkedin.com/in/priyank-modi-81584b175/>
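Coming back to the point about morphological disambiguation above, and to the ਸਭ example from the coding challenge: once ਸਭ also has a pronoun analysis in the Punjabi monodix, this is exactly the kind of ambiguity the constraint grammar (.rlx file) has to resolve. A very rough sketch in CG-3 syntax follows; the context conditions are only a first approximation for illustration, not a claim about Punjabi grammar.

  # Fragment only; a real grammar also needs DELIMITERS and many more rules.
  LIST N = n ;
  LIST Adj = adj ;
  LIST Prn = prn ;

  # Prefer the adjective reading of ਸਭ when a noun follows ...
  SELECT Adj IF (0 ("ਸਭ")) (1 N) ;
  # ... and the pronoun reading otherwise.
  SELECT Prn IF (0 ("ਸਭ")) (NOT 1 N) ;

A similar rule for सब would be needed in apertium-hin when translating out of Hindi.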
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff