> > By the way, it seems strange that you have 9 analyses for this adjective.
> > Usually in these cases we put only the first analysis in the dictionary.
> > The others, if really needed, can be added as <e r="RL">.
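For readers following the thread: `<e r="RL">` is a direction restriction on a dictionary entry. An unrestricted entry is used in both compilation directions, while `r="LR"` / `r="RL"` limit it to one direction, so only one entry stays available for generation. A purely illustrative fragment (the lemmas and tags echo the सब/ਸਭ example discussed below; they are not copied from the actual dictionaries, and which direction to restrict depends on the dictionary in question):

```xml
<!-- Illustrative bidix fragment (hypothetical entries). The unrestricted
     entry is the default translation in both directions; the second entry,
     marked r="RL", is only compiled into the right-to-left transducer, so
     it never competes during left-to-right generation. -->
<e>
  <p><l>सब<s n="prn"/></l><r>ਸਭ<s n="prn"/></r></p>
</e>
<e r="RL">
  <p><l>सब<s n="adj"/></l><r>ਸਭ<s n="adj"/></r></p>
</e>
```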
Regarding this, I found a number of such anomalies in the Hindi monodix, and tried to resolve some of them by asking mentors on IRC. But since urdu-hindi is a released pair (and hence the Hindi monodix should have been reviewed), I have added similar rules to the Punjabi monodix as well. This will have to be fixed in the final version. Following your suggestion, I'll add the (possible) errors I find in the current hin and hin-pan dictionaries to my list and report them in the proposal. This will also help me get quick feedback on most of them, so that I can at least bring the Hindi monodix up to a reviewed and correct state between the post-proposal and acceptance periods. :D Does this look good? Thanks.

On Sat, Mar 21, 2020 at 11:37 AM Priyank Modi <priyankmod...@gmail.com> wrote:

> Hi Hector,
> Thank you so much for taking the time to look at my challenge in detail
> and providing the feedback. I already understand this error and will work
> on removing all '#' symbols in the final submission of my coding
> challenge. To start with, the number of '#'s was at least 3-4 times what I
> have currently. Quite a few of these still exist because these words were
> already added to the bidix, but the monodix for Punjabi was almost empty
> when I started off (you can check the original repo in the incubator).
>
> Anyway, this has been really helpful and I'll make sure to improve on
> this. Since you couldn't read the script, I should tell you that I'm able
> to achieve close-to-human translation for most of these test sentences
> (as said earlier, I'll be including an analysis in my proposal explaining
> the translations in IPA, with which I'll need your help in reviewing as
> well 😬)
>
> I was able to find some dictionaries and parallel texts for both
> languages. Is there anything else I can do right now? Could you help me
> with some references on the use of case markers during translation as
> well? :)
>
> Thank you again.
> Warm regards,
> Priyank
>
> On Sat 21 Mar, 2020, 10:49 AM Hèctor Alòs i Font, <hectora...@gmail.com> wrote:
>
>> Hi Priyank,
>>
>> I've been looking at your coding challenge. I can't understand anything,
>> but I see the symbol # relatively often. That is annoying. See:
>> http://wiki.apertium.org/wiki/Apertium_stream_format#Special
>>
>> This happens, for instance, when in the bidix the target word has a
>> given gender and/or case, but in the monodix it has another. The lemma
>> is recognized, but there isn't any information for generating the
>> surface form as received from the bidix + transfer.
>>
>> Using apertium-viewer, I analysed this case:
>>
>> सब
>> ^सब/सब<adj><mfn><sp>/सब<adj><m><sg><nom>/सब<adj><m><sg><obl>/सब<adj><m><pl><nom>/सब<adj><m><pl><obl>/सब<adj><f><sg><nom>/सब<adj><f><sg><obl>/सब<adj><f><pl><nom>/सब<adj><f><pl><obl>/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>>
>> ^सब/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
>> ^सब<prn><pers><p3><mf><pl><nom>$
>> ^सब<prn><pers><p3><mf><pl><nom>/ਸਭ<prn><pers><p3><mf><pl><nom>$
>> ^default<default>{^ਸਭ<prn><pers><p3><mf><pl><nom>$}$
>> ^ਸਭ<prn><pers><p3><mf><pl><nom>$
>> #ਸਭ
>>
>> As expected, the problem is that ^ਸਭ<prn><pers><p3><mf><pl><nom>$ cannot
>> be generated.
>>
>> Then I do:
>>
>> apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
>> "<ਸਭ>"
>> "ਸਭ" adj mfn sp
>> "ਸਭ" adj m sg nom
>> "ਸਭ" adj m sg obl
>> "ਸਭ" adj m pl nom
>> "ਸਭ" adj m pl obl
>> "ਸਭ" adj f sg nom
>> "ਸਭ" adj f sg obl
>> "ਸਭ" adj f pl nom
>> "ਸਭ" adj f pl obl
>> "<.>"
>> "." sent
>>
>> So that's the problem: in the bidix it is said that ਸਭ is a pronoun, but
>> in the monodix it is defined as an adjective.
>>
>> By the way, it seems strange that you have 9 analyses for this
>> adjective. Usually in these cases we put only the first analysis in the
>> dictionary. The others, if really needed, can be added as <e r="RL">.
>>
>> Best,
>> Hèctor
>>
>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 Mar 2020 at 0:29:
>>
>>> Hi Hector, Francis;
>>> I've made progress on the coding challenge and wanted your feedback on
>>> it: https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
>>> (The bin files remained after a `make clean`, so I didn't remove them
>>> from the repo; let me know if this is incorrect.)
>>>
>>> - I've attempted to translate the file already added in the original
>>>   repository
>>>   <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>.
>>>   Output file:
>>>   <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
>>> - Right now, I'm fixing the few missing, untranslated, or incorrectly
>>>   translated words and focusing more on translating a full article
>>>   which can be compared against a benchmark (parallel text), using the
>>>   techniques mentioned in the section on Building dictionaries
>>>   <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll be
>>>   mentioning the WER and coverage details in my proposal.
>>> - As Hector mentioned last time, I've been able to find some parallel
>>>   texts and am asking others to free their resources. I was able to
>>>   retrieve a good corpus available on request (owned by the tourism
>>>   department of the state). Could someone send me the terms for safely
>>>   using a corpus?
>>> - Given that both Hindi and Punjabi have phonemic orthographies, could
>>>   we use fuzzy string matching (simple string mapping in this case) to
>>>   translate proper nouns/borrowed words (at least single-word NEs)?
>>> - Finally, could you point me to some resources about the way case
>>>   markers and dependencies are used in the Apertium model? This could
>>>   be crucial for this language pair, because most of the POS tagging
>>>   and chunking revolves around case markers and dependency relations.
>>>
>>> Thank you so much for the support.
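On the "simple string mapping" idea for proper nouns: since Devanagari and Gurmukhi occupy largely parallel Unicode blocks (U+0900-U+097F and U+0A00-U+0A7F), a first approximation can be sketched as a codepoint-offset transliterator. This is only an illustration of the idea, not the pair's actual behaviour: the offset holds only for the aligned parts of the blocks, and real cognates often differ (e.g. Hindi सब corresponds to Punjabi ਸਭ, not the letter-for-letter ਸਬ this produces).

```python
# Naive Devanagari -> Gurmukhi transliteration sketch.
# Assumption: the two Unicode blocks are largely parallel, so shifting
# each codepoint by 0x100 maps many letters to their Gurmukhi
# counterparts. Characters outside the Devanagari block pass through
# unchanged; no attempt is made to handle the non-aligned codepoints.
OFFSET = 0x0A00 - 0x0900

def translit_deva_to_guru(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:  # Devanagari block
            out.append(chr(cp + OFFSET))
        else:
            out.append(ch)
    return "".join(out)

# स (U+0938) -> ਸ (U+0A38), ब (U+092C) -> ਬ (U+0A2C)
print(translit_deva_to_guru("सब"))  # ਸਬ (note: the real Punjabi word is ਸਭ)
```

In practice such a mapping would only be a fallback for unknown single-word named entities, after the dictionaries have had their say.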
>>> Have a great day!
>>>
>>> Warm regards,
>>> PM
>>>
>>> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>
>>>> Hi Priyank,
>>>>
>>>> I calculated the coverage on the Wikipedia dumps I got, which I also
>>>> used for getting the frequency lists. I think this is fair, since
>>>> these corpora are enormous. But I calculated WER on the basis of other
>>>> texts. I calculated it only a few times, at fixed project benchmarks,
>>>> since I needed 2-3 hours for it (maybe because I work too slowly).
>>>> Each time I got 3 pseudo-random "good" Wikipedia articles (the
>>>> featured article of the day and two more) and just took the
>>>> introduction of each. This adds up to c. 1,000 words. Sometimes I took
>>>> random front-page news from top newspapers (typically sociopolitical).
>>>> For the final calculation, I got 4-5 short texts from both Wikipedia
>>>> and newspapers (c. 1,500 words). This reflects the type of language I
>>>> aimed at: the idea has been to develop a tool for a more or less
>>>> under-resourced language, especially for helping the creation of
>>>> Wikipedia articles.
>>>>
>>>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another
>>>> strategy for following the evolution of WER/PER (see
>>>> http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He got a
>>>> reference text for the whole project and automatically tested against
>>>> it at the end of every week. If you use this strategy, you have to be
>>>> very disciplined and not be influenced by the mistakes you see in
>>>> these tests (this means not adding certain words to dictionaries, or
>>>> morphological disambiguation rules, lexical selection rules, or
>>>> transfer rules, because of errors detected during these weekly tests).
>>>> I am not really a good example of discipline at work, so I prefer the
>>>> more manual, and more time-consuming, method described above.
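For reference, the WER figure discussed above is word-level edit distance between a reference translation and the MT output, divided by the reference length. Apertium has its own evaluation tooling for this; the sketch below just illustrates the metric itself, with naive whitespace tokenization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.

    Naive whitespace tokenization; a real evaluation would use a proper
    tokenizer and normalisation.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a b c d", "a x c d"))  # 0.25 (one substitution in four words)
```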
>>>>
>>>> Currently, I'm preparing my own proposal, and I'm doing the same as
>>>> you. Like yours, my proposal includes a widely-used language, which is
>>>> released in Apertium, and a (very) under-resourced language,
>>>> unreleased in Apertium, which needs a lot of work. I have got a test
>>>> text for both languages and I've added the needed words to the
>>>> dictionaries, so that most of the text is translated. It is just a
>>>> test, because there are still big errors due to the lack of transfer
>>>> rules (although I've copied some useful transfer rules from another
>>>> closely related language pair). I'm currently collecting resources:
>>>> dictionaries, texts in the under-resourced language, and bilingual
>>>> texts (in my case it is not so easy, because the under-resourced
>>>> language is really very under-resourced, there are several competing
>>>> orthographies, and there is very big dialectal variety). I'm also
>>>> seeing which major transfer rules have to be included. In your case, I
>>>> suppose you'll use a 3-stage transfer, so you should plan what will
>>>> have to be done in each of stages 1, 2 and 3. This includes planning
>>>> which information the chunk headers created at stage 1 should carry. I
>>>> guess the Hindi-Urdu language pair can be a good starting point, but
>>>> maybe something else will need to be added in the headers, since Hindi
>>>> and Urdu are extremely close languages, and Punjabi, as far as I know,
>>>> is not so close to Hindi.
>>>>
>>>> Best,
>>>> Hèctor
>>>>
>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 Mar 2020 at 2:44:
>>>>
>>>>> Hi Hector,
>>>>> Thank you so much for the reply. The proposals were really helpful.
>>>>> I've completed the coding challenge for a small set of 10 sentences
>>>>> (for now), which I believe Francis has added to the repo as a test
>>>>> set. I'll include the same in the proposal.
>>>>> For now, I'm working on building the dictionaries using the wiki
>>>>> dumps as mentioned in the documentation, adding the most frequent
>>>>> words systematically.
>>>>> Looking through your proposal, I noticed that you included metrics
>>>>> like WER and coverage to determine progress. I just wanted to confirm
>>>>> whether these are computed against the dumps one downloads for the
>>>>> respective languages (which seems to be the case, from the way you
>>>>> mentioned it in your own proposal), or whether there is some separate
>>>>> benchmark. This will be helpful, as I can then go ahead and describe
>>>>> the current state of the dictionaries in a more statistical manner.
>>>>>
>>>>> Finally, is there something else I can do to make my proposal better?
>>>>> Or is it advisable to start working on my proposal/some other
>>>>> non-entry-level project?
>>>>>
>>>>> Thank you for sharing the proposals and the guidance once again.
>>>>> Have a great day!
>>>>>
>>>>> Warm regards,
>>>>> PM
>>>>>
>>>>> --
>>>>> Priyank Modi ● Undergrad Research Student
>>>>> IIIT-Hyderabad ● Language Technologies Research Center
>>>>> Mobile: +91 83281 45692
>>>>> Website <https://priyankmodipm.github.io/> ● Linkedin
>>>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>>>
>>>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com> wrote:
>>>>>
>>>>>> Hi Priyank,
>>>>>>
>>>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. It is usual
>>>>>> that closely related pairs give not very satisfactory results with
>>>>>> Google, because most of the time there is an intermediate
>>>>>> translation into English. In any case, if you can give some data
>>>>>> about the quality of the Google translator (as I did in my 2019 GSoC
>>>>>> application
>>>>>> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>),
>>>>>> it may be useful, I think.
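An aside on the coverage metric raised above: naive coverage is usually the share of corpus tokens the morphological analyser recognises, where lt-proc marks unknown surface forms with a leading `*` in the stream format. A rough sketch of that count over analyser output (assuming that convention; the actual Apertium coverage scripts are more careful about tokenization and multiwords):

```python
import re

def naive_coverage(analyser_output: str) -> float:
    """Share of known tokens in apertium stream output.

    Assumes unknown words appear as ^word/*word$ (lt-proc's '*' marker)
    and every lexical unit is wrapped in ^...$.
    """
    lexical_units = re.findall(r"\^([^$]*)\$", analyser_output)
    if not lexical_units:
        return 0.0
    known = sum(1 for lu in lexical_units
                if not lu.split("/", 1)[-1].startswith("*"))
    return known / len(lexical_units)

sample = "^सब/सब<prn>$ ^foo/*foo$"
print(naive_coverage(sample))  # 0.5
```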
>>>>>>
>>>>>> In order to submit an application for a language-pair development
>>>>>> project, you are required to pass the so-called "coding challenge"
>>>>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>.
>>>>>> Basically, this will show that you understand the basics of the
>>>>>> architecture and know how to add new words to the dictionaries.
>>>>>>
>>>>>> For the project itself, you'll need to add many words to the Punjabi
>>>>>> and Punjabi-Hindi dictionaries, plus transfer rules and lexical
>>>>>> selection rules. If you intend to translate from Punjabi, you'll
>>>>>> also need to work on morphological disambiguation, which needs at
>>>>>> least a couple of weeks of work. This is basic, since plenty of
>>>>>> errors in Indo-European languages (and, I guess, not only in them)
>>>>>> come from bad morphological disambiguation. Usually, closed
>>>>>> categories are added to the dictionaries first, and afterwards words
>>>>>> are mostly added using frequency lists. If there are free resources
>>>>>> you may use, this would be great, but it is absolutely necessary not
>>>>>> to automatically copy from copyrighted materials. For my own
>>>>>> application this year, I'm asking people to free their resources so
>>>>>> that I am able to use them.
>>>>>>
>>>>>> You may be interested in previous applications for developing
>>>>>> language pairs, for instance this one
>>>>>> <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>,
>>>>>> in addition to mine from last year.
>>>>>>
>>>>>> Best wishes,
>>>>>> Hèctor
>>>>>>
>>>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 Mar 2020 at 23:49:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am trying to work towards developing the Hindi-Punjabi pair and
>>>>>>> needed some guidance on how to go about it.
>>>>>>> I ran the test files and could see that the dictionary file for
>>>>>>> Punjabi needs work (even a lot of function words could not be found
>>>>>>> by the translator). Should I start with that? Are there some tests
>>>>>>> each stage needs to pass? Also, what sort of work is expected to
>>>>>>> make a decent GSoC proposal? Of course, I'll be interested in
>>>>>>> developing this pair regardless, since even Google Translate
>>>>>>> doesn't seem to work well for it (for the test set specifically,
>>>>>>> the Apertium translator worked significantly better).
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> PM
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Apertium-stuff mailing list
>>>>>>> Apertium-stuff@lists.sourceforge.net
>>>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff