Re: [Apertium-stuff] Guidance for hin-pan language pair development

Hèctor Alòs i Font Fri, 20 Mar 2020 22:19:44 -0700

Hi Priyank,

I've been looking at your coding challenge. I can't understand anything, but
I see the symbol # relatively often. That is annoying. See:
http://wiki.apertium.org/wiki/Apertium_stream_format#Special


This happens, for instance, when in the bidix the target word has a given
gender and/or case, but in the monodix it has another. The lemma is
recognized, but there isn't any information for generating the surface form
as received from the bidix + transfer.

Using apertium-viewer, I analysed this case:

सब
^सब/सब<adj><mfn><sp>/सब<adj><m><sg><nom>/सब<adj><m><sg><obl>/सब<adj><m><pl><nom>/सब<adj><m><pl><obl>/सब<adj><f><sg><nom>/सब<adj><f><sg><obl>/सब<adj><f><pl><nom>/सब<adj><f><pl><obl>/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$

^सब/सब<prn><pers><p3><mf><pl><nom>/सब<prn><pers><p3><mf><pl><obl>$
^सब<prn><pers><p3><mf><pl><nom>$
^सब<prn><pers><p3><mf><pl><nom>/ਸਭ<prn><pers><p3><mf><pl><nom>$
^default<default>{^ਸਭ<prn><pers><p3><mf><pl><nom>$}$
^ਸਭ<prn><pers><p3><mf><pl><nom>$
#ਸਭ

As expected, the problem is that ^ਸਭ<prn><pers><p3><mf><pl><nom>$ cannot be
generated.
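
To find every word of this kind in a translated file, one can scan the output
for the marker characters. What follows is only a minimal Python sketch. The
name marked_words is made up for this example and is not part of Apertium. It
assumes the markers described on the wiki page linked above: * for a word
unknown to the analyser, @ for a word missing from the bidix, and # for a word
that could not be generated.

```python
import re
from collections import Counter

# Marker, then the word it is attached to; the marker must start a token.
MARKED = re.compile(r"(?<!\S)([*@#])(\S+)")

# Punctuation that may be glued to the end of a word (includes the danda).
TRAILING = ".,;:!?।"


def marked_words(text):
    """Count (marker, word) pairs found in translator debug output."""
    counts = Counter()
    for marker, word in MARKED.findall(text):
        word = word.rstrip(TRAILING)
        if word:
            counts[(marker, word)] += 1
    return counts


if __name__ == "__main__":
    sample = "#ਸਭ ਲੋਕ *foo #ਸਭ।"
    for (marker, word), n in marked_words(sample).most_common():
        print(marker, word, n)
```

Sorting by frequency shows which generation failures are worth fixing first.
Multiword units containing spaces are not handled by this sketch.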

Then I do:
apertium-pan$ echo "ਸਭ" | apertium -d . pan_Guru-disam
"<ਸਭ>"
"ਸਭ" adj mfn sp
"ਸਭ" adj m sg nom
"ਸਭ" adj m sg obl
"ਸਭ" adj m pl nom
"ਸਭ" adj m pl obl
"ਸਭ" adj f sg nom
"ਸਭ" adj f sg obl
"ਸਭ" adj f pl nom
"ਸਭ" adj f pl obl
"<.>"
"." sent

So that's the problem: in the bidix ਸਭ is said to be a pronoun, but in
the monodix it is defined as an adjective.
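
This comparison can also be scripted. Here is a rough Python sketch, assuming
the CG-style output of the disam mode and the stream-format lexical unit that
comes out of transfer, both as shown above. The helper names (cg_readings,
lu_reading, monodix_has) are invented for this example; they are not Apertium
tools.

```python
import re


def cg_readings(cg_text):
    """Return (lemma, tags) pairs for every reading line in CG-style output."""
    readings = []
    for line in cg_text.splitlines():
        line = line.strip()
        # Skip blank lines and cohort lines such as "<word>".
        if not line.startswith('"') or line.startswith('"<'):
            continue
        lemma, _, tags = line[1:].partition('"')
        readings.append((lemma, tags.split()))
    return readings


def lu_reading(lexical_unit):
    """Split a lexical unit like ^lemma<tag1><tag2>$ into (lemma, tags)."""
    lemma = re.match(r"\^([^<$/]+)", lexical_unit).group(1)
    tags = re.findall(r"<([^<>]+)>", lexical_unit)
    return lemma, tags


def monodix_has(lexical_unit, cg_text):
    """True if the analyser output contains exactly this lemma and tag list."""
    return lu_reading(lexical_unit) in cg_readings(cg_text)
```

With the disam output above saved as a string, passing
^ਸਭ<prn><pers><p3><mf><pl><nom>$ to monodix_has returns False, because only
adj readings are listed, which is the mismatch that produces the #.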

By the way, it seems strange that you have 9 analyses for this adjective.
Usually in these cases we put only the first analysis in the dictionary.
The others, if really needed, can be added as <e r="RL">.

Best,
Hèctor


Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 19 Mar 2020
at 0:29:

> Hi Hector, Francis;
> I've made progress on the coding challenge and wanted your *feedback* on
> it - https://github.com/priyankmodiPM/apertium-hin-pan_pmodi
> *(The bin files remained after a `make clean`, so I didn't remove them
> from the repo, let me know if this is incorrect)*
>
> > I've attempted to translate the file already added in the original
> repository
> <https://github.com/apertium/apertium-hin-pan/tree/b8cea06c4748b24db7eb7e94b455a491425c04b5>
> .
> > Output file
> <https://github.com/priyankmodiPM/apertium-hin-pan_pmodi/blob/master/apertium-hin-pan/test_pan.txt>
> > Right now, I'm fixing the few missing, untranslated or incorrectly
> translated words and focusing more on translating a full article which can
> be compared against a benchmark (parallel text), using the techniques
> mentioned in the section on Building dictionaries
> <http://wiki.apertium.org/wiki/Building_dictionaries>. I'll be mentioning
> the WER and coverage details in my proposal.
> > As Hector mentioned last time, I've been able to find some parallel
> texts and am asking others to free their resources. I was able to retrieve
> a good corpus available on request (owned by the tourism department of the
> state). Could someone *send me the terms for safely using a corpus*?
> > Given that both Hindi and Punjabi have phonemic orthography, could we
> use *fuzzy string matching* (simple string mapping in this case) to
> translate proper nouns and borrowed words (at least single-word NEs)?
> > Finally, could you point me to some *resources about the way case
> markers and dependencies* are used in the Apertium model? This
> could be crucial for this language pair because most of the POS tagging and
> chunking revolves around the case markers and dependency relations.
>
> Thank you so much for the support. Have a great day!
>
> Warm regards,
> PM
>
> On Thu, Mar 12, 2020 at 10:46 AM Hèctor Alòs i Font <hectora...@gmail.com>
> wrote:
>
>> Hi Priyank,
>>
>> I calculated the coverage on the Wikipedia dumps I got, and which I used
>> for getting the frequency lists. I think this is fair, since these corpora
>> are enormous. But I calculated WER on the basis of other texts. I
>> calculated it only a few times, at fixed project benchmarks, since I needed
>> 2-3 hours for it (maybe because I work too slowly). Every time I got 3
>> pseudo-random "good" Wikipedia articles (the feature of the day and two
>> more). I just took the introduction at the beginning. This adds up to c. 1000
>> words. Sometimes I took random front page news from top newspapers
>> (typically, sociopolitical). In the final calculation, I got 4-5 short
>> texts from both Wikipedia and newspapers (c. 1500 words). This shows the
>> type of language I aimed at. The idea has been to develop a tool for a more or
>> less under-resourced language, especially for helping the creation of
>> Wikipedia articles.
>>
>> @Marc Riera Irigoyen <marc.riera.irigo...@gmail.com> has used another
>> strategy for following the evolution of WER/PER (see
>> http://wiki.apertium.org/wiki/Romanian_and_Catalan/Workplan). He got a
>> reference text for the whole project and automatically tested against it at
>> the end of every week. If you use this strategy, you have to be very
>> disciplined and not be influenced by the mistakes you see in these tests
>> (this means not adding certain words in dictionaries, or morphological
>> disambiguation rules, lexical selection rules, or transfer rules because of
>> detected errors during these weekly tests). I am not really a good example
>> of discipline at work, so I prefer to use the more manual, and more
>> time-consuming, method that I have described above.
>>
>> Currently, I'm preparing my own proposal, and I'm doing the same as you. Like you,
>> my proposal includes a widely-used language, which is released in Apertium,
>> and a (very) under-resourced language, unreleased in Apertium, which needs
>> a lot of work. I have got a test text for both languages and I've added the
>> needed words in the dictionaries, so that most of the text is translated.
>> It is just a test, because there are still big errors due to the lack of
>> transfer rules (although I've copied some useful transfer rules from
>> another closely related language pair). I'm currently collecting resources:
>> dictionaries, texts in the under-resourced language and bilingual texts (in
>> my case, it is not so easy, because the under-resourced language is really
>> very under-resourced, there are several competing orthographies, and there
>> is very large dialectal variation). I'm also seeing which major transfer rules
>> have to be included. In your case, I suppose you'll use a 3-stage transfer,
>> so you should plan what will have to be done in each of stages 1, 2 and 3.
>> This includes planning which information the chunk headers created at
>> stage 1 should carry. I guess the Hindi-Urdu language pair can be a good
>> option, but maybe something else would need to be added in the
>> headers, since Hindi and Urdu are extremely close languages, and Punjabi,
>> as far as I know, is not so close to Hindi.
>>
>> Best,
>> Hèctor
>>
>> Message from Priyank Modi <priyankmod...@gmail.com> on Thu, 12 Mar
>> 2020 at 2:44:
>>
>>> Hi Hector,
>>> Thank you so much for the reply. The proposals were really helpful. I've
>>> completed the coding challenge for a small set of 10 sentences (for now)
>>> which I believe Francis has added to the repo as a test set. I'll include
>>> the same in the proposal. For now, I'm working on building the dictionaries
>>> using the wiki dumps as mentioned in the documentation, adding the most
>>> frequent words systematically.
>>> Looking through your proposal, I noticed that you included metrics like
>>> WER and coverage to determine progress. I just wanted to confirm if these
>>> are being computed against the dumps one downloads for the respective
>>> languages(which seems to be the case seeing the way you mentioned the same
>>> in your own proposal)? Or is there some separate benchmark? This will be
>>> helpful as I can then go ahead and mention the current state of the
>>> dictionaries in a more statistical manner.
>>>
>>> Finally, is there something else I can do to make my proposal better? Or
>>> is it advisable to start working on my proposal/some other non-entry level
>>> project?
>>>
>>> Thank you for sharing the proposals and the guidance once again.
>>> Have a great day!
>>>
>>> Warm regards,
>>> PM
>>>
>>> --
>>> Priyank Modi      ●  Undergrad Research Student
>>> IIIT-Hyderabad        ●  Language Technologies Research Center
>>> Mobile:  +91 83281 45692
>>> Website <https://priyankmodipm.github.io/>    ●    Linkedin
>>> <https://www.linkedin.com/in/priyank-modi-81584b175/>
>>>
>>> On Sat, Mar 7, 2020 at 11:43 AM Hèctor Alòs i Font <hectora...@gmail.com>
>>> wrote:
>>>
>>>> Hi Priyank,
>>>>
>>>> Hindi-Punjabi seems to me a very nice pair for Apertium. Closely related
>>>> pairs usually give not very satisfactory results with Google,
>>>> because most of the time there is an intermediate translation into
>>>> English. In any case, if you can give some data about the quality of the
>>>> Google translator (as I did in my 2019 GSoC application
>>>> <http://wiki.apertium.org/wiki/Hectoralos/GSOC_2019_proposal:_Catalan-Italian_and_Catalan-Portuguese#Current_situation_of_the_language_pairs>),
>>>> it may be useful, I think.
>>>>
>>>> In order to present an application for a language-pair development it
>>>> is required to pass the so-called "coding challenge"
>>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Adopt_a_language_pair#Coding_challenge>.
>>>> Basically, this will show that you understand the basics of the architecture
>>>> and know how to add new words to the dictionaries.
>>>>
>>>> For the project itself, you'll need to add many words to the Punjabi
>>>> and Punjabi-Hindi dictionaries, transfer rules and lexical selection rules.
>>>> If you intend to translate from Punjabi, you'll need to work on
>>>> morphological disambiguation, which needs at least a couple of weeks of
>>>> work. This is basic, since plenty of errors in Indo-European languages
>>>> (and, I guess, not only) come from bad morphological disambiguation.
>>>> Usually, closed categories are added first in the dictionaries and
>>>> afterwards words are mostly added using frequency lists. If there are free
>>>> resources you may use, this would be great, but it is absolutely necessary
>>>> not to automatically copy from copyrighted materials. For my own
>>>> application this year, I'm asking people to free their resources in order
>>>> to be able to use them.
>>>>
>>>> You may be interested in previous applications for developing language
>>>> pairs, for instance this one
>>>> <http://wiki.apertium.org/wiki/Grfro3d/proposal_apertium_cat-srd_and_ita-srd>,
>>>> in addition to mine last year.
>>>>
>>>> Best wishes,
>>>> Hèctor
>>>>
>>>>
>>>> Message from Priyank Modi <priyankmod...@gmail.com> on Fri, 6 Mar
>>>> 2020 at 23:49:
>>>>
>>>>> Hi,
>>>>> I am trying to work towards developing the Hindi-Punjabi pair and
>>>>> needed some guidance on how to go about it. I ran the test files and could
>>>>> notice that the dictionary file for Punjabi needs work(even a lot of
>>>>> function words could not be found by the translator). Should I start with
>>>>> that? Are there some tests each stage needs to pass? Also, finally what
>>>>> sort of work is expected to make a decent GSOC proposal, of course I'll be
>>>>> interested in developing this pair regardless since even Google translate
>>>>> doesn't seem to work well for this pair(for the test set specifically the
>>>>> apertium translator worked significantly better)
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Warm regards,
>>>>> PM
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Apertium-stuff mailing list
>>>>> Apertium-stuff@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>>
