Re: [Mt-list] MT for languages with multiple scripts

Christian Boitet Sat, 05 Sep 2009 04:36:11 -0700

Dear all,                                       5/9/09


interesting topic!

At 14:47 +1000 5/09/09, Vadim Berman wrote:

Hi Jeff,

An interesting question.
In our system, we tend to store it in one lexicon with different stylistic tags.
Obviously, the grammar, etc. must be reused.

Best regards,
Vadim

----- Original Message ----- From: "Jeff Allen" <[email protected]>
To: <[email protected]>
Sent: Friday, September 04, 2009 11:54 PM
Subject: [Mt-list] MT for languages with multiple scripts
Hi listers,
What is experience on creating MT systems for languages with multiple scripts?


Abbas Malik has been working exactly on that in our lab.

Hindi-Urdu is an interesting case, because it is really the same language (earlier called Hindustani), with lexical variants (there are pairs of terms of Arabo-Persian origin and Sanskrit origin, to simplify), but 2 scripts. Urdu speakers usually don't read the Nagari script, and Hindi speakers don't read the Arabic-based script, but they understand each other: no need for subtitles in Bollywood films.

That pair is an instance of the very interesting subclass of pairs for which the translation problem is enormously reduced: strongly surface-related languages.

In general, a sentence in L1 has thousands of acceptable translations (millions if it is long), where "acceptable" means "acceptable by professional translators or teachers of translation".

That is the main reason why using BLEU, NIST, etc. to measure translation quality fails in practice (these measures don't correlate with professional judgments of translation quality, or only when the translations are very bad). However, they can be used profitably to measure the progress of an MT system towards a goal expressed by a set of reference translations (for expert, or hand-crafted, MT systems, one should then include posteditions of MT output in the references).
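
To make the remark about references concrete, here is a minimal sketch in Python of clipped n-gram precision against several references, which is the core ingredient of BLEU (it is not full BLEU: no brevity penalty, no averaging over n-gram orders). The function names and the three sample sentences are invented for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against several references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Invented sample data: one MT output, one independent reference,
# and one postedition of that same MT output.
mt_output = "the system translates short sentences well"
references = ["the system handles short sentences correctly"]
postedited = "the system translates short sentences correctly"

before = clipped_precision(mt_output, references, n=2)
after = clipped_precision(mt_output, references + [postedited], n=2)
print(before, after)
```

With the independent reference alone, only 2 of the 5 bigrams of the MT output are matched; once the postedition is added to the reference set, 4 of 5 are. The score then tracks the distance to what the system can realistically be expected to produce, not to one arbitrary translation among thousands.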

By contrast, if (L1, L2) is a pair of "strongly surface-related languages", a sentence in L1 has only 1 or a very small number of "exact translations" (and conversely), the other possible translations being clearly judged as paraphrases. In this way, it is similar to the speech recognition problem.

Concerning that class, one can speak of "the translation problem", meaning the "exact translation problem".

- choose only one script

That would be possible if the transformation from one script to the other were straightforward. Unfortunately, it is often a hard problem because of mutual underspecification. For example, short vowels are not written in the Urdu script, and some consonant-related distinctions are not made in the Hindi script.
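
The mutual underspecification can be sketched with a toy example in Python. The mapping table below is an assumption made for illustration and is far from complete; the function names are hypothetical, and the Urdu side is romanized to keep the example short.

```python
from itertools import product

# Toy table (an assumption for illustration, far from complete): one
# Devanagari consonant and the Urdu letters it may correspond to.
HINDI_TO_URDU = {
    "स": ["س", "ص", "ث"],  # all pronounced /s/ in Urdu
    "त": ["ت", "ط"],       # both pronounced /t/
    "ह": ["ہ", "ح"],       # both pronounced /h/
    "ल": ["ل"],
    "क": ["ک"],
}

def urdu_candidates(devanagari_consonants):
    """All Urdu consonant spellings compatible with a Devanagari sequence."""
    options = [HINDI_TO_URDU[c] for c in devanagari_consonants]
    return ["".join(choice) for choice in product(*options)]

SHORT_VOWELS = ["a", "i", "u"]  # usually left unwritten in Urdu

def vocalizations(skeleton):
    """Romanized readings of a two-consonant Urdu skeleton such as 'kl'."""
    first, second = skeleton
    return [first + v + second for v in SHORT_VOWELS]

s_t_l = urdu_candidates(["स", "त", "ल"])  # 3 * 2 * 1 candidate spellings
k_l = vocalizations("kl")                  # kal, kil, kul
print(len(s_t_l), k_l)
```

In the Urdu-to-Hindi direction, the unvocalized skeleton k-l can be read "kal" or "kul", which are two different words with two different Nagari spellings. In the other direction, each Nagari consonant may have to be expanded into one of several Urdu letters. Choosing among the candidates needs a lexicon or context, which is why the conversion is not a simple character mapping.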

- Separate system per script
- Toggle between variants after system launch

There are methods to divide a text into homogeneous regions corresponding to triples (language, script, encoding), but

- etc
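
As a rough illustration of dividing a text into homogeneous regions, here is a sketch in Python that handles only the script part of the (language, script, encoding) triple. It assumes the text is already decoded to Unicode, and it uses the first word of each character's Unicode name as a stand-in for its script, which is a heuristic of mine and not an official script property. The function names are hypothetical.

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Rough script label: first word of the Unicode character name."""
    if not ch.isalpha():
        return None  # spaces, digits, marks, punctuation: no script of their own
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters are attached to the run on their left.
    """
    labelled = []
    current = None
    for ch in text:
        current = script_of(ch) or current
        labelled.append((current, ch))
    return [(script, "".join(c for _, c in group))
            for script, group in groupby(labelled, key=lambda pair: pair[0])]

sample = "Hindi हिंदी and Urdu اردو"
runs = script_runs(sample)
for script, chunk in runs:
    print(script, repr(chunk))
```

Identifying the language inside each run (Hindi vs. Marathi in Nagari, Urdu vs. Persian in the Arabic-based script) and guessing the encoding of undecoded bytes are separate and harder steps that this sketch does not attempt.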

What factors that have influenced how you have chosen to implement multiple
scripts for a given language?  As source language and/or as target language.

Examples of languages:

* Chinese (Traditional, Simplified, and handling Kanji)

"Kanji" is the term used for Sino-Japanese ideograms. "Hanzi" is used for Chinese characters. I read that only 1600 characters are "simplified" (the traditional forms seem to be "reused" in parallel with the simplified forms, so that the 2 forms coexist for a sizable part of these 1600) and there are 2 different subsets (of a total of about 80000 characters since the origin) used in mainland China and Taiwan. On PCs, around 1990, Japanese versions of OS stored about 5000 kanji, and Chinese versions about 8000. More recent versions should have all that exist in Unicode; I did not check.

--> Perhaps the best source of information concerning these problems, and those of mutual transliterations and transcriptions (in particular of proper nouns), is Jack Halpern (cjk.org).

Question: what do you mean by "handling Kanji"? Did you find some interesting case where a Chinese text contains specific kanji (I mean, characters created in Japan and absent from the Chinese character set)?


Perhaps there is also some need of handling texts in pinyin for MT?

* Bosnian/Croatian/Serbian (Latin alphabet, Cyrillic alphabet)
* Mongolian (Classic script, Cyrillic script)

Yes, these examples are interesting. Somebody told me Mongolian also has a Chinese-based script.

Many languages from the Turkic family also have 2 or 3 scripts. Turkish itself was written in an Arabic script before Atatürk. Azeri, Uzbek, Tajik, etc. have Arabic, Cyrillic and Latin-based scripts, and perhaps some languages of Central Asia can also be written with hanzi.


Best regards,

Xan


Regards,

Jeff
_______________________________________________
Mt-list mailing list



--
-------------------------------------------------------------------------
Christian Boitet
(Pr. Université Joseph Fourier)
Groupe d'Etude pour la Traduction Automatique
                 et le Traitement Automatisé des Langues et de la Parole
G        E             T          A              L                P

GETALP, LIG-campus, BP 53 (ex: GETA, CLIPS, IMAG-campus)
Tel: +33 (0)4 76 51 43 55 / 51 48 17 Fax: +33 (0)4 76 63 56 86

385, rue de la Bibliothèque           Mel: [email protected]
38041 Grenoble Cedex 9, France
