Dear all,                                       5/9/09

interesting topic!

At 14:47 +1000 5/09/09, Vadim Berman wrote:
Hi Jeff,

An interesting question.

In our system, we tend to store it in one lexicon with different stylistic tags.

Obviously, the grammar, etc. must be reused.

Best regards,
Vadim

----- Original Message -----
From: "Jeff Allen" <[email protected]>
To: <[email protected]>
Sent: Friday, September 04, 2009 11:54 PM
Subject: [Mt-list] MT for languages with multiple scripts

Hi listers,

What is your experience in creating MT systems for languages with multiple scripts?

Abbas Malik has been working exactly on that in our lab.

Hindi-Urdu is an interesting case, because it is really the same language (earlier called Hindustani), with lexical variants (to simplify, there are pairs of terms of Arabo-Persian and of Sanskrit origin), but 2 scripts. Urdu speakers usually don't read the Nagari script, and Hindi speakers don't read the Arabic-based script, but they understand each other: there is no need for subtitles in Bollywood films.

That pair is an instance of the very interesting subclass of pairs for which the translation problem is enormously reduced: strongly surface-related languages.

In general, a sentence in L1 has thousands of acceptable translations (millions if it is long), where "acceptable" means "acceptable to professional translators or teachers of translation".

That is the main reason why using BLEU, NIST, etc. to measure translation quality fails in practice (these measures don't correlate with professional judgments of translation quality, or do so only when the translations are very bad). However, they can be used profitably to measure the progress of an MT system towards a goal expressed by a set of reference translations (for expert, or hand-crafted, MT systems, one should then include post-editions of MT output in the references).
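To make the point about references concrete, here is a minimal sketch of the clipped ("modified") n-gram precision that BLEU is built on, computed against one and then two references. It is not full BLEU (no brevity penalty, no combination of n-gram orders). The function names and the toy sentences are assumptions made for illustration, and the second reference stands in for a post-edition of the MT output.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the single most favourable reference."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


# Toy data (hypothetical): one MT output, one independent reference,
# and a second reference playing the role of a post-edition.
mt_output = "the cat sits on the mat"
one_ref = ["a cat is sitting on the mat"]
two_refs = one_ref + ["the cat sits on the rug"]

print(clipped_precision(mt_output, one_ref))
print(clipped_precision(mt_output, two_refs))
```

With the single reference the unigram precision is 4/6; once the second reference is added it reaches 1.0, although the MT output itself has not changed. The score measures distance to the chosen references, not quality in general.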

By contrast, if (L1, L2) is a pair of "strongly surface-related languages", a sentence in L1 has only 1 or a very small number of "exact translations" (and conversely), the other possible translations being clearly judged as paraphrases. In this way, it is similar to the speech recognition problem.

Concerning that class, one can speak of "the translation problem", meaning the "exact translation problem".

- choose only one script

That would be possible if the transformation from one script to the other were straightforward. Unfortunately, it is often a hard problem because of mutual underspecification. For example, short vowels are not written in the Urdu script, and some consonant distinctions are not made in the Hindi script.
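The mutual underspecification can be illustrated by counting candidates in each direction. The sketch below is a toy under stated assumptions: the consonant table is a tiny, incomplete sample, the Urdu skeleton is romanized for readability, and all function names are hypothetical. It only enumerates candidates; choosing the right one is the hard part.

```python
from itertools import product

# Toy table (illustrative assumption, far from complete): Devanagari
# consonants whose sound is spelled by several distinct Urdu letters.
HINDI_TO_URDU = {
    "स": ["س", "ص", "ث"],
    "त": ["ت", "ط"],
    "ह": ["ہ", "ح"],
}

# Short vowels that the Urdu script usually leaves unwritten, romanized
# here for readability; "" stands for "no vowel".
UNWRITTEN_VOWELS = ["", "a", "i", "u"]


def expand(symbols, options_for):
    """Return every string obtained by choosing one option per symbol."""
    choices = [options_for(s) for s in symbols]
    return ["".join(combo) for combo in product(*choices)]


def hindi_to_urdu_candidates(text):
    # Symbols missing from the toy table are copied unchanged.
    return expand(text, lambda ch: HINDI_TO_URDU.get(ch, [ch]))


def urdu_skeleton_to_vocalized(skeleton):
    # After each consonant, any of the unwritten short vowels may follow.
    return expand(skeleton, lambda c: [c + v for v in UNWRITTEN_VOWELS])


# An arbitrary two-consonant string and an arbitrary romanized skeleton.
print(len(hindi_to_urdu_candidates("सत")))     # 3 * 2 = 6 candidates
print(len(urdu_skeleton_to_vocalized("ktb")))  # 4 ** 3 = 64 candidates
```

The number of candidates grows multiplicatively with the length of the word, so a pure character table is not enough in either direction; some lexical or contextual knowledge is needed to pick one.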

- Separate system per script
- Toggle between variants after system launch

There are methods to divide a text into homogeneous regions corresponding to triples (language, script, encoding), but

- etc
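The division of a text into homogeneous regions mentioned above can be sketched for the script dimension alone. This is a rough sketch, not one of the published methods: it assumes the text is already decoded to Unicode (so the encoding part of the triple is settled), it says nothing about language, and it guesses the script from the first word of each character's Unicode name because the Python standard library does not expose the Unicode Script property. The function names are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Rough script label: first word of the Unicode character name.
    Non-letters (spaces, digits, punctuation, combining marks) get None."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text):
    """Split text into maximal runs of one script.
    Characters without a script label are attached to the current run."""
    runs = []  # list of [script, substring]
    for ch in text:
        label = script_of(ch)
        if runs and (label is None or label == runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([label, ch])
    return [(script, chunk) for script, chunk in runs]


print(script_runs("Hello мир"))
# [('LATIN', 'Hello '), ('CYRILLIC', 'мир')]
```

A real system would also have to separate languages that share a script (Urdu and Arabic, Serbian and Russian), which this label cannot do.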

What factors have influenced how you have chosen to implement multiple
scripts for a given language?  As source language and/or as target language.

Examples of languages:

* Chinese (Traditional, Simplified, and handling Kanji)

"Kanji" is the term used for Sino-Japanese ideograms. "Hanzi" is used for Chinese characters. I read that only 1600 characters are "simplified" (the traditional forms seem to be "reused" in parallel with the simplified forms, so that the 2 forms coexist for a sizable part of these 1600), and that there are 2 different subsets (of a total of about 80,000 characters since the origin) used in mainland China and Taiwan. On PCs, around 1990, Japanese versions of OSs stored about 5000 kanji, and Chinese versions about 8000. More recent versions should have all that exist in Unicode; I did not check.

--> Perhaps the best source of information concerning these problems, and those of mutual transliterations and transcriptions (in particular of proper nouns), is Jack Halpern (cjk.org).

Question: what do you mean by "handling Kanji"? Did you find some interesting case where a Chinese text contains specific kanji (I mean, characters created in Japan and absent from the Chinese character set)?

Perhaps there is also some need to handle texts in pinyin for MT?

* Bosnian/Croatian/Serbian (Latin alphabet, Cyrillic alphabet)
* Mongolian (Classic script, Cyrillic script)

Yes, these examples are interesting. Somebody told me Mongolian also has a Chinese-based script.
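Serbian is a case where one direction of the script conversion is simple: every Cyrillic letter corresponds to one Latin letter or digraph. A minimal sketch of that direction, with a hypothetical function name, handling only letter-by-letter case (a word in all capitals would come out with mixed-case digraphs):

```python
# Serbian Cyrillic to Latin, lowercase letters only; case is restored below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    out = []
    for ch in text:
        lower = ch.lower()
        lat = CYR_TO_LAT.get(lower)
        if lat is None:
            out.append(ch)                # not a Serbian Cyrillic letter
        elif ch != lower:
            out.append(lat.capitalize())  # uppercase input: "Љ" gives "Lj"
        else:
            out.append(lat)
    return "".join(out)


print(cyrillic_to_latin("Његош из Београда"))  # Njegoš iz Beograda
```

The reverse direction needs more care, because "lj", "nj" and "dž" are digraphs in the Latin alphabet and a letter sequence such as "nj" does not always stand for a single Cyrillic letter.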

Many languages of the Turkic family also have 2 or 3 scripts. Turkish itself was written in an Arabic script before Atatürk. Azeri, Uzbek, Tajik, etc. have Arabic-, Cyrillic- and Latin-based scripts, and perhaps some languages of Central Asia can also be written with hanzi.

Best regards,

Xan


Regards,

Jeff
_______________________________________________
Mt-list mailing list


--
-------------------------------------------------------------------------
Christian Boitet
(Pr. Université Joseph Fourier)
Groupe d'Etude pour la Traduction Automatique
                 et le Traitement Automatisé des Langues et de la Parole
G        E             T          A              L                P

GETALP, LIG-campus, BP 53 (ex: GETA, CLIPS, IMAG-campus) Tel: +33 (0)4 76 51 43 55 / 51 48 17 Fax: +33 (0)4 76 63 56 86
385, rue de la Bibliothèque           Mel: [email protected]
38041 Grenoble Cedex 9, France 