Dear all, 5/9/09
interesting topic!
At 14:47 +1000 5/09/09, Vadim Berman wrote:
Hi Jeff,
An interesting question.
In our system, we tend to store it in one
lexicon with different stylistic tags.
Obviously, the grammar, etc. must be reused.
Best regards,
Vadim
----- Original Message ----- From: "Jeff Allen" <[email protected]>
To: <[email protected]>
Sent: Friday, September 04, 2009 11:54 PM
Subject: [Mt-list] MT for languages with multiple scripts
Hi listers,
What is your experience with creating MT systems
for languages with multiple scripts?
Abbas Malik has been working exactly on that in our lab.
Hindi-Urdu is an interesting case, because it is
really the same language (earlier called
Hindustani), with lexical variants (there are
pairs of terms of Arabo-Persian origin and
Sanskrit origin, to simplify), but 2 scripts.
Urdu speakers usually don't read the Nagari
script, and Hindi speakers don't read the
Arabic-based script, but they understand each
other: no need for subtitles in Bollywood films.
That pair is an instance of the very interesting
subclass of pairs for which the translation
problem is enormously reduced: strongly
surface-related languages.
In general, a sentence in L1 has thousands of
acceptable translations (millions if they are
long), where "acceptable" means "acceptable by
professional translators or teachers of
translation".
That is the main reason why using BLEU, NIST,
etc. to measure translation quality fails in
practice (these measures don't correlate with
professional judgments of translation quality, or
only when the translations are very bad). However,
they can be used profitably to measure the
progress of an MT system towards a goal expressed
by a set of reference translations (for expert, or
hand-crafted, MT systems, one should then include
post-edited MT output in the references).
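To make the point about reference sets concrete,
here is a minimal sketch in Python of clipped
n-gram precision against several references. It is
only the core ingredient of BLEU, not BLEU itself
(no brevity penalty, no averaging over n-gram
orders). The function names and the toy sentences
are assumptions made for the illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """Share of candidate n-grams found in at least one reference.

    Each n-gram count is clipped to the maximum count observed in any
    single reference, as in BLEU's modified precision.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


# Toy data (invented for the example).
mt_output = "the cat sat on the mat"
human_reference = "a cat was sitting on the mat"
post_edited_mt = "the cat sat on the mat"  # MT output accepted as is

print(clipped_precision(mt_output, [human_reference]))
print(clipped_precision(mt_output, [human_reference, post_edited_mt]))
```

The second call scores higher only because the
post-edited output was added to the references,
which is why such scores measure distance to a
chosen set of references rather than quality.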
By contrast, if (L1, L2) is a pair of "strongly
surface-related languages", a sentence in L1 has
only 1 or a very small number of "exact
translations" (and conversely), the other
possible translations being clearly judged as
paraphrases. In this way, it is similar to the
speech recognition problem.
Concerning that class, one can speak of "the
translation problem", meaning the "exact
translation problem".
- choose only one script
That would be possible if the transformation from
one script to the other were straightforward.
Unfortunately, it is often a hard problem because
of mutual underspecifications. For example, short
vowels are not written in the Urdu script, and
some consonant-related distinctions are not made
in the Hindi script.
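The mutual underspecification can be sketched in
a few lines of Python. This is a toy illustration,
not the method used in our lab: the mapping tables
are tiny and assumed for the example, and the
function names are hypothetical. Going from Urdu
to Hindi, each unwritten short vowel opens several
candidates; going from Hindi to Urdu, one
Devanagari consonant may correspond to several
Urdu letters.

```python
from itertools import product

# Toy tables, for illustration only (assumed, far from complete).
# Urdu consonant -> Devanagari consonant.
URDU_TO_DEVANAGARI = {
    "\u06a9": "\u0915",  # kaf -> ka
    "\u0644": "\u0932",  # lam -> la
    "\u0645": "\u092e",  # mim -> ma
}
# Short vowels usually left unwritten in Urdu:
# inherent a (no sign), i, u.
SHORT_VOWEL_SIGNS = ["", "\u093f", "\u0941"]

# Devanagari consonant -> Urdu letters pronounced the same way.
DEVANAGARI_TO_URDU = {
    "\u0938": ["\u0633", "\u0635", "\u062b"],  # sa -> sin, sad, se
    "\u0924": ["\u062a", "\u0637"],            # ta -> te, toe
    "\u092c": ["\u0628"],                      # ba -> be
}


def devanagari_candidates(urdu_skeleton):
    """All vowelizations of an Urdu consonant skeleton (toy model)."""
    consonants = [URDU_TO_DEVANAGARI[ch] for ch in urdu_skeleton]
    if not consonants:
        return []
    # One short-vowel slot after every consonant except the last.
    slots = [SHORT_VOWEL_SIGNS] * (len(consonants) - 1)
    return [
        "".join(c + v for c, v in zip(consonants, vowels)) + consonants[-1]
        for vowels in product(*slots)
    ]


def urdu_candidates(devanagari_word):
    """All Urdu spellings compatible with a Devanagari consonant string."""
    options = [DEVANAGARI_TO_URDU[ch] for ch in devanagari_word]
    return ["".join(choice) for choice in product(*options)]


print(devanagari_candidates("\u06a9\u0644"))  # kaf + lam: kal, kil, kul
print(urdu_candidates("\u0938\u092c"))        # sa + ba: three spellings
```

Choosing among the candidates would need a lexicon
or a language model, which is not shown here.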
- Separate system per script
- Toggle between variants after system launch
There are methods to divide a text into
homogeneous regions corresponding to triples
(language, script, encoding), but
- etc
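As a rough illustration of dividing a text into
script-homogeneous regions, here is a minimal
Python sketch. It handles only the script part of
the (language, script, encoding) triple, and it
uses the first word of the Unicode character name
as a stand-in for the script, which is a
simplifying assumption and not the real Unicode
Script property. The names KNOWN_SCRIPTS,
script_of and split_by_script are hypothetical.

```python
import unicodedata

# Script labels recognised by this sketch (assumed, incomplete list).
KNOWN_SCRIPTS = {
    "LATIN", "CYRILLIC", "ARABIC", "DEVANAGARI",
    "CJK", "HIRAGANA", "KATAKANA", "MONGOLIAN",
}


def script_of(ch):
    """Rough script label of a character, or None if neutral/unknown."""
    name = unicodedata.name(ch, "")
    first_word = name.split(" ")[0] if name else ""
    return first_word if first_word in KNOWN_SCRIPTS else None


def split_by_script(text):
    """Split text into (script, substring) regions.

    Neutral characters (spaces, digits, punctuation) are attached to
    the region in progress instead of opening a new one.
    """
    regions = []  # list of [script, substring]
    for ch in text:
        script = script_of(ch)
        last = regions[-1] if regions else None
        if last is not None and (
            script is None or last[0] is None or script == last[0]
        ):
            if last[0] is None:
                last[0] = script  # a neutral prefix adopts the first script seen
            last[1] += ch
        else:
            regions.append([script, ch])
    return [(script, chunk) for script, chunk in regions]


# "hindi" in Latin, Devanagari and Arabic-based script.
sample = "hindi \u0939\u093f\u0902\u0926\u0940 \u06c1\u0646\u062f\u06cc"
for script, chunk in split_by_script(sample):
    print(script, repr(chunk))
```

Identifying the language and the encoding of each
region is a separate problem that this sketch does
not address.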
What factors have influenced how you have chosen to implement multiple
scripts for a given language? As source language and/or as target language.
Examples of languages:
* Chinese (Traditional, Simplified, and handling Kanji)
"Kanji" is the term used for sino-japanese
ideograms. "Hanzi" is used for Chinese
characters. I read that only 1600 characters are
"simplified" (the traditional forms seem to be
"reused" in parallel with the simplified forms, so
that the 2 forms coexist for a sizable part of
these 1600) and that 2 different subsets (of
a total of about 80000 characters since the
origin) are used in mainland China and Taiwan. On
PCs, around 1990, Japanese versions of OSes stored
about 5000 kanji, and Chinese versions about
8000. More recent versions should have all that
exist in Unicode; I did not check.
--> Perhaps the best source of information
concerning these problems, and those of mutual
transliterations and transcriptions (in
particular of proper nouns), is Jack Halpern
(cjk.org).
Question: what do you mean by "handling Kanji"?
Did you find some interesting case where a Chinese
text contains specific kanji (I mean, characters
created in Japan and absent from the Chinese
character set)?
Perhaps there is also some need to handle texts in pinyin for MT?
* Bosnian/Croatian/Serbian (Latin alphabet, Cyrillic alphabet)
* Mongolian (Classic script, Cyrillic script)
Yes, these examples are interesting. Somebody
told me Mongolian also has a Chinese-based script.
Many languages from the Turkic family also have
2 or 3 scripts. Turkish itself was written in an
Arabic script before Ataturk. Azeri, Uzbek,
Tajik, etc. have Arabic, Cyrillic and
Latin-based scripts, and perhaps some languages of
Central Asia can also be written with hanzi.
Best regards,
Xan
Regards,
Jeff
_______________________________________________
Mt-list mailing list
--
-------------------------------------------------------------------------
Christian Boitet
(Pr. Université Joseph Fourier)
Groupe d'Etude pour la Traduction Automatique
et le Traitement Automatisé des Langues et de la Parole
G E T A L P
GETALP, LIG-campus, BP 53 (ex: GETA, CLIPS, IMAG-campus)
Tel: +33 (0)4 76 51 43 55 / 51 48 17 Fax: +33 (0)4 76 63 56 86
385, rue de la Bibliothèque Mel: [email protected]
38041 Grenoble Cedex 9, France