Hi Christian,

Thanks for the detailed reply.  This question came from a human translator who
gets requests for website translation into Traditional or Simplified Chinese and
wanted to know how MT systems (like Google) handle it.  I mentioned Kanji only
because the additional info out there listed it as a variant; that might be
inaccurate, and the right term would then be Hanzi.  The main point is how to
handle multiple scripts.  This is different from handling multiple
orthographies, because it is possible to create normalization scripts that
standardize multiple orthographies into a preferred one, and vice versa.
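
To illustrate what I mean by a normalization script, here is a minimal sketch
only.  The variant table (English spelling variants, chosen just because they
are easy to read) and the function names are made up for the example and do
not come from any real system.  It assumes the variants can be listed as simple
word-for-word pairs, which is roughly true for orthographies and is exactly
what breaks down with multiple scripts.

```python
import re

# Hypothetical variant table: non-preferred spelling -> preferred spelling.
PREFERRED = {
    "colour": "color",
    "organise": "organize",
}

# Matches runs of letters (Unicode-aware), excluding digits and underscores.
_WORD = re.compile(r"[^\W\d_]+")


def normalize(text, table=PREFERRED):
    """Replace every word found in `table` by its mapped form."""
    def swap(match):
        word = match.group(0)
        repl = table.get(word.lower())
        if repl is None:
            return word  # not a known variant: leave untouched
        # Keep an initial capital if the original word had one.
        return repl.capitalize() if word[0].isupper() else repl
    return _WORD.sub(swap, text)


def invert(table):
    """Build the reverse table (the "vice versa" direction).

    Only safe when the mapping is one-to-one, as assumed here.
    """
    return {preferred: variant for variant, preferred in table.items()}
```

Calling normalize(text) goes towards the preferred orthography, and
normalize(text, invert(PREFERRED)) goes back.  The reverse direction only works
because the table is one-to-one.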

There is likely no single answer, because the differences between the multiple
scripts for the several languages mentioned do not represent the same issue.
I am just asking how others have dealt with it.

Thanks also Vadim for your input.

How are others approaching it?

Jeff

Quoting Christian Boitet <[email protected]>:

> Dear all,                                     5/9/09
>
> interesting topic!
>
> At 14:47 +1000 5/09/09, Vadim Berman wrote:
> >Hi Jeff,
> >
> >An interesting question.
> >
> >In our system, we tend to store it in one
> >lexicon with different stylistic tags.
> >
> >Obviously, the grammar, etc. must be reused.
> >
> >Best regards,
> >Vadim
> >
> >----- Original Message ----- From: "Jeff Allen" <[email protected]>
> >To: <[email protected]>
> >Sent: Friday, September 04, 2009 11:54 PM
> >Subject: [Mt-list] MT for languages with multiple scripts
> >
> >>Hi listers,
> >>
> >>What is your experience with creating MT systems for
> >>languages with multiple scripts?
>
> Abbas Malik has been working exactly on that in our lab.
>
> Hindi-Urdu is an interesting case, because it is
> really the same language (earlier called
> hindustani), with lexical variants (there are
> pairs of terms of arabo-persian origin and
> sanskrit origin, to simplify), but 2 scripts.
> Urdu speakers usually don't read the nagari
> script, and hindi speakers don't read the
> arabic-based script, but they understand each
> other, no need of subtitles for Bollywood films.
>
> That pair is an instance of the very interesting
> subclass of pairs for which the translation
> problem is enormously reduced: strongly
> surface-related languages.
>
> In general, a sentence in L1 has thousands of
> acceptable translations (millions if they are
> long), where "acceptable" means "acceptable by
> professional translators or teachers of
> translation".
>
> That is the main reason why using BLEU, NIST,
> etc. to measure translation quality fails in
> practice (these measures don't correlate with
> professional judgments of translation quality, or
> only when they are very bad). However, they can
> be used with profit to measure the progress of an
> MT system towards a goal expressed by a set of
> reference translations (for expert, or
> hand-crafted MT systems, one then should include
> posteditions of MT output in the references).
>
> By contrast, if (L1, L2) is a pair of "strongly
> surface-related languages", a sentence in L1 has
> only 1 or a very small number of "exact
> translations" (and conversely), the other
> possible translations being clearly judged as
> paraphrases.  In this way, it is similar to the
> speech recognition problem.
>
> Concerning that class, one can speak of "the
> translation problem", meaning the "exact
> translation problem".
>
> >>- choose only one script
>
> That would be possible if the transformation from
> 1 script to the other were straightforward.
> Unfortunately, it is often a hard problem because
> of mutual underspecifications. For example, small
> vowels are not written in the urdu script, and
> some consonant-related distinctions are not made
> in the hindi script.
>
> >>- Separate system per script
> >>- Toggle between variants after system launch
>
> There are methods to divide a text into
> homogeneous regions corresponding to triples
> (language, script, encoding), but
>
> >>- etc
> >>
> >>What factors have influenced how you have chosen to implement multiple
> >>scripts for a given language?  As source language and/or as target
> >>language.
> >>
> >>Examples of languages:
> >>
> >>* Chinese (Traditional, Simplified, and handling Kanji)
>
> "Kanji" is the term used for sino-japanese
> ideograms. "Hanzi" is used for chinese
> characters. I read that only 1600 characters are
> "simplified" (the traditional forms seem to be
> "reused" in parallel with the simplified forms so
> that the 2 forms coexist for a sizable part of
> these 1600) and there are 2 different subsets (of
> a total of about 80000 characters since the
> origin) used on mainland China and Taiwan. On
> PCs, around 1990, Japanese versions of OS stored
> about 5000 kanji, and Chinese versions about
> 8000. More recent versions should have all that
> exist in Unicode, I did not check.
>
> --> Perhaps the best source of information
> concerning these problems, and those of mutual
> transliterations and transcriptions (in
> particular of proper nouns), is Jack Halpern
> (cjk.org).
>
> Question: what do you mean by "handling Kanji"?
> Did you find some interesting case where a Chinese
> text contains specific kanji (I mean, characters
> created in Japan and absent from the Chinese
> character set)?
>
> Perhaps there is also some need of handling texts in pinyin for MT?
>
> >>* Bosnian/Croatian/Serbian (Latin alphabet, Cyrillic alphabet)
> >>* Mongolian (Classic script, Cyrillic script)
>
> Yes, these examples are interesting. Somebody
> told me Mongolian also has a Chinese-based script.
>
> Many languages from the Turkish family also have
> 2 or 3 scripts. Turkish itself was written in an
> arabic script before Ataturk. Azeri, uzbek,
> tadjik, etc. have arabic, cyrillic and
> latin-based scripts, and perhaps some of Central
> Asia can also be written with hanzi.
>
> Best regards,
>
> Xan
>
> >>
> >>Regards,
> >>
> >>Jeff
> >>________
_______________________________________________
Mt-list mailing list
