
[Mt-list] pi-languages and pi-pairs of languages

Christian Boitet
Mon, 28 Aug 2006 08:42:21 -0700

Dear colleagues,                                                28/8/06

1) On the terms tau-, mu-, pi- languages and pairs of languages

The point is to CHARACTERIZE in an EXACT and NON-DEPRECATING way languages and pairs of languages for which there is a lack of computerized resources and tools that are used, or directly usable, in NLP applications concerning them.

By the way, I forgot to include "pi-pairs" in the previous e-mail, but they do exist.

A pi-pair of languages is a pair for which NLP-related data, resources and tools are lacking.

Berment also uses the terms:
- tau-language (pair) = well (totally / très bien) equipped
- mu-language  (pair) = medium (moyennement bien)  equipped

Example: while French and Thai are reasonably "NLP-equipped" (a tau-language and a mu-language, respectively), the 2 pairs FT and TF are not.

Example: Spanish is a tau-language, while Catalan and Galician are mu- or pi-languages. Yet the pairs SC and SG are tau-pairs, because there are 2 quite good MT systems translating newspapers for these pairs (Comprendium, using the METAL shell, see Proc. EAMT-05).

2) Other terms proposed and why they are not good terms for these concepts

The terms

* minority languages
* less-prevalent languages
* less(er) widely used languages
* less-dominant (non-dominant) languages
* traditionally oral/spoken/unwritten languages
* endangered languages
* indigenous languages
* neglected languages
* New Member State languages (used for the new languages of the European Union)

don't really say anything about the degree of "equipment" as far as computer applications are concerned, and many of them are deprecating in some way.

(I agree 100% with Jeff Allen on that!)

The terms

* sparse-data languages
* low-density languages

also don't fit:

- The idea that data are "sparse" means there ARE data, but in fragmentary and heterogeneous form. But pi-languages often have NO usable data or resources, even for simple applications such as hyphenation -- where are the "sparse data" for hyphenating Khmer?

- "Low-density" is even worse, as it can only mean that a language is spoken by a small fraction of the population in the place where it is spoken. But what can be the reference? A country? A region? -- At the extreme, almost any language is of high density in the families where it is spoken.


About the 2 other terms proposed:

* commercially disadvantaged/inhibited/challenged languages
* low market-value languages

These terms also miss the point above. A language may suddenly acquire a high market value (as Chinese has over the last 10-15 years), or lose some of it (e.g., Russian since 1991), but this is independent of the resources and tools existing for it. The reason is that these are often developed NOT in order to build commercial products. Why were Eurodicautom, Euramis and EuroParl developed?

When NLP firms discover that Malay/Indonesian can be commercially interesting, they will find there are quite a lot of resources for them, including a modern unified terminology (istilah). But if the same happens for Tagalog (or maybe for Swahili), they will find next to nothing usable to quickly build applications for them.


Best regards,

Ch.Boitet


At 1:44 +0200 26/08/06, Jeff Allen wrote:
 > > At 20:53 +0100 24/08/06, Francis Tyers wrote:
 > >... talking about Unicode issues for minority languages

Christian Boitet wrote:
 > Please don't use the term "minority language",
 > which is a bit deprecating and quite often plain
 > wrong, but rather the terms
 > - "pi-language" (poorly computerized language),
 > proposed by V.Berment in his PhD,
 > - or "under-resourced languages" (as in a WS at TALN-05).

Quoting Francis Tyers <[EMAIL PROTECTED]>:
 My apologies, no offence was intended. I'm not aware of any commonly
 accepted typology....

Fran has made a good point in stating that there is no commonly accepted
typology.  It seems to me that there are somewhere between 5-10 languages in
the world today which don't receive such a label, and the other 6,000 which are
constantly tagged with one or more of the following terms that are regularly
used in writings on this topic (including a lot of mine from 1991-1994 &
1997-2001):

* minority languages
* less-prevalent languages
* less(er) widely used languages
* sparse-data languages
* low-density languages
* neglected languages
* less-dominant (non-dominant) languages
* traditionally oral/spoken/unwritten languages
* endangered languages
* indigenous languages
* New Member State languages (used for the new languages of the European Union)

This set of terms (among others) actually covers a range of connotations, but it
really has little to do with linguistics, and more with factors and issues
related to economics, politics and related power struggles.

I would even suggest considering the following term:

* commercially disadvantaged/inhibited/challenged languages

Although the term "minority" language might be low on the acceptability scale
for some people, a possibly worse term (despite a reality it might reflect)
would be:

* low market-value languages

Best,

Jeff

================================
Jeff ALLEN, PhD, certified ISO 9001:2000 Quality Auditor
EMEA Director of Support & Professional Services
SYSTRAN S.A, Paris, France
[EMAIL PROTECTED]
http://www.systransoft.com
http://www.linkedin.com/in/jeffallen


