Hi,             15/1/07

At 18:47 -0400 25/08/06, Don Osborn wrote:
Thanks to all who have responded in this thread. I will follow up offline.

Re the question of terminology, and "minority" languages in particular, here
are a few quick thoughts (with apologies for taking this off on a tangent):

1. I hadn't thought of minority being offensive, but I guess we need to be
attentive to such matters. The main problem with the term I saw was its
imprecision. There was not long ago a project to compile information on
"minority" languages. To the surprise of a few people asked about it,
including me, Hausa was one of them (next to Swahili it supposedly has the
highest speakership of all African languages). But when we discussed it
further, the criteria indeed seemed to admit it: In "Hausaland" across much
of Niger and Nigeria it is the main language, but Hausaphones are minorities
elsewhere and it is spoken as a trade language by some people further away.
However, by extension, then, just about every other language in Africa is
"minority" as well. What capped it was discovering that Chinese also
qualified as a minority language - which it is in fact in many countries,
though we wouldn't think to call it, or Spanish or English, etc. As Francis
puts it, "situational" minority languages. But that just shows how dependent
the term is on context.

yes!

2. So people grope for an appropriate term. For more widely spoken
languages, "LWC" for "language of wider communication" emerged at some point
(rather like lingua francas, but let's not try to sort out the difference
between those two here). And at the other extreme there are "endangered
languages" about which, although definitions can vary, there is a generally
accepted sense of what it means (though even on that I've read references to
Igbo, a language spoken by somewhere on the order of 20 million people
described as "endangered" - but let's not delve into the issues there
either). But in between those two what do you say? "Small" languages as
shorthand for "less widely spoken languages" are more appropriately spoken
of as the latter - but that's too cumbersome. In Europe there was the term
"lesser-used languages" but with uncertain implications - fewer people speak
them, or those that do use them less, or both? "Local" languages is one that
I've tried to avoid lately because it seems to me to be used in a way that
reduces the languages' status, and is applied only in some parts of the world
(and what of "local" when you have, say, Wolof-speaking merchants in New
York and Paris, for instance?). In Francophone countries the term "langue
partenaire" has been coined, but that raises questions of what kind of
partnership, and who's partner with whom and why and so on.

3. A lot depends of course on context. "Under-resourced languages" is very
descriptive for ICT contexts and even some traditional technologies (e.g.,
no textbooks in so many less-widely-spoken-languages for the better part of
the past century - now that's under-resourced). But maybe not in demographic
or sociolinguistic contexts. Just for an example, Fula definitely is "under
resourced" in the technical and monetary sense, but definitely not
linguistically (e.g., its lexicon is staggering - there's a large dictionary
of the roots alone). "Less commonly taught languages" (LCTLs) is purely an
academic reference. "Pi-language" is a new one on me but seems to be mainly
a technical reference (pi=poorly informatisées or what?).

- "pi-language" (pi for "poorly equipped" or "having poor [NLP] resources") is a
term coined by Vincent Berment in his PhD (2004) on the computerization of groups of pi-languages, with applications to Lao, Khmer, Bengali, Thai...

- "under-resourced language" was introduced in 2005 by a native English-speaking colleague, whose name I can't recall at the moment, to translate the title and description of a workshop dedicated to NLP for pi-languages at TALN-05.


4. I ran into this problem personally when I wanted a way to refer to a very
wide class of languages not counting the LWCs as LWCs, and came up with an
acronym that I think covers the intended field and is in itself
"constructively ambiguous": MINEL - where M is maternal (which is every
language, but here the emphasis is on this role as opposed to the 2nd
language role) or minority (sorry!); I is indigenous (which also can mean
anything, but here meant in the sense of languages of "indigenous peoples");
N is "national" which is an appellation more common in Francophone countries
especially in Africa and is *not* the same as official; E is "endangered,"
or "ethnic" which one will hear with regard to languages in some parts of
the world (funny that a language might be referred to as ethnic and not
indigenous or vice versa, but the criteria for the distinction are arguable);
and L could be "less-widely-spoken" or even "local" or, well, language.

That about runs the gamut, from what I have. Hope all have a good weekend
(some of you are in the midst of it and others just starting, and some of us
will work through it either way!).

Don Osborn


The term "minority languages" is not only offensive, it is plain wrong. To call Malayalam or Bengali "minority languages" because they are languages of minorities in England implies total disregard for the fact that they are "majority" languages in their states.

I understand this discussion is taking place among NLP researchers. In this context, I think we should use terms which correspond to our problems. For our work, or research, the fact that a language is or is not widely spoken and/or written is of no intrinsic importance. What is important is rather whether
- large dictionaries and corpora are available
- a large quantity of new texts / speeches is produced, and available
- same for bilingual dictionaries and parallel texts (here we may introduce the term "pi-pair" for pairs such as French-Thai, where the 2 languages are "rho-languages", rho for "richly equipped").

That does not correlate with population size (as native language) or usage area (as vernacular language).
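
As a rough illustration of the resource-based criteria above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names (Resources, equipment_level), the numeric cut-offs, and the sample figures are all hypothetical, and the rule "all criteria met = rho, none = pi, otherwise mu" is just one possible reading, not an established definition. Note that speaker population is deliberately not an input.

```python
from dataclasses import dataclass

# Hypothetical cut-offs, chosen only for illustration.
MIN_DICTIONARY_ENTRIES = 50_000
MIN_CORPUS_WORDS = 10_000_000
MIN_NEW_WORDS_PER_YEAR = 1_000_000


@dataclass
class Resources:
    """Resources available for one language (no speaker count on purpose)."""
    dictionary_entries: int
    corpus_words: int
    new_words_per_year: int


def equipment_level(r: Resources) -> str:
    """Return 'rho', 'mu' or 'pi' from the number of criteria that are met."""
    met = sum([
        r.dictionary_entries >= MIN_DICTIONARY_ENTRIES,
        r.corpus_words >= MIN_CORPUS_WORDS,
        r.new_words_per_year >= MIN_NEW_WORDS_PER_YEAR,
    ])
    if met == 3:
        return "rho"   # richly equipped
    if met == 0:
        return "pi"    # poorly equipped
    return "mu"        # medium-equipped


# Made-up figures, not measurements of any real language.
well_equipped = Resources(200_000, 500_000_000, 50_000_000)
no_new_text = Resources(100_000, 20_000_000, 0)
poorly_equipped = Resources(5_000, 200_000, 10_000)

for name, res in [("well_equipped", well_equipped),
                  ("no_new_text", no_new_text),
                  ("poorly_equipped", poorly_equipped)]:
    print(name, equipment_level(res))
```

The second sample (dictionaries and corpora available, but no new text being produced) comes out as "mu" under these assumed cut-offs.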

For example, Japan produces 7 or 8% of all scientific literature, more of it in Japanese than in English: not everything is translated into English, while a lot of original English scientific and technical literature is translated into Japanese cover to cover. Yet the Japanese-speaking "base" is smaller than that of Malay-Indonesian.

By contrast, Bengali has about 300M native speakers, but tools to use Bengali under Office have only recently started to appear (BanglaWord by V.Berment in 2004), while those for Lao were created (by V.Berment and others) from 1996 on.

Another case: Latin is not spoken any more, but it may be called a rho-language, as all the modern vocabulary is created in Latin by scholars at the Vatican, and all new names of species etc. are still directly created in Latin. It also has corpora and grammars. Or, because there are no large new corpora, we might call it a "mu-language" (medium-equipped?): certainly, the size of available parallel Latin-Lx corpora is not enough for building good statistical MT systems, but it is still possible to build good RBMT systems for Latin, if desired!



Last bit of thought: we should be more precise about the term "language".
For example, "Chinese" is not 1 language, but several (no oral intercomprehension), with their dialects: Mandarin, Cantonese, Wu (Shanghainese is a dialect of Wu), Fujian, etc. Or Arabic, for that matter. And that is quite important for NLP. For instance, a morphological analyzer for Literary or Standard Arabic is almost useless for Iraqi. The US now has more "resources" for Iraqi than for Standard Arabic...

Concerning African languages, I was told the situation is the same, with many dialects and sometimes different writing systems (missionaries from various countries and denominations created them almost independently). For NLP systems to be really useful, then, they must be "tuned" to these variants. Also, we should increasingly take into account that, although all are technically "written", their use is mostly oral, and their speakers rarely write or read in them. The unity provided by a common script is then somewhat destroyed: systems have to become increasingly "directly oral".

With that trend, an important question has become how to adapt/reuse resources and tools from 1 "rho-" or "mu-" language to a variant which is still very much "pi-"!

Best regards,

Ch.Boitet
--
-------------------------------------------------------------------------
Christian Boitet
(Pr. Université Joseph Fourier)        Tel: +33 (0)4 76 51 43 55/48 17
GETA, CLIPS, IMAG-campus, BP53         Fax: +33 (0)4 76 44 66 75/51 44 05
385, rue de la Bibliothèque            Mel: [EMAIL PROTECTED]
38041 Grenoble Cedex 9, France

_______________________________________________
Mt-list mailing list
