Merci Christian, replies to the last part of your message below...

...

> Last bit of thought: we should be more precise about the term "language".
> For example, "Chinese" is not 1 language, but several (no oral
> intercomprehension), with their dialects: Mandarin, Cantonese, Wu
> (Shanghainese is a dialect of Wu), Fujien, etc. Or Arabic, for that matter.
> And that is quite important for NLP.
> For instance, a morphological analyzer for Literal or Standard Arabic
> is almost useless for Iraki. The US now have more "resources" for Iraki
> than for Standard Arabic...
In the broad sense it may not be possible to have a consistent definition of "language" that is applicable to all uses. Ethnologue has a consistent approach, though very clearly a "splitter" one. That may be more useful in some areas (perhaps MT?) than in others (like software localization).

There is not, AFAIK, an index of "languageness": that is, how "independent" a language is, whether it exists as a variant very close to one or more others, whether it or another is a standard for a wider range of uses than the other closely related tongues, etc. (Incidentally, a low languageness index, if there were such a thing, might indicate potential use of certain kinds of MT ["shallow transfer" models, as far as I understand the term] among the related languages.)

We get into some interesting and complex areas with situations like Chinese and Arabic as you describe. What do you call a language that is pretty much the same when written, but consists of different languages when spoken? Or when a standard/written form coexists with related colloquial/spoken forms?

> Concerning African languages, I was told the situation is the same,
> with many dialects and sometimes different writing systems
> (missionaries from various countries and confessions created them
> almost independently). For NLP systems to be really useful, then, they
> must be "tuned" to these variants. Also, we should more and more take
> into account that, although all are technically "written", their use is
> mostly oral, and their speakers rarely write or read in them.
> Then, unity provided by a common script is somewhat destroyed: systems
> have to become increasingly "directly oral".

I'm actually (re)writing something that touches on these issues.
Writing systems are sometimes multiple for the same tongue, but nowadays that might be due to differences in country language policies (as they pertain to orthographies within their borders, borders that very often split language communities). There are also, as you mention, sometimes legacies of divergent missionary approaches (an example is the set of orthographies for Twi Ashanti, Twi Akuapem, and Fanti in Ghana). The situation, though, is often dynamic and changing, which is both good and bad news: good in that standardized or unified forms benefit wider use, but bad for NLP or localization while the forms are in flux because the transition is not complete or not completely adopted.

Your mention of possible "directly oral" approaches is very much on target. I do, however, see this as a family of technologies including audio-based applications, speech <-> text transformation software, and of course computer translation programs (MT, translation memory). The overall objective would be to make the transition among languages and forms of expression more seamless.

I've had discussions where the notion of a written + "neo-oral" culture in Africa has been mentioned. That's talking big and vague, but as far as I can see from the technology, there is a lot that can be done in that direction. The bottom line is how all these wonderful things ICT can do can be made to accommodate situations where there are many languages, often with oral traditions, easy codeswitching by speakers, still low literacy/pluriliteracy rates, and more access to cellphones than to computers.

> With that trend, an important question has become how to adapt/reuse
> resources and tools from 1 "rho-" or "mu-" language to a variant which
> is still very much "pi-"!

This is true. Actually I have been looking at this system and others to help categorize our working list of priority languages (and language groups/clusters) at
http://www.panafril10n.org/wikidoc/pmwiki.php/PanAfrLoc/MajorLanguages
(some experiments offline).
The idea is to develop larger strategies for languages, in which one could divide such a priority list into areas for attention and support. I also hope that in the case of languages in Africa it will be possible to develop some novel approaches in developing applications, not only adapting and reusing what is created elsewhere.

Don Osborn
Bisharat.net
PanAfrican Localisation project

> Best regards,
>
> Ch.Boitet
> --
> -------------------------------------------------------------------------
> Christian Boitet
> (Pr. Universite' Joseph Fourier)   Tel: +33 (0)4 76 51 43 55/48 17
> GETA, CLIPS, IMAG-campus, BP53     Fax: +33 (0)4 76 44 66 75/51 44 05
> 385, rue de la Bibliothe`que       Mel: [EMAIL PROTECTED]
> 38041 Grenoble Cedex 9, France

_______________________________________________
Mt-list mailing list