Merci Christian, replies to the last part of your message below...

> Last bit of thought: we should be more precise about the term
> "language".
> For example, "Chinese" is not 1 language, but several (no oral
> intercomprehension), with their
> dialects: Mandarin, Cantonese, Wu (Shanghainese is a dialect of Wu),
> Fujien, etc. Or Arabic, for that matter. And that is quite important
> for NLP.
> For instance, a morphological analyzer for Literal or Standard Arabic
> is almost useless for Iraqi. The US now have more "resources" for Iraqi
> than for Standard Arabic...

In the broad sense it may not be possible to have a consistent definition of
"language" that is applicable to all uses. Ethnologue has a consistent
approach, though very clearly a "splitter" one. That may be more useful in
some areas (perhaps MT?) than others (like software localization). There is
not, AFAIK, an index of "languageness," that is, how "independent" a language
is, or whether it exists as a variant very close to one or more others,
whether it or another is a standard for a wider range of uses than the other
closely-related tongues, etc. (Incidentally a low languageness index, if
there were such a thing, might indicate potential use of certain kinds of MT
["shallow transfer" models, as far as I understand the term] among the
related languages.)

We get into some interesting and complex areas with situations like Chinese
and Arabic as you describe - what do you call a language that is pretty much
the same written, but different languages spoken? Or when there is
coexistence of related standard/written form and colloquial/spoken forms?

> Concerning African languages, I was told the situation is the same,
> with many dialects and sometimes different writing systems
> (missionaries from various countries and confessions created them
> almost independently). For NLP systems to be really useful, then, they
> must be "tuned" to these variants. Also, we should more and more take
> into account that, although all are technically "written", their use is
> mostly oral, and their speakers rarely write or read in them.
> Then, unity provided by a common script is somewhat destroyed: systems
> have to become increasingly "directly oral".

I'm actually (re)writing something that touches on these issues. Writing
systems are sometimes multiple for the same tongue, but nowadays that might
be due to differences in country language policies (as pertain to
orthographies within their borders - borders that very often split language
communities); there are also as you mention sometimes legacies of divergent
missionary approaches (for example, the orthographies for Twi Ashanti, Twi
Akuapem, and Fanti in Ghana). The situation, though, is often dynamic and
changing, which is both good and bad news: good in that standardized or
unified forms benefit wider use, but bad for NLP or localization while they
are in flux because the transition is not yet complete or fully adopted.

Your mention of possible "directly oral" approaches is very much on target.
I do, however, see this as a family of technologies including audio-based
applications, speech <-> text conversion software, and of course
computer translation programs (MT, translation memory). The overall aim would
be to make the transition among languages and forms of expression more
seamless. I've had discussions where the notion of written + "neo-oral"
culture in Africa has been mentioned. That's talking big & vague, but as far
as I see from the technology anyway, there is a lot that can be done in that

The bottom line is how all these wonderful things ICT can do can be made to
accommodate situations where there are many languages, often with oral
traditions, easy codeswitching by speakers, still-low literacy/pluriliteracy
rates, and more access to cellphones than computers.

> With that trend, an important question has become how to adapt/reuse
> resources and tools from 1 "rho-" or "mu-" language to a variant which
> is still very much "pi-"!

This is true. Actually I have been looking at this system and others to help
categorize our working list of priority languages (and language
groups/clusters) at (some
experiments offline). The idea being larger strategies for languages in
which one could divide such a priority list into areas for attention and

I also hope that in the case of languages in Africa it will be possible to
develop some novel approaches in developing applications, not only adapting
& reusing what is created elsewhere.

Don Osborn
PanAfrican Localisation project

