Peter_Constable
Tue, 12 Sep 2000 20:57:36 -0700
On 09/12/2000 08:08:14 PM "Christopher J. Fynn" wrote:

>I'm not qualified to judge the merits of one list over another
>but there certainly are other comprehensive and well researched
>lists e.g. the Linguasphere Register of the World's Languages
>and Speech Communities see: http://www.linguasphere.org/
>
>Unfortunately their list is not available online, you have to buy
>the book - a bit like ISO/IEC 10646 and many other standards
>:-)
>
>I do know that the way the compilers of the Linguasphere have
>classified languages and dialects is different than the way the
>compilers of the Ethnologue have - though I'm sure both could
>give you well reasoned arguments why their scheme is better
>or more useful than the other.

I think the Linguasphere is a valuable publication, and the only alternative I'm aware of that is a contender in place of the Ethnologue. My concerns about it are:

- As Chris mentioned, the info isn't available online. I consider the availability of online documentation to back up a set of codes to be essential. Otherwise, there is no easy way for users to find out what things mean.

- The Linguasphere uses a hierarchical system that begins with 10 divisions in each of 10 major regions. This was done specifically to avoid questions about higher-level genetic relationships, but the divisions end up being rather arbitrary. The languages of the world do not in fact neatly divide into 10 major groups in each of 10 major regions.

- There is a multi-level hierarchy that begins at levels above what the Ethnologue considers to be a language, and goes below that level. There is no certainty that a category at a given level in one place within the Linguasphere catalog represents exactly the same kind of object as categories at the same level elsewhere in the catalog. Also, it is not clear which of these levels are or are not useful for the purposes of language-specific processing.
In contrast, it is our experience that the categories reflected in the Ethnologue are the most generally useful for language-specific processing. There are some exceptions to this (e.g. Murray Sargent pointed out that there are regional-variant spelling conventions for English), but these are the exception rather than the norm. Note also that something like spell checking involves a *paralinguistic* notion, viz. spelling/orthographic conventions, rather than the notion of *language* itself.

There are clearly cases of language-specific processing which will need to rely on some paralinguistic notion such as "spelling/orthographic convention" or "writing system". First, this area is not yet well enough understood to come up with comprehensive enumerations of identifiers for these various purposes. Second, identifiers that are appropriate for these purposes will generally build from a set of *language* identifiers as a starting point. (E.g. if you're going to enumerate writing systems, you'll need to begin with an enumeration of languages.) As Rick responded to Murray, Ethnologue codes don't solve all problems, but they do give us a comprehensive list of modern languages that represents a good starting point from which to work.

So, for these three reasons, I don't think the Linguasphere is as good a choice for language identifiers for IT purposes. It would be useful for documenting what identifiers within some system of identifiers denote, except that the information is not available online.

Some are of the opinion that a hierarchical system is needed. A few people at IUC17 commented that Ethnologue codes should be supplemented in this way. Two comments:

1. Someone pointed out during the discussion time that there are many possible alternate hierarchies based on orthogonal factors (e.g. inferred genetic relationship, historical connections, geographic proximity, linguistic similarity, related writing traditions, ...).
It would be impossible to have a single hierarchy that does all of this. (One further comment about Linguasphere: I haven't read all of the introductory material, but there is an indication that the choice was made to *not* base the hierarchy on inferred genetic relationships since this was not considered relevant for understanding the current socio-linguistic settings of language communities. That raises the question of just what basis Linguasphere's hierarchy *is* built on - it's not clear to me what this is.)

2. I don't think there is a clear understanding of what purposes hierarchical categories would serve. Certainly a hierarchical, non-leaf-node category can be useful for subject indexing (e.g. to find any materials about Uto-Aztecan languages), but I don't think it's clear what other useful purpose such a category would serve. I think it would be better that identifiers for subject catalogs *not* get mixed up with identifiers for language-specific information processing. In general, non-leaf-node categories (such as Uto-Aztecan) are not useful for language-specific processing. E.g. if all you know about the language of an information object is that it is some Uto-Aztecan language, then you don't have enough information to successfully spell-check.

A comprehensive list of leaf-node identifiers clearly would be useful, however. We should begin by adopting such a set, and revisit the issue of hierarchical non-leaf-node identifiers after their usefulness is better understood.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
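
P.S. To make the leaf-node point above concrete, here is a minimal Python sketch. It assumes a toy parent table and toy word lists. Every identifier in it ("hopi", "comanche", "uto-aztecan") and every function name is hypothetical and chosen only for illustration; none of them are actual Ethnologue or Linguasphere codes. It is meant to show a category identifier serving a subject-index lookup while giving a spell-checker nothing to work with.

```python
# Toy data. All identifiers are hypothetical placeholders, not real
# Ethnologue or Linguasphere codes, and the word lists are dummy tokens.
PARENT = {"hopi": "uto-aztecan", "comanche": "uto-aztecan"}
WORD_LISTS = {"hopi": {"aaa", "bbb"}, "comanche": {"ccc", "ddd"}}


def subject_matches(language_id, category_id):
    """Subject indexing: is the language the category itself or under it?"""
    node = language_id
    while node is not None:
        if node == category_id:
            return True
        node = PARENT.get(node)  # None once we run past the top of the table
    return False


def misspelled(text, language_id):
    """Spell checking: return the words not found in the language's word list.

    Only leaf-node (individual language) identifiers have a word list,
    so a category identifier raises LookupError.
    """
    if language_id not in WORD_LISTS:
        raise LookupError(
            f"no word list for {language_id!r}; a specific language is needed"
        )
    known = WORD_LISTS[language_id]
    return [word for word in text.split() if word not in known]
```

With this toy data, subject_matches("hopi", "uto-aztecan") is true, and misspelled("aaa zzz", "hopi") returns ["zzz"], but misspelled("aaa zzz", "uto-aztecan") raises LookupError, since knowing only the category does not select any word list.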