Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Otto Stolz

On 2000-09-12 at 17:43 UTC, Peter Constable wrote:
 ISO 639 codes were primarily intended for bibliography purposes.
 Gary and I point out in our paper that the needs of that sector do
 not necessarily correspond to the general needs of IT, particularly
 for language-specific processing. [...] For example, if all you know
 about the language of some information object is that it is an Athapascan
 language, you can't spell-check that information. The intro to ISO 639
 claims that the standard is intending to serve the needs of a variety of
 sectors, but in its current state it is failing to adequately serve some.
...
 Furthermore, we would contend that the categories enumerated in the
 Ethnologue by-and-large *are* the categories that need to be identified for
 general IT purposes. In the majority of cases, the distinctions made are
 those that would be needed to successfully spell-check, for example. (We
 acknowledge that that is not true in all cases; for example, Chinese
 spelling would cross multiple languages; and alternate English spellings
 are needed for what would generally be considered one language. But these
 are the exceptions, not the norm.)

For many language-specific IT processes involving written language,
such as spell-checking, hyphenating, transliterating (e. g. to Braille),
or audible rendering, it is not enough to know which language you are
dealing with: you also need information about the orthography used.

Orthography is subject to change over time, and sometimes several
orthographies for the same language co-exist, e. g. during transition
periods or in neighbouring countries.

For example,
- German orthography was reformed in 1996; currently, two orthographies
  are legal (e. g. accepted in school assignments): the old one,
  established in 1902, valid until 2005-07-31, and the new one, effective
  since 1998-08-01; cf. (in German)
  http://www.ids-mannheim.de/reform/zeitafel.html (time schedule),
  http://www.ids-mannheim.de/pub/sprachreport/sr98-extra.pdf (tutorial),
  and http://www.ids-mannheim.de/grammis/reform/inhalt.html (rules);
- France had an orthographic reform for French, in 1991;
- the Dutch spelling reform of 1934 was enacted in 1943 in Belgium,
  and in 1947 in the Netherlands; Dutch spelling was again (marginally)
  reformed in 1995, effective since 1996-08-01;
- Norwegian spelling was reformed in 1907, 1917, and 1938;
- Danish in 1948;
- Spanish in 1910, and again in 1852/55;
- Greek in 1982;
to name just a few. The co-existence of en_US and en_UK has already been
mentioned in this thread.

Hence, I plead for a tagging system that can represent these
differences. Currently, all of my WWW pages contain the line:
  <HTML LANG=de><!--neue Rechtschreibung-->
I would prefer to incorporate the comment into the tag, as in
the hypothetical:
  <HTML LANG=de-sp1996>
and likewise for other languages, and other applications.

Note that this issue is orthogonal to the country code of RFC 1766.
E. g., each of de-AT, de-CH, and de-DE could be spelled either the 1902
or the 1996 way. Hence, the spelling subtag and the country subtag
should both be optional, independent of each other.
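
The independence of the two subtags can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
"spNNNN" spelling subtag is the hypothetical form suggested above, not
part of RFC 1766, and the names LangTag and parse_tag are invented for
this example.

```python
import re
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str            # e.g. "de"
    country: Optional[str]   # e.g. "AT"; optional
    spelling: Optional[str]  # hypothetical subtag, e.g. "sp1996"; optional


# Language, then an optional two-letter country, then an optional
# hypothetical spelling subtag of the form "sp" + four-digit year.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<country>[A-Za-z]{2}))?"
    r"(?:-(?P<spelling>[Ss][Pp]\d{4}))?$"
)


def parse_tag(tag: str) -> LangTag:
    """Split a tag into language, country, and spelling parts."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    country = match.group("country")
    spelling = match.group("spelling")
    return LangTag(
        language=match.group("language").lower(),
        country=country.upper() if country else None,
        spelling=spelling.lower() if spelling else None,
    )


if __name__ == "__main__":
    for tag in ("de", "de-CH", "de-sp1996", "de-AT-sp1902"):
        print(tag, "->", parse_tag(tag))
```

Either subtag may be present or absent without affecting the other, so
de-sp1996 and de-AT-sp1902 are both accepted by this sketch.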

I think the Ethnologue lacks information about variant orthographies.
(I last looked at it a couple of months ago.) Both RFC 1766 and
ISO 639 ignore the issue of variant orthographies.

Best wishes,
   Otto Stolz



Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Rick McGowan

Otto Stolz wrote:

 I think the Ethnologue lacks information about variant orthographies.

Yes, it does.  But that's OK, because we can make a composite tagging system that tags 
orthography separately from language.

So... does anyone have a comprehensive list of orthographies?

Rick


 


Re: Tagging orthographic systems (was: (iso639.186) the Ethnologue)

2000-09-13 Thread Peter_Constable


On 09/13/2000 09:09:12 AM Otto Stolz wrote:

 For many language-specific IT processes involving written language,
 such as spell-checking, hyphenating, transliterating (e. g. to Braille),
 or audible rendering, it is not enough to know which language you are
 dealing with: you also need information about the orthography used.

I *entirely* agree. But let us understand two points:

1. Orthography is not the only paralinguistic notion that IT processes
depend upon.

2. Except in a small number of cases, every category in a list of languages
will map to one or more categories in a list of writing systems (excluding
unwritten languages). In other words, the list of writing systems is a
finer enumeration than the list of languages. What that means is that, in
order to arrive at a comprehensive list of writing systems, you're going to
need a comprehensive list of languages anyway.
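
Point 2 can be illustrated with a small Python sketch. It assumes a
writing system is recorded as a (language, orthography) pair; the
"spNNNN" labels and the function name group_by_language are
hypothetical and serve only as an example.

```python
from typing import Dict, List, Tuple

# Hypothetical writing-system list: (language code, orthography label).
# The "spNNNN" labels are illustrative only, not registered identifiers.
WRITING_SYSTEMS: List[Tuple[str, str]] = [
    ("de", "sp1902"),
    ("de", "sp1996"),
    ("da", "sp1948"),
    ("el", "sp1982"),
]


def group_by_language(
    systems: List[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each language to the orthographies listed for it."""
    grouped: Dict[str, List[str]] = {}
    for language, orthography in systems:
        grouped.setdefault(language, []).append(orthography)
    return grouped


if __name__ == "__main__":
    languages = group_by_language(WRITING_SYSTEMS)
    # The language list is a projection of the writing-system list.
    print(sorted(languages))
    print(languages["de"])
```

Every entry in the finer list names its language, which is why a
comprehensive list of writing systems presupposes a comprehensive list
of languages.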



 Note that this issue is orthogonal to the country code of RFC 1766.
 E. g., each of de-AT, de-CH, and de-DE could be spelled either the 1902
 or the 1996 way. Hence, the spelling subtag and the country subtag
 should both be optional, independent of each other.

I would agree.


 I think the Ethnologue lacks information about variant orthographies.
 (I last looked at it a couple of months ago.) Both RFC 1766 and
 ISO 639 ignore the issue of variant orthographies.

True.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]