Comment on:

  Locale name guideline [Public Review Draft 2001-05-31]
  http://www.li18nux.org/docs/text/locale-name-20010531.txt

>     The following is an example of standard values for the CODESET field.
>
>          "UTF-8",
>          "ISO-8859-1", "ISO-8859-2", "ISO-8859-5", "ISO-8859-7",
>          "ISO-8859-9", "ISO-8859-13", "ISO-8859-15",
>          "GB-2312", "GB-18030", "EUC-KR", "EUC-JP", "EUC-TW"
>          "IBM-943", "MS-932", and "TCA-BIG5"

Proposal:

I suggest that the list of standardised encoding names be limited to
*exactly* the following 19 character encoding names:

 "UTF-8",
 "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-5", "ISO-8859-6",
 "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15",
 "EUC-JP", "EUC-KR", "GB2312",
 "KOI8-R", "KOI8-U", "VISCII",
 "WINDOWS-1251", "WINDOWS-1256"

The defining documents of all these are listed in

  http://www.iana.org/assignments/character-sets

under exactly these encoding names, which also happen to be the exact
preferred MIME charset names. (Note that for the ISO-8859-6 and
ISO-8859-8 encodings, RFC 1556 implies "visual directionality", so you
must use UTF-8 if you want implicit or explicit directionality!)
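To make the proposal concrete, here is a minimal sketch of how a tool might check the CODESET field of a locale name against the 19 names above. The helper names (codeset_of, has_proposed_codeset) are hypothetical, the locale name is assumed to follow the usual language_TERRITORY.CODESET@modifier layout, and the comparison is deliberately exact and case-sensitive, which is my reading of "exactly these encoding names" rather than anything the draft prescribes.

```python
# Hypothetical helper, not part of the guideline: checks whether the
# CODESET field of a locale name is one of the 19 proposed names.

# The 19 names, spelled exactly as in the proposal above.
PROPOSED_CODESETS = frozenset({
    "UTF-8",
    "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-5", "ISO-8859-6",
    "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15",
    "EUC-JP", "EUC-KR", "GB2312",
    "KOI8-R", "KOI8-U", "VISCII",
    "WINDOWS-1251", "WINDOWS-1256",
})


def codeset_of(locale_name):
    """Return the CODESET field of a locale name, or None if it has none.

    Assumes the layout language_TERRITORY.CODESET@modifier.
    """
    base = locale_name.split("@", 1)[0]   # drop an optional @modifier
    if "." not in base:
        return None                       # e.g. "C" or "de_DE"
    return base.split(".", 1)[1]


def has_proposed_codeset(locale_name):
    """True only if the CODESET is spelled exactly like a proposed name."""
    return codeset_of(locale_name) in PROPOSED_CODESETS


if __name__ == "__main__":
    for name in ("ja_JP.EUC-JP", "de_DE.ISO-8859-15@euro", "zh_TW.TCA-BIG5"):
        print(name, has_proposed_codeset(name))
```

With an exact match, a spelling such as "en_US.utf8" is rejected as well; whether implementations should accept such variants is a separate question that this sketch does not try to answer.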

That is a nice, finite, safe and manageable set. I do not think I have
forgotten any encoding that passes the following exclusion criteria.

Do not include:

  a) character encodings that contain ASCII bytes in non-ASCII multibyte
     sequences (so BIG5, GB 18030 and SJIS do not qualify, I'm afraid; you
     really should use UTF-8 or EUC-* instead)

  b) character encodings that are not listed in the IANA registry
     under the proposed name as the preferred MIME name (so EUC-TW
     as described in Ken Lunde's book is not qualified unfortunately,
     and EUC-CN has to be called GB2312)

  c) character encodings that are used daily as locales on POSIX systems
     by fewer than 50 people on planet earth (UTF-8 actually starts to
     qualify here !!! ;-)

  d) character encodings other than UTF-8 with combining characters
     (so TIS-620 is not qualified)

  e) character encodings other than UTF-8 with ligature substitution
     requirements
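Criterion a) is easy to illustrate. The following is only a sketch, using the codecs that ship with Python (big5, shift_jis, gb18030, euc_jp, utf-8) as stand-ins for the encodings named above; the function name is made up for this example. It collects every byte below 0x80 that appears inside the encoded form of a non-ASCII character.

```python
# Hypothetical illustration of criterion a): find ASCII-range bytes that
# occur inside the multibyte encoding of non-ASCII characters.

def ascii_bytes_in_multibyte(text, encoding):
    """Return the bytes < 0x80 found in the encodings of non-ASCII chars."""
    found = []
    for ch in text:
        if ord(ch) < 0x80:
            continue                      # plain ASCII character, ignore
        encoded = ch.encode(encoding)
        found.extend(b for b in encoded if b < 0x80)
    return found


if __name__ == "__main__":
    # U+529F is A5 5C in Big5: the second byte is an ASCII backslash.
    print(ascii_bytes_in_multibyte("\u529f", "big5"))       # [92]
    # U+30BD is 83 5C in Shift_JIS: again a backslash as second byte.
    print(ascii_bytes_in_multibyte("\u30bd", "shift_jis"))  # [92]
    # The same characters in EUC-JP and UTF-8 use only bytes >= 0x80.
    print(ascii_bytes_in_multibyte("\u30bd", "euc_jp"))     # []
    print(ascii_bytes_in_multibyte("\u529f\u30bd", "utf-8"))  # []
```

A byte-oriented program that scans for "\" or "/" will misfire on the first two cases, which is the reason for excluding such encodings. GB 18030 has the same property in its four-byte sequences, whose second and fourth bytes are ASCII digits.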

Character encodings in categories d)-e) are practically unusable with
most POSIX applications today anyway, so we serve nobody by encouraging
support for them. Software that knows about bidirectionality, combining
characters and ligature substitution can today really be expected to
know about UTF-8 as well.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
