Standardized encoding names for iconv_open()

Markus Kuhn Wed, 19 May 2004 09:17:12 -0700

Bruno Haible wrote on 2004-05-19 10:45 UTC:
> Tomohiro KUBOTA wrote:
> > iconv_t ic = iconv_open("UTF-8",nl_langinfo(CODESET));
> 
> When you use GNU libc or GNU libiconv but your platform lacks
> nl_langinfo(CODESET) (like for example FreeBSD 4), then you can use
> the "" alias instead. It has the same meaning: the locale dependent char*
> encoding:
> 
> iconv_t ic = iconv_open("UTF-8","");


This convention of interpreting "" given as an argument to iconv_open()
as the multi-byte encoding specified by the current LC_CTYPE locale
sounds to me useful enough to actually be added to the POSIX definition
of iconv_open()!
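
To make the convention concrete, here is a minimal sketch of how an
application might use it today. It assumes the "" alias is available
(GNU libc or GNU libiconv, as Bruno describes) and falls back to
nl_langinfo(CODESET) where it is not. The helper names
open_locale_to_utf8() and locale_to_utf8() are made up for this
illustration and are not part of any library.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: open a converter from the locale's multi-byte
 * encoding to UTF-8. The "" alias is tried first; it is an extension,
 * not something POSIX guarantees, hence the CODESET fallback. */
static iconv_t open_locale_to_utf8(void)
{
    iconv_t cd = iconv_open("UTF-8", "");
    if (cd == (iconv_t)-1)
        cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    return cd;
}

/* Hypothetical helper: convert the NUL-terminated string `in` from the
 * locale encoding to UTF-8. Returns the number of bytes written to
 * `out` (excluding the terminating NUL), or (size_t)-1 on failure. */
static size_t locale_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() wants a non-const char ** */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;       /* keep one byte for the NUL */

    cd = open_locale_to_utf8();
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;       /* invalid input or buffer too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

A real program would call setlocale(LC_CTYPE, "") at startup. Without
that call the process stays in the "C" locale and the "locale encoding"
is plain ASCII.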

In general, the POSIX definition of iconv_open() would become *much*
more useful if it actually specified a couple of encoding strings and
what exactly they mean. At the moment

  http://www.opengroup.org/onlinepubs/009695399/functions/iconv_open.html

merely states that

  Settings of fromcode and tocode and their permitted combinations
  are implementation-defined.

which means that portable applications can't use iconv_open() for
converting any of the encodings widely used in MIME body parts, for
example.
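
In practice this means a portable program can only probe at run time
whether a given pair of names is accepted. The following is a rough
sketch of such a probe. The function name conversion_supported() is
invented here, and the check relies on iconv_open() returning
(iconv_t)-1 with errno set to EINVAL for an unsupported conversion,
which is the one thing POSIX does specify.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper: returns 1 if this implementation accepts the
 * tocode/fromcode pair, 0 if it reports the conversion as unsupported
 * (EINVAL), and -1 on any other failure (e.g. ENOMEM, EMFILE). */
static int conversion_supported(const char *tocode, const char *fromcode)
{
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return errno == EINVAL ? 0 : -1;
    iconv_close(cd);
    return 1;
}
```

Note that a successful probe only tells you the name is known. It does
not tell you that two implementations mean the same thing by it, which
is the gap a standardized list would close.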

I would like to suggest that at least all combinations of the following
values for the fromcode and tocode arguments of iconv_open() should have
a fixed portable meaning in the next revision of POSIX:

  ""                   multi-byte encoding of current LC_CTYPE locale
  "UTF-8"              UTF-8 (with overlong sequences being illegal)
  "UTF-16"             UTF-16 (same byte order as C's short)
  "UTF-16BE"           UTF-16 BigEndian
  "UTF-16LE"           UTF-16 LittleEndian
  "UTF-32"             UTF-32 (same byte order as C's long)
  "UTF-32BE"           UTF-32 BigEndian
  "UTF-32LE"           UTF-32 LittleEndian
  "ISO-8859-1"         ISO 8859-1

As long as LC_CTYPE ("") uses either UTF-8 or ISO 8859-1, all these
conversions can be implemented trivially without referring to any
conversion tables (which is why I have not added any other ISO 8859
parts to this list). This minimum requirement would therefore not put
any unreasonable burden on the implementors of even low-memory footprint
implementations.
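
As an illustration of how little is needed, here is a sketch of the
ISO 8859-1 to UTF-8 direction. It works because the 256 ISO 8859-1
codes coincide with U+0000..U+00FF, so the UTF-8 bytes follow from bit
shifts alone. The function name latin1_to_utf8() is my own choice for
this example.

```c
#include <stddef.h>

/* Convert n bytes of ISO 8859-1 to UTF-8 without any lookup table.
 * `out` must have room for 2*n bytes (the worst case).
 * Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* U+0000..U+007F: one byte, identical to ASCII */
            out[o++] = c;
        } else {
            /* U+0080..U+00FF: two bytes, 110000xx 10xxxxxx */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

For example, the two input bytes 0x41 0xE9 ("A" followed by e-acute)
become the three bytes 0x41 0xC3 0xA9. The UTF-16 and UTF-32 variants
are similarly mechanical.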

The above encodings appear in so many file and protocol formats today
that having them supported in the minimally required implementation of
iconv_open() would save a lot of time for developers, who otherwise have
to reinvent the wheel again and again.

A couple of other encoding names could be listed, to define at least
their meaning, but their implementation should certainly be optional, as
they require conversion tables, in particular

  "ISO-8859-2" .. "ISO-8859-16"
  "ISO-6937"

and perhaps even

  "EUC-JP", "EUC-KR", "EUC-TW", "GB18030"

if we can find appropriate quotable references to the formal
specification of these.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
