Bruno Haible wrote on 2004-05-19 10:45 UTC:
> Tomohiro KUBOTA wrote:
> > iconv_t ic = iconv_open("UTF-8",nl_langinfo(CODESET));
>
> When you use GNU libc or GNU libiconv but your platform lacks
> nl_langinfo(CODESET) (like for example FreeBSD 4), then you can use
> the "" alias instead. It has the same meaning: the locale dependent char*
> encoding:
>
> iconv_t ic = iconv_open("UTF-8","");

This convention of interpreting "" as an argument of iconv_open() to
mean the multi-byte encoding specified by the current LC_CTYPE locale
sounds useful enough to me to be added to the POSIX definition of
iconv_open()!
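
To make the idea concrete, here is a minimal sketch of how an application might use it. The helper names to_utf8() and locale_to_utf8() are made up for illustration, and the "" alias is assumed to be available only where the implementation provides it (GNU libc and GNU libiconv, as described above); POSIX does not guarantee it today.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes from the encoding named by
 * fromcode to UTF-8. Returns the number of bytes written to out, or
 * (size_t)-1 if the conversion is unsupported or fails. */
size_t to_utf8(const char *fromcode, const char *in, size_t inlen,
               char *out, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* this pair is not supported here */

    char *inp = (char *)in;     /* iconv() wants char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen, outleft = outsize;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)       /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}

/* Hypothetical wrapper: "" stands for the encoding of the current
 * LC_CTYPE locale. This is an implementation extension, not POSIX. */
size_t locale_to_utf8(const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    return to_utf8("", in, inlen, out, outsize);
}
```

An application would call setlocale(LC_CTYPE, "") first; otherwise the program runs in the "C" locale and "" does not reflect the user's environment.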

In general, the POSIX definition of iconv_open() would become *much*
more useful if it actually specified a couple of encoding strings and
what exactly they mean. At the moment,

  http://www.opengroup.org/onlinepubs/009695399/functions/iconv_open.html

merely states that

  Settings of fromcode and tocode and their permitted combinations
  are implementation-defined.

which means that portable applications can't use iconv_open() for
converting any of the encodings widely used in MIME body parts, for
example.
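
In practice this means that a portable program can only probe at run time whether a given pair of names happens to work. A rough sketch of such a probe (the function name conversion_supported() is invented here):

```c
#include <iconv.h>

/* Hypothetical probe: returns 1 if this implementation accepts the
 * (tocode, fromcode) pair, 0 otherwise. Since POSIX leaves the names
 * implementation-defined, the answer can differ between platforms. */
int conversion_supported(const char *tocode, const char *fromcode)
{
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}
```

A program that gets 0 back for, say, ("UTF-8", "ISO-8859-1") has to carry its own converter as a fallback, which is exactly the duplicated effort a fixed minimum set would avoid.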

I would like to suggest that at least all combinations of the following
values for the fromcode and tocode values of iconv_open() should have a
fixed portable meaning in the next revision of POSIX:

  ""            multi-byte encoding of current LC_CTYPE locale
  "UTF-8"       UTF-8 (with overlong sequences being illegal)
  "UTF-16"      UTF-16 (same byte order as C's short)
  "UTF-16BE"    UTF-16 BigEndian
  "UTF-16LE"    UTF-16 LittleEndian
  "UTF-32"      UTF-32 (same byte order as C's long)
  "UTF-32BE"    UTF-32 BigEndian
  "UTF-32LE"    UTF-32 LittleEndian
  "ISO-8859-1"  ISO 8859-1

As long as LC_CTYPE ("") uses either UTF-8 or ISO 8859-1, all these
conversions can be implemented trivially without referring to any
conversion tables (which is why I have not added any other ISO 8859
parts to this list). This minimum requirement would therefore not put
any unreasonable burden on the implementors of even low-memory footprint
implementations.
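
To illustrate how little is needed, here is a sketch of a table-free ISO 8859-1 to UTF-8 converter. It relies only on the fact that the 256 ISO 8859-1 codes coincide with the first 256 Unicode code points; the function name latin1_to_utf8() is my own and not part of any existing API.

```c
#include <stddef.h>

/* Convert inlen bytes of ISO 8859-1 to UTF-8 without any lookup table.
 * Returns the number of bytes written to out, or (size_t)-1 if out is
 * too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out, size_t outsize)
{
    size_t o = 0;
    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* U+0000..U+007F: one byte, unchanged */
            if (o + 1 > outsize)
                return (size_t)-1;
            out[o++] = c;
        } else {
            /* U+0080..U+00FF: two bytes, 110000xx 10xxxxxx */
            if (o + 2 > outsize)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The UTF-16 and UTF-32 variants come down to similar bit shuffling plus a byte-order choice, so none of the nine names in the list needs stored tables.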

The above encodings appear in so many file and protocol formats today
that having them supported in the minimally required implementation of
iconv_open() would save a lot of time for developers, who otherwise have
to reinvent the wheel again and again.

A couple of other encoding names could be listed, to define at least
their meaning, but their implementation should certainly be optional, as
they require conversion tables, in particular

  "ISO-8859-2" .. "ISO-8859-16"
  "ISO-6937"

and perhaps even

  "EUC-JP", "EUC-KR", "EUC-TW", "GB18030"

if we can find appropriate quotable references to the formal
specification of these.

Markus
--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__