I want to provide an equivalent of wcwidth for Haskell. Unicode only.

Sometimes I can use wcwidth from C. Sometimes I can use my own
implementation, like <http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>
(which should be simple because I already have the character database).
I don't know of other alternatives.

I need to know which implementation to use. I am currently thinking
about using wcwidth when it's available and __STDC_ISO_10646__ is
defined, and using a private implementation otherwise.
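
To make this concrete, here is a minimal sketch of the selection I
have in mind, assuming configure defines HAVE_WCWIDTH when it finds
the function (the macro and the helper name are my inventions):

    /* Use the system wcwidth only when wchar_t is known to hold
       ISO 10646 code points; otherwise fall back to a private
       implementation built from the character database. */
    #if defined(HAVE_WCWIDTH) && defined(__STDC_ISO_10646__)
    #define _XOPEN_SOURCE 500
    #include <wchar.h>
    #define char_width(c) wcwidth((wchar_t)(c))
    #else
    int private_wcwidth(unsigned int ucs);  /* hypothetical helper */
    #define char_width(c) private_wcwidth(c)
    #endif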

Is there a better strategy? The one above implies that the private
implementation will be used under glibc-2.1.3, even though wcwidth
is present there.

Maybe, instead of - or in addition to - checking for
__STDC_ISO_10646__, the configure script should somehow test whether
wcwidth behaves as it should?
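
For example, configure could compile and run something like the
following sketch; the exact locale name is a guess and would itself
need probing, and the test relies on what __STDC_ISO_10646__ asserts
(wchar_t values being ISO 10646 code points):

    #define _XOPEN_SOURCE 500
    #include <wchar.h>
    #include <locale.h>

    /* Exit 0 iff wcwidth gives the expected answers for characters
       whose widths are not in doubt. */
    int main(void)
    {
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
            return 1;                        /* cannot even test */
        if (wcwidth(L'a') != 1) return 1;    /* plain ASCII letter */
        if (wcwidth(0x4E00) != 2) return 1;  /* CJK ideograph: double */
        if (wcwidth(0x0301) != 0) return 1;  /* combining acute: zero */
        return 0;
    }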

                        *       *       *

Another question. As you suggested, I am using iconv for the
conversion between the local byte encoding and Unicode (falling back
to ISO-8859-1 if iconv is unavailable or unusable). To do this, the
configure script needs to find out which flavors of Unicode iconv
provides.

The current strategy is as follows. First, whether iconv can be used
at all: try to compile and run a test program that #includes <iconv.h>
and converts between "ISO-8859-1" and "ISO-8859-1", first without
linking any extra libraries, then with -liconv. Actually performing a
conversion seems necessary because, for example, installing Konstantin
Chuguev's iconv onto glibc-2.1.3 and using its <iconv.h> without
-liconv produces programs that dump core, since his macros end up
combined with glibc's functions.
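
In code, the probe looks roughly like this; if it compiles, links
(first plainly, then with -liconv) and exits with status 0, iconv is
considered usable. (The const-ness of iconv's second argument also
varies between implementations, which the real test would have to
tolerate.)

    #include <stddef.h>
    #include <iconv.h>

    int main(void)
    {
        char inbuf[] = "x";
        char outbuf[8];
        char *in = inbuf, *out = outbuf;
        size_t inleft = 1, outleft = sizeof outbuf;
        iconv_t cd;

        /* Even the identity conversion exercises the real library,
           not just the header. */
        cd = iconv_open("ISO-8859-1", "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return 1;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            return 1;
        iconv_close(cd);
        return outbuf[0] == 'x' ? 0 : 1;
    }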

Then, how to talk to it. I run test programs that try to convert the
string "\300" from "ISO-8859-1" to encodings named "wchar_t",
"UCS-4-INTERNAL", "UCS-4" and "UTF-8". For each of them I check
whether the result looks like one of: UCS-4 in native endianness,
UCS-4 in big-endian order, or UTF-8. I then use the first name found
to denote one of these encodings, in my order of preference. If none
is found, I fall back to ISO-8859-1.
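
One such probe might look like the sketch below, with the candidate
name passed as argv[1]. The classification relies on "\300" being
U+00C0, whose UTF-8 form is the two bytes 0xC3 0x80; the native-
endianness check assumes a 32-bit unsigned int.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    int main(int argc, char **argv)
    {
        char inbuf[] = "\300";
        char outbuf[16];
        char *in = inbuf, *out = outbuf;
        size_t inleft = 1, outleft = sizeof outbuf;
        size_t n;
        unsigned int native = 0xC0;
        iconv_t cd;

        if (argc < 2)
            return 2;
        cd = iconv_open(argv[1], "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return 1;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            return 1;
        n = sizeof outbuf - outleft;

        if (n == 4 && memcmp(outbuf, "\0\0\0\300", 4) == 0)
            puts("UCS-4, big-endian");   /* also native on BE hosts */
        else if (n == 4 && memcmp(outbuf, &native, 4) == 0)
            puts("UCS-4, native endianness");
        else if (n == 2 && memcmp(outbuf, "\303\200", 2) == 0)
            puts("UTF-8");
        else
            puts("unrecognized");
        return 0;
    }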

This scheme seems to work with the iconv implementations from
glibc-2.1.3, Bruno Haible and Konstantin Chuguev. (Even though
"UCS-4" means native endianness in one of them and big-endian order
in the others.)

If <langinfo.h> is present, I use nl_langinfo(CODESET) to determine
the local encoding; if not, I fall back to ISO-8859-1. Perhaps this
should be improved, but the nl_langinfo(CODESET) replacement I've
seen in GNU fileutils is messy. Which systems support iconv but not
nl_langinfo?
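
For reference, the straightforward version, assuming configure
defines a HAVE_LANGINFO_H macro (the name is mine):

    #include <locale.h>
    #ifdef HAVE_LANGINFO_H
    #include <langinfo.h>
    #endif

    /* Name of the locale's encoding, or ISO-8859-1 if unknown. */
    const char *local_charset(void)
    {
        setlocale(LC_CTYPE, "");      /* switch to the user's locale */
    #ifdef HAVE_LANGINFO_H
        {
            const char *cs = nl_langinfo(CODESET);
            if (cs != NULL && cs[0] != '\0')
                return cs;
        }
    #endif
        return "ISO-8859-1";
    }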

So much for configure time. At runtime: if iconv refuses to convert
between the charsets determined in the above way, fall back to
ISO-8859-1. Note that since this conversion will be used by default
in all I/O, it absolutely must do something sensible at least for
ASCII, and preferably just pass other characters through unmodified
when in trouble.
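
The intended runtime logic is roughly this sketch, where UNICODE_NAME
stands for whatever name configure settled on (a placeholder, not an
existing macro):

    #include <iconv.h>

    #ifndef UNICODE_NAME
    #define UNICODE_NAME "UCS-4"   /* placeholder for configure's pick */
    #endif

    /* Open a conversion from the local charset to Unicode, degrading
       to ISO-8859-1 (which at least keeps ASCII intact) on failure. */
    iconv_t open_to_unicode(const char *local)
    {
        iconv_t cd = iconv_open(UNICODE_NAME, local);
        if (cd == (iconv_t)-1)
            cd = iconv_open(UNICODE_NAME, "ISO-8859-1");
        return cd;                 /* may still fail; caller checks */
    }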

Does anybody have a better idea? What other encodings, ways of
linking, or nl_langinfo replacements are worth trying? I assume that
if iconv works at all, it provides "ISO-8859-1" under this name.

A general disadvantage is that every binary compiled with a Haskell
compiler that eventually uses this stuff will depend on iconv.
Hopefully that will not bite people in practice; such binaries
already depend on libgmp.

I guess that iconv is not used on Windows at all. The Windows side
will have to be implemented by another person, somebody who knows how
to do it and can test it.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK
