Juliusz Chroboczek wrote:
> I think many people agree that it is good to
> explore various ways of proceeding before committing to a single one.
> ...
> Feel free to use iconv in your code, but I'm not going to use it in
> mine.
Have you explored various ways of doing the conversion, including
iconv?
> The second reason is that I personally dislike iconv, and feel that
> such a beast has no place in libc.
In any case, iconv is in POSIX, and on Linux systems, it is in libc.
Whether you like it or not.
Now that it is in libc, it is desirable for all programs to use it,
as far as possible, for the following reasons:
1) The risk of disagreement. Nothing's worse than a character that
looks different to different programs. If converters are
put into several programs, there is a risk of disagreeing
tables.
2) Ease of upgrading. Do you feel people who need new encodings,
like CP1251 or GB18030, should touch every piece of software
in order to get their locale working?
3) Memory consumption. On a virtual memory Unix, the
/usr/lib/gconv/ENCODING.so converter will be present in memory
once only, regardless of how many programs use it. But
if every program comes with its own BIG5 conversion table
(possibly even allocated in the data or malloc segment, not in the
text segment), memory consumption goes up unnecessarily.
> iconv is not designed for live streams,
> but for converting static strings; thus, it does not deal with
> resynchronisation well. This is not simply an implementation issue --
> iconv does not provide the necessary interfaces to deal with
> resynchronisation. (Or, more exactly, it does not provide all the
> necessary interfaces.)
> ...
> Because luit contains carefully hand-crafted resynchronisation code.
> While I have not proved it, I believe that the current implementation
> of resynchronisation for Big 5 is optimal within the constraints of
> one byte of memory and no lookahead.
If you say so, I believe you that resynchronization in luit is best
done this way, using the structure of the encoding.
Still it seems you could get rid of the problems (1) and (3) above by
using your existing infrastructure for determining the character
boundaries and for resynchronization, but using iconv() for doing the
conversion from/to Unicode. That would at least eliminate two problems
out of three.
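
A minimal sketch of that split, assuming the caller's own code has
already found the boundaries of one complete character and only the
table lookup is handed to iconv(). The helper name convert_one_char
is hypothetical and is not part of luit; only iconv() itself is the
standard interface. The reset before each call assumes a stateless
source encoding such as Big5.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from luit): convert exactly one character,
 * whose boundaries the caller has already determined, using an open
 * conversion descriptor. Returns the number of output bytes written,
 * or -1 if the bytes are invalid, incomplete, or do not fit in out. */
int convert_one_char(iconv_t cd, const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    /* iconv() takes char ** for the input pointer; the input bytes
     * themselves are only read. */
    char *inp = (char *)in;
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = outcap;

    /* Return the descriptor to its initial state, so that a failed
     * earlier call cannot influence this one. */
    iconv(cd, NULL, NULL, NULL, NULL);

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;   /* errno is EILSEQ, EINVAL or E2BIG */
    if (inleft != 0)
        return -1;   /* caller passed more than one character */

    return (int)(outcap - outleft);
}
```

A caller would open the descriptor once, for example with
iconv_open("UTF-8", "BIG5"), and keep it for the lifetime of the
stream. Whether the name "BIG5" is accepted depends on the converters
installed on the system. When the helper returns -1, the caller's own
resynchronization code decides how many bytes to skip.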
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/