I recently contributed some UTF-8 support to a couple of projects,
which I will describe in case anyone has any advice for me.

http://sourceforge.net/projects/freedb/

This is a cddbp database server. You give it the precise track lengths
of a CD and it will supply the track titles if someone has already
entered them, or you can contribute them yourself. For example, Debian
has a script called abcde for converting an entire CD to Ogg Vorbis
files which queries a cddbp server automatically so that it can add
tags to the Ogg Vorbis files for you.

There are various ways of communicating with the server, but all of
them include an explicit protocol level except e-mail, which has MIME.
Up until now ISO-8859-1 has been prescribed. My proposal is to define
protocol level 6 to be the same as 5 but with UTF-8 prescribed. The
server takes care of charset conversion and can be configured to
automatically detect the encoding of disc files, so an existing
database can be used without conversion but new files can be added in
UTF-8.

When UTF-8 data is supplied to an ISO-8859-1 client the server has to
transliterate. The first problem is to provide a good transliteration
table: glibc and libiconv don't transliterate Cyrillic, I think, so
can anyone recommend such a table? The second problem is to avoid
transliterated data being edited by a user then recontributed as a
correction. Ideally we wouldn't accept an ISO-8859-1 update to a file
that contains non-ISO-8859-1, but unfortunately updates are merged
off-line by a different process, which means it would be messy to
implement, so we might just make do with including a warning in the CD
title when data has been transliterated approximately and trusting the
user to understand it.
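To make concrete what kind of table I'm after, here is a hypothetical
fragment in C: code point to ASCII approximation, using common
romanizations. The mappings shown are my own guesses and would need
checking by someone who knows the conventions; the real table would of
course be much larger.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fragment of a Cyrillic transliteration table:
 * Unicode code point -> ASCII approximation.  Mappings here follow
 * common romanization practice but are unreviewed guesses. */
struct translit { unsigned int ucs; const char *ascii; };

static const struct translit cyrillic_translit[] = {
    { 0x0410, "A" },  { 0x0411, "B" },  { 0x0412, "V" },
    { 0x0413, "G" },  { 0x0414, "D" },  { 0x0416, "Zh" },
    { 0x0427, "Ch" }, { 0x0428, "Sh" }, { 0x0429, "Shch" },
    { 0x042E, "Yu" }, { 0x042F, "Ya" },
};

/* Return the ASCII approximation for a code point, or NULL if the
 * table has no entry (caller falls back to '?'). */
static const char *translit_lookup(unsigned int ucs)
{
    size_t n = sizeof cyrillic_translit / sizeof cyrillic_translit[0];
    for (size_t i = 0; i < n; i++)
        if (cyrillic_translit[i].ucs == ucs)
            return cyrillic_translit[i].ascii;
    return NULL;
}
```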

http://www.xiph.org/ogg/vorbis/

This is the free replacement for MP3. The Ogg Vorbis format prescribes
UTF-8, but data has to be converted for the client. My suggestion to
require iconv was not welcomed, so I provided both a converter using
iconv and a simple built-in one with a config test to choose between
them. The built-in converter handles UTF-8 and 8-bit encodings. It would
be useful if anyone could provide a list of 8-bit encodings worth
including. An encoding is worth including if it is widely used by
people who don't have iconv, and a name of such an encoding is worth
including if it might be returned by nl_langinfo(CODESET) on a system
without iconv.

At present the code uses nl_langinfo(CODESET), where available, to get
the user's charset. Otherwise it looks at the environment variable
CHARSET. Otherwise it assumes US-ASCII. In general, when converting,
illegal input bytes are replaced by '#' and unrepresentable characters
are replaced by '?'.
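The charset lookup just described can be sketched as follows; the
HAVE_LANGINFO_CODESET macro stands in for whatever the real configure
test defines, so treat the names as illustrative:

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

#define HAVE_LANGINFO_CODESET 1  /* stand-in for the configure test */

/* Sketch of the lookup order described above: nl_langinfo(CODESET)
 * where available, then the CHARSET environment variable, then
 * US-ASCII as a last resort.  nl_langinfo() reflects the locale set
 * by a prior setlocale(LC_CTYPE, ""). */
static const char *current_charset(void)
{
    const char *cs = NULL;
#if HAVE_LANGINFO_CODESET
    cs = nl_langinfo(CODESET);
#endif
    if (cs == NULL || *cs == '\0')
        cs = getenv("CHARSET");
    if (cs == NULL || *cs == '\0')
        cs = "US-ASCII";
    return cs;
}
```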

The function to convert a buffer using iconv is about 200 lines of C,
mainly because of faults in the design of iconv's API, which mean you
have to convert the data 3 times: you have to go via UTF-8 to
distinguish the '#' and '?' cases, and you have to convert from UTF-8
twice to avoid having E2BIG mask the return value telling you that the
conversion was inexact. Also, I have to support both the standard
iconv and the various versions provided by glibc/libiconv, so I'm not
totally happy with iconv.
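For comparison, here is a much-simplified sketch of the '#'/'?'
behaviour using a single iconv pass. It sidesteps the three-pass
problem by exploiting the fact that the input is known to be UTF-8, so
on EILSEQ it can distinguish the two cases by inspecting the lead byte
itself; real code would also validate continuation bytes and grow the
output buffer on E2BIG, which is part of why the full function is so
much longer:

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Sketch: convert UTF-8 `in` to charset `tocode`, replacing
 * characters the target cannot represent with '?' and illegal input
 * bytes with '#'.  Simplified: no buffer growth, no continuation-byte
 * validation.  Returns 0, or -1 if iconv_open() fails. */
static int utf8_to_charset(const char *tocode, const char *in,
                           char *out, size_t outsize)
{
    iconv_t cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv's API is not const-clean */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;  /* reserve room for the NUL */

    while (inleft > 0 && outleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                 /* everything converted */
        if (errno == E2BIG)
            break;                 /* output buffer full: truncate */

        /* EILSEQ (or EINVAL at end of input): inp points at the
         * offending bytes.  Work out the UTF-8 sequence length. */
        unsigned char c = (unsigned char)*inp;
        size_t skip = 0;
        if ((c & 0xE0) == 0xC0) skip = 2;
        else if ((c & 0xF0) == 0xE0) skip = 3;
        else if ((c & 0xF8) == 0xF0) skip = 4;

        if (errno == EILSEQ && skip != 0 && skip <= inleft) {
            *outp++ = '?';         /* valid UTF-8, unrepresentable */
        } else {
            *outp++ = '#';         /* broken input byte */
            skip = 1;
        }
        outleft--;
        inp += skip;
        inleft -= skip;
    }
    *outp = '\0';
    iconv_close(cd);
    return 0;
}
```

Converting "caf\xC3\xA9\xFF!" to ASCII with this gives "caf?#!": the
e-acute is valid UTF-8 but unrepresentable, while the stray 0xFF is
illegal input.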

Edmund
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
