I recently contributed some UTF-8 support to a couple of projects, which I will describe in case anyone has any advice for me.
http://sourceforge.net/projects/freedb/

This is a cddbp database server. You give it the precise track lengths of a CD and it supplies the track titles if someone has already entered them, or you can contribute them yourself. For example, Debian has a script called abcde for converting an entire CD to Ogg Vorbis files; it queries a cddbp server automatically so that it can add tags to the Ogg Vorbis files for you.

There are various ways of communicating with the server, and all of them include an explicit protocol level except e-mail, which has MIME. Up until now ISO-8859-1 has been prescribed. My proposal is to define protocol level 6 to be the same as level 5 but with UTF-8 prescribed (see the example session below). The server takes care of charset conversion and can be configured to detect the encoding of disc files automatically, so an existing database can be used without conversion while new files are added in UTF-8.

When UTF-8 data is supplied to an ISO-8859-1 client, the server has to transliterate. The first problem is finding a good transliteration table: glibc and libiconv don't transliterate Cyrillic, I think, so can anyone recommend such a table? (See the table sketch below for the sort of thing I mean.) The second problem is to prevent transliterated data from being edited by a user and then recontributed as a correction. Ideally we would refuse an ISO-8859-1 update to a file that contains characters outside ISO-8859-1, but updates are merged off-line by a different process, which would make that messy to implement, so we may just include a warning in the CD title whenever data has been transliterated approximately and trust the user to understand it.

http://www.xiph.org/ogg/vorbis/

This is the free replacement for MP3. The Ogg Vorbis format prescribes UTF-8, but data has to be converted for the client. My suggestion to require iconv was not welcomed, so I provided both a converter using iconv and a simple built-in one, with a configure test to choose between them. The built-in converter handles UTF-8 and 8-bit encodings. It would be useful if anyone could provide a list of 8-bit encodings worth including. An encoding is worth including if it is widely used by people who don't have iconv, and a name for such an encoding is worth including if it might be returned by nl_langinfo(CODESET) on a system without iconv.

At present the code uses nl_langinfo(CODESET), where available, to get the user's charset. Otherwise it looks at the environment variable CHARSET, and otherwise it assumes US-ASCII (see the fallback sketch below). In general, when converting, illegal input bytes are replaced by '#' and unrepresentable characters are replaced by '?'.

The function to convert a buffer using iconv is about 200 lines of C, mainly because of faults in the design of iconv's API, which force you to convert the data three times: you have to go via UTF-8 to distinguish the '#' and '?' cases, and you have to convert from UTF-8 twice to avoid having E2BIG mask the return value that tells you the conversion was inexact (see the loop sketch below). Also, I have to support both the standard iconv and the various versions provided by glibc/libiconv, so I'm not totally happy with iconv.
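To make the protocol-level proposal concrete, here is the sort of session I have in mind ("->" is the client, "<-" is the server). The response texts are invented for illustration; only the proto command and the level-6 (UTF-8) semantics matter:

    -> cddb hello joe somehost.example.org abcde 2.1
    <- 200 Hello and welcome, joe.
    -> proto 6
    <- 201 OK, CDDB protocol level now: 6
    -> cddb read misc 7c0a1e09
    <- 210 misc 7c0a1e09
    <- DTITLE=<artist / title, now guaranteed to be UTF-8>
    <- ...
    <- .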
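By a transliteration table I mean something along these lines. The entries are only illustrative, and choosing good spellings for the whole Cyrillic repertoire is exactly the part I'd like advice on:

/* Map a Unicode code point to an approximate ASCII spelling.
 * Entries are illustrative only. */
struct translit {
    unsigned int ucs;   /* Unicode code point */
    const char *ascii;  /* approximate ASCII replacement */
};

static const struct translit translit_table[] = {
    { 0x0414, "D"    },  /* CYRILLIC CAPITAL LETTER DE */
    { 0x0416, "Zh"   },  /* CYRILLIC CAPITAL LETTER ZHE */
    { 0x0429, "Shch" },  /* CYRILLIC CAPITAL LETTER SHCHA */
    { 0x0434, "d"    },  /* CYRILLIC SMALL LETTER DE */
    { 0x0436, "zh"   },  /* CYRILLIC SMALL LETTER ZHE */
    { 0x0449, "shch" },  /* CYRILLIC SMALL LETTER SHCHA */
};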
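The charset-detection fallback is roughly this. HAVE_LANGINFO_CODESET stands for whatever the configure test defines; treat the whole thing as a sketch rather than the actual code:

#include <stdlib.h>
#ifdef HAVE_LANGINFO_CODESET    /* defined (or not) by the configure test */
#include <langinfo.h>
#endif

/* Return the name of the user's charset.  Assumes the caller has
 * already done setlocale(LC_CTYPE, ""). */
static const char *user_charset(void)
{
    const char *cs = NULL;

#ifdef HAVE_LANGINFO_CODESET
    cs = nl_langinfo(CODESET);
#endif
    if (cs == NULL || *cs == '\0')
        cs = getenv("CHARSET");     /* systems without nl_langinfo() */
    if (cs == NULL || *cs == '\0')
        cs = "US-ASCII";            /* last resort */
    return cs;
}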
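And the shape of the iconv loop, heavily abridged. This only shows the '#' replacement and where the E2BIG problem bites; the extra pass via UTF-8 that distinguishes '#' from '?', and the output-buffer growing, are omitted:

#include <errno.h>
#include <iconv.h>

/* Convert in (inlen bytes) into out (outlen bytes) using cd,
 * replacing illegal input bytes with '#'.  Returns 1 if the
 * conversion was inexact, 0 if exact, -1 on hard failure. */
static int convert_buffer(iconv_t cd, char *in, size_t inlen,
                          char *out, size_t outlen)
{
    int inexact = 0;

    while (inlen > 0) {
        /* Some iconv declarations want (const char **) for arg 2,
         * which is one of the portability headaches mentioned above. */
        size_t r = iconv(cd, &in, &inlen, &out, &outlen);

        if (r != (size_t)-1) {
            if (r > 0)           /* r counts irreversible conversions */
                inexact = 1;
            break;
        }
        if (errno == EILSEQ) {   /* illegal byte at *in: emit '#', skip it */
            if (outlen == 0)
                return -1;
            *out++ = '#';
            outlen--;
            in++;
            inlen--;
        } else if (errno == E2BIG) {
            /* Output buffer full.  The irreversible-conversion count
             * from this call is lost, which is why the real code has
             * to convert from UTF-8 twice. */
            return -1;
        } else {
            return -1;           /* EINVAL: truncated input sequence */
        }
    }
    return inexact;
}

Edmund

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/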
