[ this is in response to a truly ancient linux-utf8 thread ]

i wrote a patch that provides UTF-8 + binary in one codec with no hand-waving, using Markus Kuhn's brilliant proposal to encode each invalid byte 0xyz as the unpaired surrogate U+DCyz. this means there need not be a text/binary distinction for programs that use UTF-8. legal UTF-8 decodes and encodes as usual; any byte that is not part of a legal UTF-8 sequence is carried as an "opaque" U+DCyz on input and serialized back to the original byte on output. so one can once again consider editing a binary format with a "notepad"-type editor without sacrificing internationalization support.
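
for illustration, here is a minimal sketch of the same byte-to-surrogate mapping. it is not the libiconv patch: it leans on Python's built-in "surrogateescape" error handler, which maps each undecodable byte 0xyz to U+DCyz and back. the helper names decode_utf8b and encode_utf8b are made up for this example.

```python
# Sketch only: uses Python's "surrogateescape" handler, not the libiconv patch.
# decode_utf8b / encode_utf8b are hypothetical helper names for this example.

def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8; each byte that is not valid UTF-8 becomes U+DC00 + byte."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Encode to UTF-8; each U+DC80..U+DCFF becomes the original raw byte."""
    return text.encode("utf-8", errors="surrogateescape")


# Legal UTF-8 ("café") mixed with two bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\x80 ok"
text = decode_utf8b(raw)

# Show the code points: the stray bytes appear as 0xdcff and 0xdc80.
print([hex(ord(c)) for c in text])

# The round trip restores the exact original bytes.
assert encode_utf8b(text) == raw
```

the round trip holds for any byte string, which is the property that would let an editor load and save a binary file unchanged.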
Markus Kuhn's description of the idea (search for "option d"):
http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

the patch:
http://xent.com/~bsittler/libiconv-1.9.1-utf-8b.diff

enjoy! (not sure how/whether this fits into the official distro, but i hope it gets used)

-ben

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
