Re: Binary transparency lost in UTF-8 tools

Henry Spencer Sun, 06 Jul 2003 12:59:03 -0700

On Sun, 6 Jul 2003, Beni Cherniavsky wrote:
> One thing that worries me about UTF-8 is that filters that previously
> could handle any binary input (at least in their GNU incarnations ;)
> now abort on any binary garbage since it's not legal UTF-8.  I'd like
> to see non-abortive modes.


In practice, you really need a text/binary distinction for I/O, and
having one takes care of this as a lesser issue.  But even for text-only
programs, there is some use for a "handle malformed input" option.

> ...it would sometimes be more useful to pass all
> invalid data as-is.  A dirty hack to tunnel them through UTF-8
> decoding->encoding would be to map each byte of an invalid sequence
> to one of some 128 Unicode codepoints (e.g. from the PUA)...

You really need some sort of escape convention, so if one of those
codepoints legitimately comes up, you can handle it (perhaps by pretending
it is an invalid sequence!).  Arguably, the same issue arises for things
like U+FFFF, which are guaranteed to be non-characters and hence should
never appear in input.
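
A minimal sketch of that escape convention, in Python, might look like
the following.  Everything specific here is an assumption made for
illustration and not something proposed in this thread: the range
U+E000..U+E07F as the 128 escape codepoints, and the names ESC_BASE,
decode_transparent and encode_transparent.  A legitimate escape
codepoint in the input is treated exactly like an invalid sequence, so
each of its UTF-8 bytes is escaped separately.

```python
# Hypothetical choice: the 128 private-use code points U+E000..U+E07F
# stand in for the raw bytes 0x80..0xFF.
ESC_BASE = 0xE000


def _escape(byte):
    """Map a raw byte in 0x80..0xFF to its escape code point."""
    return chr(ESC_BASE + (byte - 0x80))


def decode_transparent(data):
    """Decode bytes as UTF-8 without aborting on malformed input."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into U+DCNN,
    # which lets us find the invalid bytes after decoding.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An invalid byte: tunnel it through an escape code point.
            out.append(_escape(cp - 0xDC00))
        elif ESC_BASE <= cp < ESC_BASE + 0x80:
            # A legitimate escape code point: pretend it was an invalid
            # sequence and escape each of its UTF-8 bytes.
            out.extend(_escape(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def encode_transparent(text):
    """Inverse of decode_transparent: restore the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_BASE <= cp < ESC_BASE + 0x80:
            out.append(cp - ESC_BASE + 0x80)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With these two functions, encode_transparent(decode_transparent(data))
should give back data unchanged for any byte string, since every byte
of a UTF-8 sequence for a codepoint in the escape range is itself in
0x80..0xFF and so has an escape.  The price is that a filter sees a
legitimate U+E000..U+E07F character as three escape codepoints rather
than one.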

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
