Glad it was rejected. The only really sensible approach I have yet
seen is utf-8b (see my take on it here:
http://bsittler.livejournal.com/10381.html and another implementation
here: http://hyperreal.org/~est/utf-8b/ )

The utf-8b approach is superior to many others in that binary data is
preserved, yet no control characters are injected. Instead it is an
extension of UTF-8 that accepts all byte sequences, both those that
are valid UTF-8 and those that are not. When converting UTF-8 <->
UTF-16, each byte of an invalid UTF-8 sequence corresponds to an
unpaired UTF-16 surrogate (byte 0xNN <-> U+DCNN). The correspondence
is 1-1, so data is never lost. Valid surrogate pairs are unaffected
(and are still used for characters outside the BMP).
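As a minimal sketch: Python's built-in "surrogateescape" error handler
(PEP 383) implements essentially this mapping, so it can be used to
demonstrate the round trip without writing the converter by hand:

```python
# Each byte of an invalid UTF-8 sequence (0x80..0xFF) decodes to a lone
# surrogate U+DC80..U+DCFF; re-encoding restores the exact original bytes.
raw = b"valid \xe2\x98\x83 then invalid \xff\xfe bytes"

text = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in text and "\udcfe" in text  # invalid bytes -> lone surrogates

back = text.encode("utf-8", errors="surrogateescape")
assert back == raw  # 1-1 correspondence: no data lost
```

The valid snowman character (U+2603) passes through untouched, while
the stray 0xFF and 0xFE bytes survive the round trip as U+DCFF and
U+DCFE.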

I realize I've mentioned this before, but I feel I should bring it up
whenever someone proposes a non-data-preserving scheme (like
converting everything invalid to U+FFFD REPLACEMENT CHARACTER) or an
actively harmful one (like converting invalid bytes into U+001A SUB,
which has well-defined and sometimes destructive semantics).
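For contrast, the lossiness of the U+FFFD approach is easy to show
with Python's "replace" error handler: distinct invalid inputs
collapse to the same string, so the original bytes are unrecoverable.

```python
# Two filenames differing only in their invalid trailing byte...
a = b"name\xff"
b = b"name\xfe"

# ...become indistinguishable once each invalid byte is replaced
# with U+FFFD, which is exactly the data loss criticized above.
assert a.decode("utf-8", errors="replace") == b.decode("utf-8", errors="replace")
assert a.decode("utf-8", errors="replace") == "name\ufffd"
```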

On 4/27/07, Christopher Fynn <[EMAIL PROTECTED]> wrote:
Rich Felker wrote:
> On Fri, Apr 27, 2007 at 05:15:16PM +0600, Christopher Fynn wrote:
>> N3266 was discussed and rejected by WG2 yesterday. As you pointed out
>> there are all sorts of problems with this proposal, and accepting it
>> would break many existing implementations.

> That's good to hear. In followup, I think the whole idea of trying to
> standardize error handling is flawed. What you should do when
> encountering invalid data varies a lot depending on the application.
> For filenames or text file contents you probably want to avoid
> corrupting them at all costs, even if they contain illegal sequences,
> to avoid catastrophic data loss or vulnerabilities. On the other hand,
> when presenting or converting data, there are many approaches that are
> all acceptable. These include dropping the corrupt data, replacing it
> with U+FFFD, or even interpreting the individual bytes according to a
> likely legacy codepage. This last option is popular for example in IRC
> clients and works well to deal with the stragglers who refuse to
> upgrade their clients to use UTF-8. Also, some applications may wish
> to give fatal errors and refuse to process data at all unless it's
> valid to begin with.
>
> Rich


Yes. Someone who was there tells me the main reason it was rejected
was that it was considered out of scope for ISO 10646, or even
Unicode, to dictate what a process should do in an error condition
(throw an exception, and so on). The UTF-8 validity specification is
expressed in terms of what constitutes a valid string or substring,
rather than what a process must do in a given condition. Neither
standard wants to get into the business of standardizing API-level
behavior such as what processes should do.

- Chris
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
>

