Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

Ben Wiley Sittler Fri, 27 Apr 2007 14:35:45 -0700

yes, i agree. the utf-8b approach is useful mainly when sending binary
data through a utf-16 channel with the hope of recovering it at the
far side. once byte string or character string manipulations are
performed, all bets are off.


On 4/27/07, Rich Felker <[EMAIL PROTECTED]> wrote:

On Fri, Apr 27, 2007 at 12:41:22PM -0700, Ben Wiley Sittler wrote:
> glad it was rejected. the only really sensible approach i have yet
> seen is utf-8b (see my take on it here:
> http://bsittler.livejournal.com/10381.html and another implementation
> here: http://hyperreal.org/~est/utf-8b/ )
>
> the utf-8b approach is superior to many others in that binary is
> preserved, but it does not inject control characters. instead it is an
> extension to utf-8 that allows all byte sequences, both those that are
> valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
> each byte of an invalid utf-8 sequence maps to/from one unpaired utf-16
> surrogate.
> the correspondence is 1-1, so data is never lost. valid paired
> surrogates are unaffected (and are used for characters outside the
> bmp.)
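
a minimal sketch of the mapping described above, not taken from either
linked implementation. it assumes python 3, whose "surrogateescape" error
handler (pep 383) is, as far as i know, the same idea: each byte 0x80..0xff
that is not part of valid utf-8 becomes the lone low surrogate
U+DC00 + byte. the helper names to_text and to_bytes are made up for
illustration.

```python
def to_text(raw: bytes) -> str:
    # Valid utf-8 decodes normally; every stray byte becomes one lone surrogate.
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # Inverse direction: lone surrogates U+DC80..U+DCFF turn back into bytes.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xc3\xa9 \xff\xfe end"      # valid utf-8 followed by two stray bytes
text = to_text(raw)
assert text == "caf\u00e9 \udcff\udcfe end"
assert to_bytes(text) == raw           # nothing was lost

# The lone surrogates also survive a trip through utf-16 code units.
utf16 = text.encode("utf-16-le", errors="surrogatepass")
assert utf16.decode("utf-16-le", errors="surrogatepass") == text
```

the round trip holds only as long as the string is left untouched in between.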

this approach is perhaps reasonable for applications that want to use
utf-16 internally without corrupting invalid sequences in utf-8, but
it has problems too. for example it's not stable under string
concatenation or substring operations.
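
a small sketch of the concatenation problem, assuming python 3 and its
"surrogateescape" error handler as a stand-in for utf-8b (an assumption, not
something either poster used). the helper names are hypothetical.

```python
def to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    return text.encode("utf-8", errors="surrogateescape")


def roundtrip(text: str) -> str:
    # Convert to bytes and back, as a utf-16 application would on save and load.
    return to_text(to_bytes(text))


# The two halves of utf-8 "é" (C3 A9); each half alone is an invalid sequence.
a, b = b"\xc3", b"\xa9"
joined = to_text(a) + to_text(b)       # concatenate on the string side

assert joined == "\udcc3\udca9"        # two escaped bytes
assert to_bytes(joined) == b"\xc3\xa9" # but together they are valid utf-8
assert roundtrip(joined) == "\u00e9"   # so they come back as one character
assert roundtrip(joined) != joined     # the 1-1 correspondence is gone
```

the same thing happens in reverse with substrings: cutting a byte string in
the middle of a character yields pieces that decode to escaped bytes rather
than to part of the original text.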

the whole reason utf-8 is usable comes from its self-synchronizing
property and the property that one character is never a substring of
another character. this necessarily forces the encoding to treat some
strings as invalid; that is, it's provably impossible to make an
encoding with the required properties where all strings are valid. as
a consequence, any treatment of invalid sequences as if they were
'special characters', like utf-8b does, will break all of the
essential properties. for some applications this may not matter; for
others it would be disastrous. it's certainly not possible to do such
a thing at the C library level (mb*towc family) without causing all
sorts of breakage.
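
to illustrate the self-synchronizing property, here is a minimal python
sketch. it assumes the buffer is valid utf-8, and the function name
char_start is made up for this example.

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing buf[i]."""
    # Continuation bytes look like 10xxxxxx; lead bytes and ascii never do,
    # so backing up over continuation bytes always lands on a character start.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


buf = "a\u20acb".encode("utf-8")       # b"a\xe2\x82\xacb"
assert char_start(buf, 2) == 1         # middle of the euro sign
assert char_start(buf, 3) == 1         # last byte of the euro sign
assert char_start(buf, 4) == 4         # ascii "b" is its own start

# No character's encoding occurs inside another character's encoding, so a
# plain byte search can only match at a character boundary.
assert buf.find("\u20ac".encode("utf-8")) == 1
assert buf.find(b"b") == 4
```

if a stray byte such as b"\x82" were treated as a character in its own right,
it would be a substring of the euro sign's encoding b"\xe2\x82\xac", and the
byte search above could no longer be trusted.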

my view is that it's best to just leave the data in its original utf-8
form and not do conversions until 'just in time', for presentation,
character identification, etc. caching this 'presentation' form
alongside the data may be appropriate for many applications.
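
one way this could look, as a rough python sketch. the class name Utf8Data
and its attributes are hypothetical, and replacing invalid bytes with U+FFFD
for display is an assumption about what "presentation" means here.

```python
from functools import cached_property


class Utf8Data:
    """Keeps the original bytes as the source of truth."""

    def __init__(self, raw: bytes):
        self.raw = raw                 # never modified, never re-encoded

    @cached_property
    def presentation(self) -> str:
        # Computed on first use and cached alongside the data.
        # Invalid bytes show up as U+FFFD, for display only.
        return self.raw.decode("utf-8", errors="replace")


item = Utf8Data(b"name-\xff.txt")
assert item.presentation == "name-\ufffd.txt"   # what the user sees
assert item.raw == b"name-\xff.txt"             # what gets stored and sent
```

since the display string is never converted back, the lossy replacement
cannot corrupt the data.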

rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/

