yes, i agree. the utf-8b approach is useful mainly when sending binary data through a utf-16 channel with the hope of recovering it at the far side. once byte string or character string manipulations are performed, all bets are off.
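a rough sketch of both halves of that statement, in python. the assumption here is that python's "surrogateescape" error handler (PEP 383) is close enough to utf-8b to stand in for it: it maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF. python strings are sequences of code points rather than utf-16, so this only approximates the utf-16 channel discussed in this thread. the helper names decode_utf8b and encode_utf8b are made up for the example.

```python
# Sketch only: "surrogateescape" stands in for utf-8b here (an assumption,
# see the note above). Each undecodable byte 0x80..0xFF becomes the lone
# surrogate U+DC80..U+DCFF, and encoding reverses the mapping.

def decode_utf8b(data: bytes) -> str:
    """Decode bytes, turning invalid utf-8 bytes into lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")

def encode_utf8b(text: str) -> bytes:
    """Encode text, turning those lone surrogates back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")

# Round trip: arbitrary binary data survives the channel unchanged.
raw = b"abc\xffdef"
assert encode_utf8b(decode_utf8b(raw)) == raw

# String manipulation: split the two bytes of U+00E9 (C3 A9) and decode
# each half on its own. Neither half is valid utf-8 by itself, so each
# becomes an escaped byte.
head, tail = b"\xc3", b"\xa9"
joined_after_decoding = decode_utf8b(head) + decode_utf8b(tail)
decoded_whole = decode_utf8b(head + tail)

# Same bytes, two different strings: decoding does not commute with
# concatenation.
assert joined_after_decoding != decoded_whole
assert encode_utf8b(joined_after_decoding) == encode_utf8b(decoded_whole)
```

the bytes still come back intact, but the string side now has two representations of the same byte sequence, which is where the bets are off.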
On 4/27/07, Rich Felker <[EMAIL PROTECTED]> wrote:
On Fri, Apr 27, 2007 at 12:41:22PM -0700, Ben Wiley Sittler wrote:
> glad it was rejected. the only really sensible approach i have yet
> seen is utf-8b (see my take on it here:
> http://bsittler.livejournal.com/10381.html and another implementation
> here: http://hyperreal.org/~est/utf-8b/ )
>
> the utf-8b approach is superior to many others in that binary is
> preserved, but it does not inject control characters. instead it is an
> extension to utf-8 that allows all byte sequences, both those that are
> valid utf-8 and those that are not. when converting utf-8 <-> utf-16,
> the bytes in invalid utf-8 sequences <-> unpaired utf-16 surrogates.
> the correspondence is 1-1, so data is never lost. valid paired
> surrogates are unaffected (and are used for characters outside the
> bmp.)

this approach is perhaps reasonable for applications that want to use utf-16 internally without corrupting invalid sequences in utf-8, but it has problems too. for example it's not stable under string concatenation or substring operations.

the whole reason utf-8 is usable comes from its self-synchronizing property and the property that one character is never a substring of another character. this necessarily forces the encoding to treat some strings as invalid; that is, it's provably impossible to make an encoding with the required properties where all strings are valid.

as a consequence, any treatment of invalid sequences as if they were 'special characters', like utf-8b does, will break all of the essential properties. for some applications this may not matter; for others it would be disastrous. it's certainly not possible to do such a thing at the C library level (mb*towc family) without causing all sorts of breakage.

my view is that it's best to just leave the data in its original utf-8 form and not do conversions until 'just in time', for presentation, character identification, etc. caching this 'presentation' form alongside the data may be appropriate for many applications.
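to make the two properties concrete, here is a small python sketch. it is only an illustration: char_start is a hypothetical helper, and it assumes its input is valid utf-8.

```python
# Sketch only: illustrates self-synchronization and the substring
# property of valid utf-8. char_start is a hypothetical helper name.

def char_start(data: bytes, i: int) -> int:
    """Back up from byte offset i to the lead byte of its character.

    Continuation bytes always look like 10xxxxxx, so the start of a
    character can be found from any offset without scanning from the
    beginning. Assumes data is valid utf-8.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

# "a", U+00E9, U+20AC encode as: 61 | C3 A9 | E2 82 AC
data = "a\u00e9\u20ac".encode("utf-8")
assert char_start(data, 2) == 1   # offset 2 is inside U+00E9
assert char_start(data, 5) == 3   # offset 5 is inside U+20AC

# Substring property: a byte search for a whole character in valid
# utf-8 can only match on a character boundary.
assert data.find("\u00e9".encode("utf-8")) == 1

# If a lone continuation byte such as A9 is treated as a 'special
# character' of its own, that property is gone: the search below
# matches in the middle of U+00E9.
assert b"\xc3\xa9".find(b"\xa9") == 1
```

the last assertion is the breakage in miniature: once every byte sequence counts as valid, one 'character' can be a substring of another.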
rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/