"limited success of UTF-8"

Becoming the de facto standard encoding EVERYWHERE except for Windows, which uses UTF-16, is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to decode and encode, fast, and space-efficient. Fixed-width encodings are not inherently fast; the only thing they are faster at is random access to the Nth character rather than the Nth byte. In the rare cases where you need to do a lot of that kind of random access, there's always UTF-32...
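
To make that concrete, here's a rough Python sketch (the helper name is mine, purely illustrative): indexing the Nth byte is O(1), finding the Nth character means scanning, and UTF-32 turns it back into a fixed offset.

    data = "naïve café".encode("utf-8")

    nth_byte = data[7]                  # O(1): just index into the buffer

    # Finding the Nth *character* in UTF-8 means walking the bytes and
    # skipping continuation bytes (those of the form 0b10xxxxxx):
    def nth_char_offset(buf, n):
        count = 0
        for i, b in enumerate(buf):
            if b & 0xC0 != 0x80:        # lead byte of a character
                if count == n:
                    return i
                count += 1
        raise IndexError(n)

    # With UTF-32 the offset is simply 4*n, no scan required:
    fixed = "naïve café".encode("utf-32-le")
    third_char = fixed[4*3:4*3+4].decode("utf-32-le")   # 'v'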

Any fixed-width encoding which can encode every Unicode character must use at least 3 bytes (code points go up to U+10FFFF, which already needs 21 bits), and using 4 bytes is probably going to be faster because of alignment, so I don't see what the great improvement over UTF-32 would be.
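
The arithmetic, spelled out in Python for the sake of it:

    MAX_CODE_POINT = 0x10FFFF                     # highest Unicode code point
    bits_needed = MAX_CODE_POINT.bit_length()     # 21
    bytes_needed = (bits_needed + 7) // 8         # 3
    print(bits_needed, bytes_needed)              # 21 3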

slicing does require decoding
Nope - as long as you slice at byte offsets you already have (say, ones a search returned), there's nothing to decode.

I didn't mean that people are literally keeping code pages. I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different planes only exist for organisational purposes; they don't affect how characters are encoded.
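
A quick illustration in Python - the encoder only ever looks at the code point's numeric value, never at which plane it lives in:

    for ch in ("A", "é", "€", "你", "😀"):   # BMP and plane-1 characters mixed
        cp = ord(ch)
        encoded = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        print(f"U+{cp:04X} -> {encoded}")
    # The byte length follows only from the code point's magnitude:
    #   <= U+007F: 1 byte, <= U+07FF: 2, <= U+FFFF: 3, otherwise 4.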

?! It's okay because you deem it "coherent in its scheme?" I deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, i.e. everything that you need to decode a character in the same place, not spread out between part of a character and a header.
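
And that's exactly what UTF-8 does: the lead byte of each character tells you by itself how many bytes follow, so everything a decoder needs is right there in the character's own bytes. A rough Python sketch of the idea (not production code):

    def utf8_char_len(lead):
        """How many bytes the character starting with this lead byte occupies."""
        if lead < 0x80:          return 1   # 0xxxxxxx: plain ASCII
        if lead >> 5 == 0b110:   return 2   # 110xxxxx
        if lead >> 4 == 0b1110:  return 3   # 1110xxxx
        if lead >> 3 == 0b11110: return 4   # 11110xxx
        raise ValueError("not a lead byte (continuation or invalid)")

    buf = "π ≈ 3.14".encode("utf-8")
    i = 0
    while i < len(buf):
        n = utf8_char_len(buf[i])
        print(buf[i:i+n].decode("utf-8"))
        i += n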

but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point, such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc. - basically all the most common text operations - can be done without any encoding or decoding.
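
For example, search, replace and slice can all be done on the raw bytes; the only decode below is for printing (Python, purely illustrative):

    hay = "Grüße, Wörld!".encode("utf-8")
    needle = "Wörld".encode("utf-8")

    i = hay.find(needle)                           # plain byte-wise substring search
    replaced = hay.replace(needle, "World".encode("utf-8"))
    prefix = hay[:i]                               # slicing at a match offset is safe

    # UTF-8 is self-synchronising, so a valid encoded needle can only match
    # at character boundaries; decoding only happens here for display:
    print(prefix.decode("utf-8"), replaced.decode("utf-8"))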
