On Monday 04 March 2002 05:05, Jörg Walter wrote:
> On Sunday, 03. March 2002 21:21, Tod Harter wrote:
> > encode the resulting paths. Realistically even UTF-8 is a hack. ALL the
> > software and standards need to be updated, badly. Ideally all software
> > should be able to deal with any incoming encoding, and really everything
> > should be UTF-16 internally. At least then you have a fighting chance of
> > representing an encoding in a consistent internal form. I'd give it about
> > 40 years...
>
> So UTF-8 is a hack, but UTF-16 not - what makes you think so? Perhaps you
> are uncomfortable with the idea of characters with differing byte length?

It is MUCH easier to write low-level code that deals with fixed-width chars, 
yes. This is fundamentally why i18n is such a big deal: it's a LOT harder to 
reliably write algorithms that have to deal with all sorts of different ways 
of representing data! One standard fundamental data type (short, int, long, 
whatever you want to call it) makes things MUCH easier. Trust me, I've been 
writing software for 20+ years; I know from personal experience!
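
To make the point concrete, here is a rough C sketch (the function names are 
mine, purely illustrative): with a fixed-width buffer the Nth character is 
one array lookup, while UTF-8 forces you to walk every byte in front of it.

    #include <stddef.h>
    #include <stdint.h>

    /* Fixed-width (UCS-4 style): the Nth character is plain array math. */
    uint32_t nth_char_fixed(const uint32_t *buf, size_t n)
    {
        return buf[n];                    /* O(1), no decoding needed */
    }

    /* UTF-8: you can't jump, you have to walk. Continuation bytes
     * look like 10xxxxxx, so skip them while counting lead bytes. */
    const char *nth_char_utf8(const char *s, size_t n)
    {
        while (*s && n > 0) {
            s++;                          /* past the lead byte */
            while (((unsigned char)*s & 0xC0) == 0x80)
                s++;                      /* past continuation bytes */
            n--;
        }
        return s;                         /* O(length) on every lookup */
    }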

> Then I must disappoint you, UTF-16 has surrogate pairs for characters
> beyond plane 0 (the first 65k), and there are already alphabets located in
> plane 1 (a 'Lord of the Rings' book would use it :-).

I actually do know about UCS etc etc etc ;o). I suspect the world will never 
really be too concerned with the data processing of Elvish or Klingon...
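
Not that the mechanics are hard, mind you; decoding a surrogate pair is a 
couple of shifts. A minimal C sketch (the name is mine, not from any library):

    #include <stdint.h>

    /* UTF-16 surrogate pair -> code point. High surrogates are
     * 0xD800-0xDBFF, low ones 0xDC00-0xDFFF; each carries 10 bits,
     * offset by 0x10000 so plane 0 never needs a pair. */
    uint32_t from_surrogates(uint16_t hi, uint16_t lo)
    {
        return 0x10000u + (((uint32_t)(hi - 0xD800u) << 10)
                         | (uint32_t)(lo - 0xDC00u));
    }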

> Actually, UTF-16 is _not_ able to encode the whole 32-bit code space, while
> UTF-8 is. (The practical value of this is rather doubtful, though, as there
> are proposals to clip the 32-bit code space to 21 bit as it ought to be
> enough for every char. And UTF-16 does these 21 bits.)

Exactly, it has zero practical value.
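
For anyone counting along at home, here's a sketch of the original 31-bit 
UTF-8 length rules (per RFC 2279, before any clipping); UTF-16 with 
surrogates stops at 0x10FFFF, i.e. 21 bits:

    #include <stdint.h>

    /* Bytes needed to encode a code point in pre-clipping UTF-8,
     * which ran to 6 bytes and covered the full 31-bit space. */
    int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)       return 1;
        if (cp < 0x800)      return 2;
        if (cp < 0x10000)    return 3;
        if (cp < 0x200000)   return 4;
        if (cp < 0x4000000)  return 5;
        return 6;            /* up to 0x7FFFFFFF */
    }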

> So you should rather use UCS-4 for a true one-to-one length-to-character
> mapping. But wait, then there are composite chars, modifiers, and the
> famous BOM (zero width space), which doesn't count, or at least not really.
> You'll never get your good old bytes=chars behaviour back.

Yeah, actually I bet you any money that's exactly what we will end up with! 
Not least because eventually people will implement a lot of what you call 
text manipulation in hardware, and I can pretty much guarantee you that 
silicon designers are NOT going to mess around with variable-width character 
sets.
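
(The composite chars you mention are a real caveat, though: even in 
fixed-width UCS-4, one user-visible character can be several code points. 
A tiny sketch, values hard-coded by hand:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* One user-perceived character, two UCS-4 code points:
         * U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT. */
        uint32_t e_acute[] = { 0x0065, 0x0301 };
        printf("code points: %u\n",
               (unsigned)(sizeof e_acute / sizeof e_acute[0]));
        return 0;
    }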

>
> And at the end, it is absolutely irrelevant how the stuff is stored
> internally. For transfer, UTF-8 is a great thing, a lot more comprehensible
> to the casual eye than UTF-16.

Uh, "comprehensible to the casual eye"? Its comprehensible because your text 
editor understands UTF-8!!! If it understood only UTF-16 then THAT would be 
"comprehensible to the casual eye". I guarantee you can't tell a memory 
location thats on from off with the naked eye, they're about .1 microns 
across..... ;lo)
