On Tuesday, 05. March 2002 15:58, Tod Harter wrote:

> My point is that programmers have NOT realized that, and it would be much

Why not? Nowadays, lots of programmers don't even know that one char is (or once 
was) one byte - just look at those Point&Click Visual Abomination programmers. 
They don't need to know that, and that is good.

> simpler for them (and for a lot of existing code) to make the leap to 1

Most of the code you are speaking of goes through an API. And even if a program 
uses strlen() in C, that is just a function call; the actual data structure can 
be treated as opaque by most of these programs.
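
A minimal C sketch of what I mean, assuming a UTF-8 string (the literal and the
byte count are only my illustration):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* "héllo" encoded as UTF-8: the 'é' occupies two bytes. */
      const char *s = "h\xc3\xa9llo";

      /* strlen() is just a function call that counts bytes up to the
       * terminating NUL; the caller never looks inside the representation. */
      printf("bytes: %lu\n", (unsigned long)strlen(s));  /* prints 6, not 5 */
      return 0;
  }

Swap the encoding underneath and this call site does not change; that is all the
opaqueness most programs need.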

> level data is bytes. Try as we might to pretend otherwise. Do a little bit
> of assembly programming and you will perhaps not wonder so much why I have

In assembler I usually shovel data around without caring what that data is. You 
usually don't need advanced text processing in assembler. And UTF-8 is rather 
well designed for low-level processing - the top two bits of each byte tell you 
whether it starts a new character, so finding character boundaries is cheap. 
This, among other things, is why UTF-8 is a great design within the constraints 
of staying ASCII-compatible.
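
A rough sketch in C of that boundary check, just to illustrate; the function
name utf8_char_count is my own invention, not from any standard library:

  #include <stddef.h>

  /* Count the characters in a NUL-terminated UTF-8 string by counting the
   * bytes that start a character, i.e. every byte that is not a
   * continuation byte of the form 10xxxxxx. */
  static size_t utf8_char_count(const char *s)
  {
      size_t n = 0;
      for (; *s != '\0'; s++) {
          if (((unsigned char)*s & 0xC0) != 0x80)   /* not a continuation byte */
              n++;
      }
      return n;
  }

You can resynchronize at any byte offset the same way, which is exactly what
makes UTF-8 comfortable even at a low level.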

> And the demand for large quantities of high speed Sanskrit processing is
> pretty limited.... The same applies to historical texts. If a dedicated

The demand for high-speed text processing in general is limited. After all, 
text processing implies that somebody actually reads the text. And if you are 
talking about large-volume processing, as in a library or a web server with 
millions of users, then the assumption that historical texts or complicated 
modern languages are irrelevant is plain ignorance. That is exactly the purpose 
of Unicode: to encode anything you might come across so you don't have to worry 
about the charset. I doubt that 'performance' was one of the major design 
goals. M$ chose UCS-2 as its encoding for performance reasons, but they, too, 
ran into the need to support surrogate pairs.
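
As an illustration of why UCS-2 could not stay fixed-width, here is a sketch of
the UTF-16 surrogate-pair split for a code point above U+FFFF (the example code
point and variable names are mine):

  #include <stdio.h>

  int main(void)
  {
      /* U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
       * Plane, so it does not fit into a single 16-bit UCS-2 unit. */
      unsigned long cp = 0x1D11E;

      /* UTF-16 splits such a code point into two 16-bit surrogates. */
      unsigned long v    = cp - 0x10000;
      unsigned int  high = 0xD800 + (unsigned int)(v >> 10);    /* high surrogate */
      unsigned int  low  = 0xDC00 + (unsigned int)(v & 0x3FF);  /* low surrogate  */

      printf("U+%05lX -> %04X %04X\n", cp, high, low);          /* D834 DD1E */
      return 0;
  }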

> specialized system, and it will probably cost a LOT. Cheap high speed
> processing of common texts in hardware is coming, and I'd still bet that it
> will be done with fixed width charsets.

Hello? Doesn't this argument run a bit differently? After all, this is text we 
are talking about. Usually the words 'hardware' and 'video' occur close 
together... or even 'hardware' and 'IP'... but where would you actually need 
hardware text processing?

> world had simply said to itself 10 years ago "OK, chars are now 32 bits"
> that it would have taken a LOT less time for systems to adapt. Obviously

But they didn't. And at some point even 32-bit chars will not suffice - these 
are earthlings we are talking about, and they always get strange ideas. Maybe 
they'd want to encode composite characters as one code point per glyph, without 
combining modifiers. I believe we may well face that problem. Compatibility is 
usually a worthy goal, as every software system changes some day.

-- 
CU
        Joerg

PGP Public Key at http://ich.bin.kein.hoschi.de/~trouble/public_key.asc
PGP Key fingerprint = D34F 57C4 99D8 8F16 E16E  7779 CDDC 41A4 4C48 6F94

