On Tuesday, 05. March 2002 15:58, Tod Harter wrote:
> My point is that programmers have NOT realized that, and it would be much
Why not? Nowadays, lots of programmers don't even know that 1 char is (or once
was) 1 byte - look at the point-and-click Visual Abomination programmers. They
don't need to know that, and that is good.
> simpler for them (and for a lot of existing code) to make the leap to 1
Most of the code you are speaking of uses an API. And even if it calls
strlen() in C, that is just a function call. The actual data structure can be
assumed opaque for most of these programs.
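
To illustrate the point about opaque data, here is a small Python sketch (not
C, and the sample string is my own choice): the caller asks the API for a
length and never sees how many bytes sit behind each character.

```python
# Sample text with two non-ASCII characters (i-diaeresis and e-acute).
text = "na\u00efve caf\u00e9"

# The string API counts characters (code points), whatever the storage is.
char_count = len(text)

# Only when we explicitly pick an encoding do bytes enter the picture.
byte_count = len(text.encode("utf-8"))

print(char_count, byte_count)
```

The two counts differ because each accented character takes two bytes in
UTF-8, yet code that only calls len() on the string never has to care.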
> level data is bytes. Try as we might to pretend otherwise. Do a little bit
> of assembly programming and you will perhaps not wonder so much why I have
In assembler I usually shovel data around without caring what that data is. You
usually don't need advanced text processing in assembler. And UTF-8 is rather
well-designed for low-level processing - you only have to test the top two bits
of a byte to determine character boundaries. This, and much else, is why UTF-8
is a great design within the constraint of being ASCII-compatible.
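
A minimal sketch of that boundary check, in Python for brevity. In UTF-8 every
continuation byte has the form 10xxxxxx, so any byte that does not match that
pattern starts a character. The function names are mine, purely for
illustration.

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


def char_starts(data):
    """Return the byte offsets at which a character begins."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


# 'a' (1 byte), e-acute (2 bytes), euro sign (3 bytes)
data = "a\u00e9\u20ac".encode("utf-8")
starts = char_starts(data)
print(starts)
```

Testing only the topmost bit is enough to separate plain ASCII from everything
else; the second bit tells a lead byte from a continuation byte.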
> And the demand for large quantities of high speed Sanskrit processing is
> pretty limited.... The same applies to historical texts. If a dedicated
The demand for high speed text processing in general is limited. After all,
text processing implies that there is somebody who reads that text. If you
are talking about large volume processing, as in a library or a
web server with millions of users, your assumption that historical texts or
complicated modern languages are irrelevant is plain ignorance. This is exactly
the purpose of Unicode: to encode anything you might come across so you
don't have to worry about the charset. I doubt that 'performance' was one of
the major design goals. M$ chose UCS-2 as encoding for performance reasons,
but they, too, ended up having to support surrogate pairs.
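
For anyone who has not met surrogate pairs: a code point above U+FFFF does not
fit into one 16-bit unit, so UTF-16 splits it into two. A rough Python sketch
of the arithmetic (the function name is made up for this example):

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
pair = to_surrogates(0x1D11E)
print([hex(unit) for unit in pair])
```

So a "fixed width" 16-bit encoding stops being fixed width as soon as such
characters show up.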
> specialized system, and it will probably cost a LOT. Cheap high speed
> processing of common texts in hardware is coming, and I'd still bet that it
> will be done with fixed width charsets.
Hello? Doesn't that line usually go a bit differently? After all, this is text
we are talking about. Usually the words 'hardware' and 'video' occur close
together... or even 'hardware' and 'IP'... but where would you need hardware
text processing?
> world had simply said to itself 10 years ago "OK, chars are now 32 bits"
> that it would have taken a LOT less time for systems to adapt. Obviously
But they didn't. And at some point, even 32-bit chars will not suffice - these
are earthlings we are talking about, they always get strange ideas. Maybe
they'd give every composite char its own code point, one per glyph, instead of
using modifiers.
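
To show what "composite char versus modifier" means today, a short Python
sketch using the standard unicodedata module; the sample character is my own
pick.

```python
import unicodedata

# One code point: LATIN SMALL LETTER E WITH ACUTE
precomposed = "\u00e9"

# Two code points: 'e' followed by COMBINING ACUTE ACCENT
decomposed = unicodedata.normalize("NFD", precomposed)

# Both render as the same glyph, but the lengths differ.
print(len(precomposed), len(decomposed))

# Normalizing back to NFC recombines them into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Giving every possible combination its own code point would remove the need
for modifiers, at the price of a much larger code space.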
I believe that we could face this problem. Compatibility is usually a worthy
goal, as every software system changes some day.
--
CU
Joerg
PGP Public Key at http://ich.bin.kein.hoschi.de/~trouble/public_key.asc
PGP Key fingerprint = D34F 57C4 99D8 8F16 E16E 7779 CDDC 41A4 4C48 6F94