On Monday 04 March 2002 14:52, Jörg Walter wrote:
> On Monday, 04. March 2002 19:59, Tod Harter wrote:
> >
> Ah, well, then. Why bother? Use wchar_t in C. Use 'my $foo' in perl. Use
> 'string foo' in python. There you have your fundamental data type. If all
> you're talking about is the API, then just leave the transcoding to others.
> Perl does handle all that stuff for you already. C does, too (if you learn
> the wcs stuff, and if it works, which I don't know). The API problem is NOT
> '1 char != N bytes' but '1 char != 1 byte' - when programmers realize that
> bytes are something different from chars, there is no problem anymore,
> because then programmers think more abstractly, in terms of chars instead of
> bytes. The actual encoding is irrelevant then. And by the way, resistance
> is futile.

My point is that programmers have NOT realized that, and it would be much
simpler for them (and for a lot of existing code) to make the leap to
"1 char = 4 bytes" or "1 char = 2 bytes" than to "1 char = ? bytes". At some
level data is bytes, try as we might to pretend otherwise. Do a little bit of
assembly programming and you will perhaps not wonder so much why I hold this
opinion. Assumptions about data representation tend to creep into systems at
a wide variety of levels, and simpler representations will always be more
reliable. UTF-8 is a more complex representation than any fixed-width
character representation; that's all my point is.
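
To make the difference concrete, here is a rough sketch in C (the function
names are made up for illustration, and I'm assuming a platform where
wchar_t is 32 bits, as it is with glibc). With a fixed-width representation
the Nth character is a plain array index; with UTF-8 you have to walk the
bytes just to find where it starts:

  /* Not a complete UTF-8 decoder, just enough to show the extra bookkeeping. */
  #include <stddef.h>
  #include <wchar.h>

  /* Fixed width: the Nth character is a plain array index. */
  wchar_t nth_char_fixed(const wchar_t *s, size_t n)
  {
      return s[n];
  }

  /* UTF-8: walk the bytes, skipping continuation bytes (10xxxxxx),
     just to find where character N begins. */
  const char *nth_char_utf8(const char *s, size_t n)
  {
      while (*s && n > 0) {
          s++;                           /* step past the lead byte */
          while ((*s & 0xC0) == 0x80)    /* ...and any continuation bytes */
              s++;
          n--;
      }
      return s;
  }

The fixed-width version is constant time and hard to get wrong; the UTF-8
version is linear, and it is exactly the kind of loop that gets botched in
subtle ways.
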
>
> > Yeah, actually I bet you any money that's exactly what we will end up
> > with! Not least because eventually people will implement a lot of
> > what you call text manipulation in hardware, and I can pretty much
> > guarantee you that silicon designers are NOT going to mess around with
> > variable-width character sets.
>
> They will have to. Talk of Sanskrit - you simply cannot encode this
> language with one code point per glyph. The language is too complex. And
> yes, even this has practical value, as I sometimes work for an Indic
> professor. And with the rise of the Asian technology market, it will become
> more and more important.

And the demand for large quantities of high-speed Sanskrit processing is
pretty limited... The same applies to historical texts. If a dedicated
system for those languages is needed, it will be a unique, highly
specialized system, and it will probably cost a LOT. Cheap high-speed
processing of common texts in hardware is coming, and I'd still bet that it
will be done with fixed-width charsets.
>
> I'm not talking about a text editor. I'm talking about a log file entry
> with mixed content. I'm talking about a packet dump of a network protocol
> for profiling or debugging. I'm talking about messing with proprietary data
> formats (Flash, Word, whatever) to change stuff inside - though I dislike
> that work, it is sometimes necessary. And I am talking about a text editor
> - the one you get when you are somewhere else and just want to take a quick
> look at what's going on, the one that doesn't do UTF-8, let alone UTF-16.
> Imagine why XML was made a text-based format, not a binary one - because it
> is comprehensible to the naked eye. So is UTF-8.

And that just illustrates my point: encoding dependencies tend to crop up in
LOTS of places, all the time. I argue that if the software engineering world
had simply said to itself 10 years ago "OK, chars are now 32 bits", it would
have taken a LOT less time for systems to adapt. Obviously there are valid
objections to that; I just tend to wonder whether the right decision was
made. I have this feeling that eventually, as storage and bandwidth become
less relevant, things will tend to focus on speed and reliability, and in
that realm simple encodings are better.
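
As a trivial illustration of that kind of creeping assumption, here is a
little sketch in C (it assumes the program is run under a UTF-8 locale; the
exact numbers depend on the locale and on the encoding of the string):

  /* Sketch: strlen() counts bytes, mbstowcs() counts characters.
     With a fixed 32-bit char the two could never disagree. */
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      setlocale(LC_ALL, "");           /* use the environment's locale */
      const char *s = "J\xc3\xb6rg";   /* "Joerg" with o-umlaut, as UTF-8 bytes */

      printf("bytes: %zu\n", strlen(s));             /* 5 */
      printf("chars: %zu\n", mbstowcs(NULL, s, 0));  /* 4 in a UTF-8 locale */
      return 0;
  }

Any code that confuses those two numbers - buffer sizes, column counts,
length fields in a protocol - works fine right up until someone feeds it a
non-ASCII character.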
