On Monday 04 March 2002 14:52, Jörg Walter wrote:
> On Monday, 04. March 2002 19:59, Tod Harter wrote:
> >
> Ah, well, then. Why bother? Use wchar_t in C. Use 'my $foo' in perl. Use
> 'string foo' in python. There you have your fundamental data type. If all
> you're talking about is the API, then just leave the transcoding to others.
> Perl does handle all that stuff for you already. C does, too (if you learn
> the wcs stuff, and if it works, which I don't know). The API problem is NOT
> '1 char != N bytes' but '1 char != 1 byte' - when programmers realize that
> bytes are something different than chars, there is no problem anymore,
> because then programmers think more abstractly, in terms of chars instead
> of bytes. The actual encoding is irrelevant then. And by the way,
> resistance is futile.
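(To make the byte-versus-char distinction concrete, here is a rough sketch of
the "wcs stuff" in standard C - my own illustration, not anything from the
original mail, and it assumes the program runs under a UTF-8 locale:)

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* Pick up the environment's locale; under a UTF-8 locale one
           character may occupy several bytes in a char string.        */
        setlocale(LC_CTYPE, "");

        const char *utf8 = "na\xc3\xafve";   /* "naive" with i-diaeresis */
        wchar_t wide[16];

        /* strlen() counts bytes; mbstowcs() converts to one wchar_t per
           character, so the two lengths differ for non-ASCII text.      */
        size_t bytes = strlen(utf8);
        size_t chars = mbstowcs(wide, utf8, 16);

        printf("bytes: %zu, chars: %zu\n", bytes, chars);   /* 6 vs. 5 */
        return 0;
    }

(Code that keeps the two lengths separate never has to care which encoding is
in use - which is the abstraction being argued for above.)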
My point is that programmers have NOT realized that, and it would be much
simpler for them (and for a lot of existing code) to make the leap to
1 char = 4 bytes or 1 char = 2 bytes than to "1 char = ? bytes". At some
level data is bytes, try as we might to pretend otherwise. Do a little bit
of assembly programming and you will perhaps not wonder so much why I hold
this opinion. Assumptions about data representation tend to creep into
systems at a wide variety of levels, and simpler representations will
always be more reliable. UTF-8 is a more complex representation than any
fixed-width character representation; that's all my point is.

>
> > Yeah, actually I bet you any money that's exactly what we will end up
> > with! Not the least because eventually people will implement a lot of
> > what you call text manipulation in hardware, and I can pretty much
> > guarantee you that silicon designers are NOT going to mess around with
> > variable width character sets.
>
> They will have to. Talk of Sanskrit - you simply cannot encode this
> language with one code point per glyph. The language is too complex. And
> yes, even this has practical value, as I sometimes work for an Indic
> professor. And with the rise of the Asian technology market, it will
> become more and more important.

And the demand for large quantities of high-speed Sanskrit processing is
pretty limited... The same applies to historical texts. If a dedicated
system for those languages is needed, then it will be a unique, highly
specialized system, and it will probably cost a LOT. Cheap high-speed
processing of common texts in hardware is coming, and I'd still bet that it
will be done with fixed-width charsets.

>
> I'm not talking about a text editor. I'm talking about a log file entry
> with mixed content. I'm talking about a packet dump of a network protocol
> for profiling or debugging. I'm talking about messing with proprietary
> data formats (Flash, Word, whatever) to change stuff inside - though I
> dislike that work, it is sometimes necessary. And I am talking about a
> text editor - the one you get when you are somewhere else and just want
> to take a quick look at what's going on, the one that doesn't do UTF-8,
> let alone UTF-16. Imagine why XML was made a text-based format, not a
> binary one - because it is comprehensible to the naked eye. So is UTF-8.

And that just illustrates my point: encoding dependencies tend to crop up
in LOTS of places all the time. I argue that if the software engineering
world had simply said to itself 10 years ago "OK, chars are now 32 bits",
it would have taken a LOT less time for systems to adapt. Obviously there
are valid objections to that; I just tend to wonder if the right decision
was made. I have a feeling that eventually, as storage and bandwidth become
less relevant, things will tend to focus on speed and reliability, and in
that realm simple encodings are better.
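(To put a concrete face on "simpler representations are more reliable" -
again just my own illustrative sketch, with made-up function names, nothing
from the original thread: finding the Nth character in a fixed-width UCS-4
buffer is a single array lookup, while the same operation on UTF-8 has to
walk the string and test for continuation bytes:)

    #include <stddef.h>
    #include <stdint.h>

    /* Fixed width (UCS-4 / 32-bit chars): the Nth character is one
       array access, with no decoding logic at all.                   */
    uint32_t nth_char_ucs4(const uint32_t *buf, size_t n)
    {
        return buf[n];
    }

    /* Variable width (UTF-8): to find the Nth character we must scan
       from the start, skipping continuation bytes (10xxxxxx).        */
    const char *nth_char_utf8(const char *s, size_t n)
    {
        while (*s) {
            if (((unsigned char)*s & 0xC0) != 0x80) {  /* start byte */
                if (n == 0)
                    return s;
                n--;
            }
            s++;
        }
        return NULL;   /* the string has fewer than n+1 characters */
    }

(The UTF-8 version is still short, but it is a loop with a bit-mask test
rather than a plain index - exactly the kind of extra moving part the
argument above is worried about.)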
