On Fri, Feb 11, 2011 at 12:59:46PM +0100, Klaus Ethgen wrote:
> On Fri, 11 Feb 2011 at 10:37, Lars Wirzenius wrote:
> > The first Unicode standard was published in 1991. That's twenty years
> > ago. Any software that processes text at all and is incapable of dealing
> > with UTF-8 should be considered with extreme suspicion. Making all such
> > bugs be release critical (which includes the notion that release
> > managers may ignore the bug in particular cases) sounds like a good way
> > to get things under control.
>
> I think you are mixing stuff together. First there is unicode. There are
> several definitions for unicode (unicode-16, unicode-32, ...) but UTF-8
> is not unicode it is just one implementation of unicode and in my eyes
> the most problematic as it has undefined states and is variable length.

There is just one definition of Unicode; new versions merely add extra
characters, collating rules, etc.

There are several ways to represent Unicode as a stream of bytes. Only one
of them is fit for external storage, and that's UTF-8, since it doesn't
break the assumptions that are true for text files:
1. no null bytes
2. basic newlines, etc are always newlines, never a part of a bigger
   character (not true for some ancient multibyte encodings)
3. not affected by endianness or any other internal detail

Also, _all_ Unicode encodings are of variable length: even in UTF-32, a
single character can span several code points.

> However, UTF-8 was created to allow using unicode in non-unicode
> environments. For me that was always a pointless plan and the unreadable
> UTF-8 characters all around buggy software that cannot handle encodings
> correct (and there are many around) and ignorant users who are using
> UTF-8 in environments that are not specified for multibyte charsets
> (IRC) is the most annoying one.

UTF-8 was never meant as merely a tool to "allow using unicode in
non-unicode environments".

UTF-32 is useful only as an internal representation, when you do care about
a string of code points. Since a single character can consist of multiple
such code points, it doesn't give you much unless you have to pass every
code point through a function like wcwidth() -- i.e., you are implementing
something low-level which cares about properties of characters and their
parts. You should never place UTF-32 into external storage that is not
private to your program or that can possibly be moved.

UTF-16 is never, ever useful. It is a sad trap for win32 and Java
developers, due to a bad engineering decision suggested, as I was told, by
delegates from Microsoft and Sun, who wanted to "conserve disk space and
memory" by storing code points and a language tag separately -- i.e.,
exactly the thing Unicode was supposed to rid us of.
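
The three text-file assumptions listed above, and the point that one
character can span several code points, can be checked with a short Python
sketch. This is only an illustration: the function names and the sample
strings are made up for this example, and it relies on nothing beyond
Python's built-in codecs.

```python
# Illustrative sketch only: helper names and sample strings are hypothetical.

# 'A', 'Ċ' (U+010A) and a newline. U+010A is chosen because its UTF-16 and
# UTF-32 forms contain a 0x0A byte that is not a newline.
SAMPLE = "A\u010a\n"


def text_file_properties(text: str, encoding: str) -> dict:
    """Check assumptions 1 and 2 for one encoding of `text`."""
    data = text.encode(encoding)
    return {
        # 1. no null bytes in the byte stream
        "no_null_bytes": b"\x00" not in data,
        # 2. every 0x0A byte is a real newline, never part of another character
        "newline_bytes_are_newlines": data.count(b"\n") == text.count("\n"),
    }


def byte_order_matters(text: str, family: str) -> bool:
    """Assumption 3: do the -le and -be variants of `family` differ?

    Only meaningful for "utf-16" and "utf-32"; UTF-8 has no such variants.
    """
    return text.encode(family + "-le") != text.encode(family + "-be")


def sizes(text: str) -> dict:
    """Number of code points and encoded byte lengths for `text`."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, text_file_properties(SAMPLE, enc))

print("utf-16 byte order matters:", byte_order_matters(SAMPLE, "utf-16"))
print("utf-32 byte order matters:", byte_order_matters(SAMPLE, "utf-32"))

# 'e' followed by a combining acute accent: one character to the reader,
# two code points in every encoding, UTF-32 included.
print("e + combining acute:", sizes("e\u0301"))

# A code point outside the 16-bit range needs two 16-bit units in UTF-16.
print("U+1F600:", sizes("\U0001F600"))
print("A:", sizes("A"))
```

For this sample, only the UTF-8 form passes both checks; the UTF-16 and
UTF-32 forms contain null bytes and a stray 0x0A byte inside U+010A, and
they change with byte order.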

Even on day one, it was known that you can't fit all characters into 16
bits, and the decision to put all "rare characters" into a "private" area
that needs out of band information was pretty ridiculous. The end result
is, you have an encoding with all downsides of UTF-8 but none of the
advantages.

Since neither UTF-16 nor UTF-32 can be considered text, the decision all
UNIX systems made was to use UTF-8 in the libc's API in all Unicode
locales. Otherwise, you'd need separate APIs like FooBarA()/FooBarW() on
Windows, which cause no end of problems.

> So specifying to be UTF-8 capable is somewhat inconsequent. Software has
> to be capable to handle every encoding as long as they are specified for
> that encodings.

No, there is only one encoding left, as long as you don't have to talk to
Windows. We can start purging away all the support for ancient charsets in
places that do not need to handle foreign data. Debian has used UTF-8 as
default for 5 releases already, and if you try to use an ancient locale, do
not expect good results since no one bothers fixing bugs there. And
maintaining unused code costs time and causes a risk of bugs, so good
riddance!

-- 
1KB  // Microsoft corollary to Hanlon's razor:
     //   Never attribute to stupidity what can be
     //   adequately explained by malice.