Re: default character encoding for everything in debian
Hi,

(I want to see as much UTF-8 support as possible. These days, it is not bad. Try using sed with UTF-8. It works! Of course, with some understandable glitches.)

On Mon, Aug 10, 2009 at 08:55:27PM +0200, Norbert Preining wrote:
> On Mo, 10 Aug 2009, Roger Leigh wrote:
> > Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.
>
> Rubbish. You know why in Japan and other Asian countries UTF8 is not so common? Because many of their glyphs need 4 (four!) bytes, while for example jis-2022 (AFAIR) is much more compact.

Hmmm... not the best example here, technically, if you are talking about size. We have too many encodings for Japanese, and you see too many ESC codes in jis-2022.

> We are not living in an ASCII world anymore.

True.

Our choice of encoding does not have much to do with size. It is inertia and backward compatibility.

FACTS:

Many Japanese e-mails use jis-2022 for compatibility. (E-mail was safe only for 7-bit data in the old days.)

As far as data size goes, the compact popular ones are EUC (Unix) and S-JIS (MS systems). These are still used in web pages etc. They are as small as the UTF-16/UCS-2 used internally for much Unicode data.

But please note that new Macs and XP/Vista/... use Unicode, and I see many files can be in UTF-8. So things are changing.

Osamu

-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
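The size comparison in this message can be checked with a small sketch. It is only an illustration: the sample string and the Python codec names are assumptions chosen here, not something taken from the thread. For this sample, UTF-8 needs three bytes per character rather than four, EUC, S-JIS and UTF-16 need two, and jis-2022 pays extra for its ESC sequences.

```python
# Byte counts for one short Japanese string in the encodings named above.
# The sample text and the Python codec names are illustrative assumptions.
text = "日本語"  # three kanji, all inside the Basic Multilingual Plane

codecs = ["utf_8", "euc_jp", "shift_jis", "iso2022_jp", "utf_16_le"]
sizes = {name: len(text.encode(name)) for name in codecs}

for name, size in sizes.items():
    print(f"{name:12s} {size} bytes")
```

The result depends on the text: mostly-ASCII documents come out smaller in UTF-8 than in UTF-16, so a single string is not a general benchmark.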
Re: default character encoding for everything in debian
Bastian Blank wrote:
> On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> > In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> > > Not necessarily. Any sane implementation should just use wchar_t
> >
> > Which could be UTF16 and therefore still has complicated length semantics.
>
> No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like Windows).

No, wchar_t is locale dependent (per POSIX).

BTW, on gcc:

  -fwide-exec-charset=charset
      Set the wide execution character set, used for wide string and character constants. The default is UTF-32 or UTF-16, whichever corresponds to the width of wchar_t. As with -fexec-charset, charset can be any encoding supported by the system's iconv library routine; however, you will have problems with encodings that do not fit exactly in wchar_t.

Note that the default encoding is UTF-8, thus giving a UTF-32 wchar_t on most developer machines.

ciao
cate
Re: default character encoding for everything in debian
Giacomo A. Catenazzi, on Wed 12 Aug 2009 07:54:33 +0200, wrote:
> Samuel Thibault wrote:
> > Gunnar Wolf, on Tue 11 Aug 2009 13:28:08 -0500, wrote:
> > > while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.
> >
> > Not necessarily. Any sane implementation should just use wchar_t and subtraction is back.
>
> An implementation that uses wchar_t is usually not sane, but usually it is (also) buggy.

Why? It's just about using wide functions instead of the usual functions.

> PS: note that the binary encoding depends on the compiler environment (but such info is not exported).

See my other mail. A lot of things can be made to depend on the compiler environment.

Samuel
Re: default character encoding for everything in debian
Giacomo A. Catenazzi, on Wed 12 Aug 2009 08:03:30 +0200, wrote:
> Bastian Blank wrote:
> > On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> > > In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> > > > Not necessarily. Any sane implementation should just use wchar_t
> > >
> > > Which could be UTF16 and therefore still has complicated length semantics.
> >
> > No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like Windows).
>
> No, wchar_t is locale dependent (per POSIX).

What do you mean? The compiler can't know the locale in advance for the width and endianness. The value might depend on the locale, yes, but that's not a problem as long as you convert into UTF-8 before communicating with other applications. On sane systems (Debian systems are), it's just always UCS-4.

> BTW on gcc: -fwide-exec-charset=charset Set the wide execution character set, used for wide string and character constants.

It hurts when I shoot myself in the foot.

> The default is UTF-32 or UTF-16, whichever corresponds to the width of wchar_t.

This documentation is bogus BTW. It should read UCS-4 or UCS-2.

> Note that default encoding is UTF-8, thus giving a UTF-32 wchar_t in most developer machines.

I don't understand this sentence.

Samuel
Re: default character encoding for everything in debian
On Wed, Aug 12, 2009 at 09:56:49AM +0200, Samuel Thibault wrote:
> Giacomo A. Catenazzi, on Wed 12 Aug 2009 08:03:30 +0200, wrote:
> > Bastian Blank wrote:
> > > On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> > > > In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> > > > > Not necessarily. Any sane implementation should just use wchar_t
> > > >
> > > > Which could be UTF16 and therefore still has complicated length semantics.
> > >
> > > No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like Windows).
> >
> > No, wchar_t is locale dependent (per POSIX).
>
> What do you mean? The compiler can't know the locale in advance for the width and endianness. The value might depend on the locale, yes, but that's not a problem as long as you convert into UTF-8 before communicating with other applications. On sane systems (Debian systems are), it's just always UCS-4.

Specifically, __STDC_ISO_10646__ is defined to indicate that wchar_t is always UCS-4 in all locales.

> > BTW on gcc: -fwide-exec-charset=charset Set the wide execution character set, used for wide string and character constants.
>
> It hurts when I shoot myself in the foot.

This feature of GCC is one of the more obscure areas of locale handling. How does the encoding of strings at the level of individual translation units work with a single per-process global locale and C formatted I/O? Curious minds would like to know!

> > The default is UTF-32 or UTF-16, whichever corresponds to the width of wchar_t.
>
> This documentation is bogus BTW. It should read UCS-4 or UCS-2.

It's strictly correct according to the standard. See http://en.wikipedia.org/wiki/UTF-32/UCS-4 for an overview.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
Re: default character encoding for everything in debian
On Wed, Aug 12, 2009 at 07:54:33AM +0200, Giacomo A. Catenazzi wrote:
> Samuel Thibault wrote:
> > Gunnar Wolf, on Tue 11 Aug 2009 13:28:08 -0500, wrote:
> > > while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.
> >
> > Not necessarily. Any sane implementation should just use wchar_t and subtraction is back.
>
> An implementation that uses wchar_t is usually not sane, but usually it is (also) buggy. It is very difficult (AFAIK not impossible, but I'm not so sure) to write portable (POSIX way, so with changing locales) programs using wchar_t.

Do you have any concrete examples to back up these assertions? They worked perfectly well for me last time I checked. There were bugs in the distant past, but I don't see any issues with current GCC/libc.

BTW, since POSIX/SUS are a superset of the standard C library, they contain all of the same wide character handling functionality. I'm not sure what you're getting at with the changing locales; SUS locale functionality like setlocale() comes directly from C with no changes.

Regards,
Roger
Re: default character encoding for everything in debian
It's impressive how quickly threads on this list grow big. :-)

I'm not sure whether a conclusion has already been reached.

1. apt-get install mysql
2. enter mysql client
3. create database test; create table test( test char(10) );

Replace mysql with whatever application you like. What should be the encoding of database and table test in cases like the above? Currently it's iso-something, discriminating against everybody from other countries. If it were utf-8 instead, it would have at least two advantages:

- The clueless user would get a sane default
- utf-8 isn't as discriminating as iso-8859-1

Best regards,
Thomas Koch

> Hi,
>
> I have an issue: I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug against tomcat, I want to ask (and invite discussion): shouldn't utf8 be the default character set everywhere? Then when installing a package from Debian I could assume that wherever a character encoding can be set, it's set to utf8. MySQL would be another example, which to my knowledge uses isoXYZ as its default character encoding.
>
> Best regards,
> Thomas Koch, http://www.koch.ro

Thomas Koch, http://www.koch.ro
Re: default character encoding for everything in debian
On Wed, Aug 12, 2009 at 01:18:12PM +0200, Thomas Koch wrote:
> I'm not sure whether a conclusion has already been reached.
>
> 1. apt-get install mysql
> 2. enter mysql client
> 3. create database test; create table test( test char(10) );
>
> Replace mysql with whatever application you like. What should be the encoding of database and table test in cases like the above? Currently it's iso-something, discriminating against everybody from other countries. If it were utf-8 instead, it would have at least two advantages:
>
> - The clueless user would get a sane default
> - utf-8 isn't as discriminating as iso-8859-1

UTF-8 is the sane default choice in this situation, so long as MySQL is capable of handling it.

Regards,
Roger
Re: default character encoding for everything in debian
Roger Leigh, on Wed 12 Aug 2009 11:30:50 +0100, wrote:
> > > The default is UTF-32 or UTF-16, whichever corresponds to the width of wchar_t.
> >
> > This documentation is bogus BTW. It should read UCS-4 or UCS-2.
>
> It's strictly correct according to the standard. http://en.wikipedia.org/wiki/UTF-32/UCS-4 for an overview.

« except that the UTF-32 standard has additional Unicode semantics. »

In UTF-32 mode, gcc introduces a BOM, and in UTF-16 mode it allows, without warnings, characters after U+.

Samuel
Re: default character encoding for everything in debian
On Wed, 12 Aug 2009 13:03:30 +0100 Roger Leigh rle...@codelibre.net wrote:
> On Wed, Aug 12, 2009 at 01:18:12PM +0200, Thomas Koch wrote:
> > I'm not sure whether a conclusion has already been reached.
> >
> > 1. apt-get install mysql
> > 2. enter mysql client
> > 3. create database test; create table test( test char(10) );
> >
> > Replace mysql with whatever application you like. What should be the encoding of database and table test in cases like the above? Currently it's iso-something, discriminating against everybody from other countries. If it were utf-8 instead, it would have at least two advantages:
> >
> > - The clueless user would get a sane default
> > - utf-8 isn't as discriminating as iso-8859-1
>
> UTF-8 is the sane default choice in this situation, so long as MySQL is capable of handling it.

Is that a real problem? Usually applications that use a SQL DB come with some script to set up the schema. If they want UTF-8, they will create a table with UTF-8 encoding. I wouldn't change MySQL's default without reason, because old scripts might rely on that behaviour.

Those applications, however, should be configured to use UTF-8 by default (if they support it), and their DB setup scripts accordingly.

Cheers,
harry
Re: default character encoding for everything in debian
On Wed, Aug 12, 2009 at 11:44:36PM +0200, Harald Braumann wrote:
> On Wed, 12 Aug 2009 13:03:30 +0100 Roger Leigh rle...@codelibre.net wrote:
> > On Wed, Aug 12, 2009 at 01:18:12PM +0200, Thomas Koch wrote:
> > > I'm not sure whether a conclusion has already been reached.
> > >
> > > 1. apt-get install mysql
> > > 2. enter mysql client
> > > 3. create database test; create table test( test char(10) );
> > >
> > > Replace mysql with whatever application you like. What should be the encoding of database and table test in cases like the above? Currently it's iso-something, discriminating against everybody from other countries. If it were utf-8 instead, it would have at least two advantages:
> > >
> > > - The clueless user would get a sane default
> > > - utf-8 isn't as discriminating as iso-8859-1
> >
> > UTF-8 is the sane default choice in this situation, so long as MySQL is capable of handling it.
>
> Is that a real problem? Usually applications that use a SQL DB come with some script to set up the schema. If they want UTF-8, they will create a table with UTF-8 encoding. I wouldn't change MySQL's default without reason, because old scripts might rely on that behaviour.

Those old scripts which don't specify an encoding *are already buggy* due to not saying what they want, implying that the default (whatever that might be) is fine.

There's the possibility that this might cause some problems, but they are problems in the script, not in MySQL. Continuing to use an obsolete encoding like Latin 1 (or whatever the default currently is) prevents any breakage, but at the expense of moving to a sane default for the future.

> Those applications, however, should be configured to use UTF-8 by default (if they support it), and their DB setup scripts accordingly.

They should indeed, but if they don't then they need to explicitly spell out what they *do* support.

Regards,
Roger
Re: default character encoding for everything in debian
On Thu, 13 Aug 2009 02:03:43 +0100 Roger Leigh rle...@codelibre.net wrote:
> On Wed, Aug 12, 2009 at 11:44:36PM +0200, Harald Braumann wrote:
> > On Wed, 12 Aug 2009 13:03:30 +0100 Roger Leigh rle...@codelibre.net wrote:
> > > On Wed, Aug 12, 2009 at 01:18:12PM +0200, Thomas Koch wrote:
> > > > I'm not sure whether a conclusion has already been reached.
> > > >
> > > > 1. apt-get install mysql
> > > > 2. enter mysql client
> > > > 3. create database test; create table test( test char(10) );
> > > >
> > > > Replace mysql with whatever application you like. What should be the encoding of database and table test in cases like the above? Currently it's iso-something, discriminating against everybody from other countries. If it were utf-8 instead, it would have at least two advantages:
> > > >
> > > > - The clueless user would get a sane default
> > > > - utf-8 isn't as discriminating as iso-8859-1
> > >
> > > UTF-8 is the sane default choice in this situation, so long as MySQL is capable of handling it.
> >
> > Is that a real problem? Usually applications that use a SQL DB come with some script to set up the schema. If they want UTF-8, they will create a table with UTF-8 encoding. I wouldn't change MySQL's default without reason, because old scripts might rely on that behaviour.
>
> Those old scripts which don't specify an encoding *are already buggy* due to not saying what they want, implying that the default (whatever that might be) is fine.

Agreed. Still no need to break them on purpose.

> There's the possibility that this might cause some problems, but they are problems in the script, not in MySQL. Keeping using an obsolete encoding like Latin 1 (or whatever the default currently is) prevents any breakage, but at the expense of moving to a sane default for the future.

I really don't care too much about the specific case of MySQL, as I hardly ever create or manipulate SQL data by hand. All I was saying, and you seem to be saying as well, if I understand you correctly, is that it is the duty of the application that creates and uses SQL tables to specify the encoding, if it cares about it.

If the application does that, it will work, no matter what default is specified for MySQL. So this specific case is a non-issue, IMO, and MySQL's default doesn't need to be changed. But if it is, just for the sake of it, then that's fine with me. Some scripts might break, but OK.

> > Those applications, however, should be configured to use UTF-8 by default (if they support it), and their DB setup scripts accordingly.
>
> They should indeed, but if they don't then they need to explicitly spell out what they *do* support.

They should.

Cheers,
harry
Re: default character encoding for everything in debian
Norbert Preining said [Mon, Aug 10, 2009 at 08:55:27PM +0200]:
> On Mo, 10 Aug 2009, Roger Leigh wrote:
> > Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.
>
> Rubbish. You know why in Japan and other Asian countries UTF8 is not so common? Because many of their glyphs need 4 (four!) bytes, while for example jis-2022 (AFAIR) is much more compact.
>
> We are not living in an ASCII world anymore.

It's not that much about the size as it is about backwards compatibility. We users of Latin-based alphabets migrate easily to UTF8, with occasional problems where we use diacritics. East Asian encodings are _completely_ incompatible with UTF8, so it is just not possible to tolerate broken text every now and then. Everything just breaks completely.

-- 
Gunnar Wolf • gw...@gwolf.org • (+52-55)5623-0154 / 1451-2244
Re: default character encoding for everything in debian
Harald Braumann said [Tue, Aug 11, 2009 at 01:33:58AM +0200]:
> > There are a lot of users out there that are not willing to pay the price for increased generality.
>
> Don't you mean s/users/programmers? As a user I don't see what price I pay. I only see advantages in having a consistent encoding. Which, btw., doesn't have to be UTF-8. In an ideal world every programme would adhere to LC_CTYPE. But if the encoding has to be configured then I would also prefer UTF-8 as the default.
>
> Of course, for the programmer there might be a price to pay. And if he's not willing to pay it, he can't be forced, anyway.
>
> Or do you mean the user pays the price, because if the encoding is set to UTF-8 then performance would suffer? In that case, I'd love to see some real life numbers. I doubt the difference would be noticeable.

Yes, performance will suffer. We enjoyed many decades of blissfully ignoring the difference between a character and a byte. So, while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.

Also, for a very long time we didn't really care much what a buffer's content was. Everything could be printed, even if it had control characters which made you beep (with the occasional control sequence re-injecting output into the terminal as input). Now... Well, printing an unprintable string can cause segfaults in some cases.

-- 
Gunnar Wolf • gw...@gwolf.org • (+52-55)5623-0154 / 1451-2244
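The byte-by-byte scan described above can be sketched as follows. This is a minimal illustration that assumes well-formed UTF-8 input; the function name `utf8_length` and the sample text are made up for the example and do not come from any library mentioned in the thread.

```python
def utf8_length(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # starts a new code point, so counting those gives the length.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "naive cafe" written with two precomposed accented letters.
sample = "na\u00efve caf\u00e9".encode("utf-8")
print(len(sample), "bytes,", utf8_length(sample), "code points")
```

The scan is linear in the number of bytes, which is the cost being discussed; it does not validate the input, so malformed sequences would be miscounted.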
Re: default character encoding for everything in debian
Gunnar Wolf, on Tue 11 Aug 2009 13:28:08 -0500, wrote:
> while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.

Not necessarily. Any sane implementation should just use wchar_t and subtraction is back. The width of the text is another matter, but it's a problem for truetype rendering anyway.

What is still costly is then the conversion, which in principle only happens while talking with other programs (files/sockets/etc.)

Samuel
Re: default character encoding for everything in debian
In article 20090811182041.gd19...@cajita.gateway.2wire.net you wrote:
> encodings are _completely_ incompatible with UTF8, so it is just not possible to tolerate broken text every now and then. Everything just breaks completely.

Or everything works out of the box, when you use it correctly...

Gruss
Bernd
Re: default character encoding for everything in debian
In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> Not necessarily. Any sane implementation should just use wchar_t

Which could be UTF16 and therefore still has complicated length semantics. And even with UTF32 there are combining characters. Sadly.

But the length could be defined in code units; it's just a question of how useful that is.

Gruss
Bernd
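The different notions of "length" raised here can be made concrete with a short sketch. The helper `utf16_units` and the sample strings are assumptions made for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


emoji = "\U0001F600"       # one code point outside the Basic Multilingual Plane
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(emoji), utf16_units(emoji))    # code points vs. UTF-16 code units
print(len(decomposed), len(composed))    # before and after NFC normalisation
```

A single code point can need two UTF-16 code units, and one user-visible letter can be two code points even in a fixed-width 32-bit representation, so neither count is "the" length in every sense.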
Re: default character encoding for everything in debian
On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> > Not necessarily. Any sane implementation should just use wchar_t
>
> Which could be UTF16 and therefore still has complicated length semantics.

No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like Windows).

Bastian

-- 
Phasers locked on target, Captain.
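The width of wchar_t on a given platform can be inspected without a compiler. The sketch below uses Python's standard `ctypes` module and only reports the size of the type; it says nothing about which encoding the values carry, which is the separate point argued elsewhere in the thread.

```python
import ctypes

# Size in bytes of the platform's wchar_t as seen by ctypes
# (typically 4 on GNU/Linux and 2 on Windows).
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {wchar_bytes * 8} bits wide on this platform")
```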
Re: default character encoding for everything in debian
Bernd Eckenfels, on Tue 11 Aug 2009 21:40:35 +0200, wrote:
> In article 20090811183800.ge5...@const.famille.thibault.fr you wrote:
> > Not necessarily. Any sane implementation should just use wchar_t
>
> Which could be UTF16 and therefore still has complicated length semantics.

?? wchar_t may be 32 or 16 bits wide (in which case it can't express Unicode after U+), but it's still meant to have the simple length semantics.

> And even with UTF32 there are combining characters.

Which account for one character. Then there is a problem of rendering width of course, but as I said it's there anyway as soon as you have a font with varying letter widths; string manipulation doesn't pose any problem anyway.

> But the length could be defined in code units - its just a question how usefull it is.

Of course. It's rarely useful to take character width into account yourself, unless you are rendering on a tty, but then speed usually doesn't matter and you can afford calling wcswidth() on your string as late as possible.

Samuel
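What a tty-oriented width calculation has to do can be approximated in a few lines. This is not wcswidth() itself: the function `approx_width` is a hypothetical helper that ignores control characters and other zero-width cases, and it relies only on the standard `unicodedata` tables.

```python
import unicodedata


def approx_width(s: str) -> int:
    """Rough count of terminal columns: 0 for combining marks, 2 for wide CJK."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        # East Asian Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for text in ("abc", "日本語", "e\u0301"):
    print(repr(text), len(text), approx_width(text))
```

The column count and the character count diverge in both directions, which is why the width is best computed as late as possible, at rendering time.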
Re: default character encoding for everything in debian
* Bastian Blank wa...@debian.org, 2009-08-11, 22:24:
> > > Not necessarily. Any sane implementation should just use wchar_t
> >
> > Which could be UTF16 and therefore still has complicated length semantics.
>
> No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like Windows).

And in the most esoteric (while still conforming to the C standard) implementations it is not related to Unicode at all.

-- 
Jakub Wilk
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 09:04:37PM +0100, Roger Leigh wrote:
> If having a C.UTF-8 locale always available for system services is required for them to fully support UTF-8, then that needs adding to glibc.

It would also bring a significant speed increase. Since just about everything calls setlocale(), having the locale internal speeds up the typical process startup sequence by 20%! And that's 20% of the whole thing from fork(), through link, up to getopt(), so it's not a speedup you can shake a stick at.

I'm speaking about having the locale supported natively by glibc, of course; what the udeb does is merely shipping a generated locale file.

> For a locale available after /usr is mounted, a simple localedef invocation is all that's needed; for all times, after starting init, it needs the tables compiling into glibc as for the standard C locale. I've been looking at how to do the latter, but I'm not expert with the 3-level locale tables and other glibc internals, so if anyone who knows the details of glibc locales could provide me with assistance/guidance here, that would be much appreciated. For reference, this is bug #522776.
>
> This would be great to have as a release goal for Squeeze, and (speculatively) a native C UTF-8 locale for Squeeze+1 to give us a default pure UTF-8 system from end-to-end.

I'm not an expert on glibc internals either, but a couple of years ago I researched the issue a bit. Apparently, there are only two first-class locales: C and POSIX; all others get loaded from the disk. In the past, en_US.ISO-8859-1 and ru_RU.KOI8-R were first-class ones as well, but that's no longer the case.

What I'd propose would be making C.UTF-8 built in. Another possible optimization would be building the table used by 8-bit isalpha/etc on the fly for all locales. Iconving 128 characters is certainly faster than opening a file on the disk, and (sanely) glibc doesn't support character classification contrary to Unicode, so this could result in completely nuking all LC_CTYPE files for other locales as well.

-- 
1KB // Microsoft corollary to Hanlon's razor:
    // Never attribute to stupidity what can be
    // adequately explained by malice.
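The "build the 8-bit table on the fly" idea can be sketched outside glibc. The code below is only an analogy under stated assumptions: it uses Python's codecs in place of iconv and Python's Unicode-based `str.isalpha` in place of glibc's classification, and `build_alpha_table` is a hypothetical helper name. It fills all 256 entries rather than just the upper 128 so that the table can be indexed directly by byte value.

```python
def build_alpha_table(charset: str) -> list:
    """Derive an isalpha() lookup table for an 8-bit charset from Unicode data."""
    table = []
    for byte in range(256):
        # Bytes that the charset leaves undefined decode to U+FFFD,
        # which is not alphabetic.
        ch = bytes([byte]).decode(charset, errors="replace")
        table.append(ch.isalpha())
    return table


koi8 = build_alpha_table("koi8_r")
latin1 = build_alpha_table("latin_1")
print(sum(koi8[128:]), "alphabetic bytes in the upper half of KOI8-R")
print(sum(latin1[128:]), "alphabetic bytes in the upper half of ISO-8859-1")
```

Because the classification comes from one Unicode table, no per-charset data file is needed beyond the byte-to-code-point mapping, which is the saving the message proposes.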
Re: default character encoding for everything in debian
On Tue, 11 Aug 2009 13:28:08 -0500 Gunnar Wolf gw...@gwolf.org wrote:
> Harald Braumann said [Tue, Aug 11, 2009 at 01:33:58AM +0200]:
> > > There are a lot of users out there that are not willing to pay the price for increased generality.
> >
> > Don't you mean s/users/programmers? As a user I don't see what price I pay. I only see advantages in having a consistent encoding. Which, btw., doesn't have to be UTF-8. In an ideal world every programme would adhere to LC_CTYPE. But if the encoding has to be configured then I would also prefer UTF-8 as the default.
> >
> > Of course, for the programmer there might be a price to pay. And if he's not willing to pay it, he can't be forced, anyway.
> >
> > Or do you mean the user pays the price, because if the encoding is set to UTF-8 then performance would suffer? In that case, I'd love to see some real life numbers. I doubt the difference would be noticeable.
>
> Yes, performance will suffer. We enjoyed many decades of blissfully ignoring the difference between a character and a byte.

Well, a byte with the most significant bit always set to 0.

> So, while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.
>
> Also, for a very long time we didn't really care much what was a buffer's content -

And in those glorious times, more often than not, unintelligible rubbish was produced if you happened not to use a language that can be written in ASCII. But this is beside the point. I do appreciate that support for different character encodings causes pain for the programmer. But the original post was about software that already has support for UTF-8, and whether it wouldn't be a good idea to configure it this way by default.

> Everything could be printed, even if it had control characters which made you beep (with the occasional control sequence re-injecting output into the terminal as input). Now... Well, printing an unprintable string can cause segfaults in some cases.

My terminal supports UTF-8. I thought that this was not an issue any more.

Cheers,
harry
Re: default character encoding for everything in debian
Samuel Thibault wrote:
> Gunnar Wolf, on Tue 11 Aug 2009 13:28:08 -0500, wrote:
> > while length(str) in any language up to the 1990s was a mere subtraction, now we must go through the string checking each byte to see if it is a Unicode marker and subtract the appropriate number of bytes.
>
> Not necessarily. Any sane implementation should just use wchar_t and subtraction is back.

An implementation that uses wchar_t is usually not sane, but usually it is (also) buggy.

It is very difficult (AFAIK not impossible, but I'm not so sure) to write portable (POSIX way, so with changing locales) programs using wchar_t.

The only way I know to use wchar_t sanely is to stick to the simple C standard requirements: only one runtime environment and locale.

PS: note that the binary encoding depends on the compiler environment (but such info is not exported).

ciao
cate
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 13:09 +0200, Thomas Koch wrote:
> Hi,
>
> I have an issue: I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug against tomcat, I want to ask (and invite discussion): shouldn't utf8 be the default character set everywhere? Then when installing a package from Debian I could assume that wherever a character encoding can be set, it's set to utf8. MySQL would be another example, which to my knowledge uses isoXYZ as its default character encoding.

While utf-8 covers the broadest set of character glyphs possible, it suffers from size as well as performance penalties. Characters are no longer guaranteed to fit in a byte; how do you define strlen(utf8_string), etc.? All these issues have been solved, but not for free. There are a lot of users out there who are not willing to pay the price for increased generality.

just my 2¢
Siggy

-- 
Please don't Cc: me when replying, I might not see either copy.
bsb-at-psycho-dot-informationsanarchistik-dot-de
or:bsb-at-psycho-dot-i21k-dot-de
O ascii ribbon campaign - stop html mail - www.asciiribbon.org
Re: default character encoding for everything in debian
Thomas Koch wrote:
> Hi,
>
> I have an issue: I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug against tomcat, I want to ask (and invite discussion): shouldn't utf8 be the default character set everywhere? Then when installing a package from Debian I could assume that wherever a character encoding can be set, it's set to utf8. MySQL would be another example, which to my knowledge uses isoXYZ as its default character encoding.

There are different problems.

Future Debian systems will have a UTF-8 charset as default. Look at the debian-policy archives. A lot of Debian files will be encoded in utf-8 (control, changelog and manpages), and transformed into the needed charset at runtime.

But for databases there are different issues. I think the best solution is to do as mediawiki does: the UTF-8 data is put in as a binary blob. It is difficult to keep database engines and system libraries synchronized, and it is also difficult to implement support for all Unicode characters.

But let us concentrate on the first task: having good UTF-8 support in all programs/terminals/etc.

ciao
cate
Re: default character encoding for everything in debian
Hi

On Mon, 10 Aug 2009 13:09:21 +0200, Thomas Koch tho...@koch.ro wrote:
> I've an issue, that I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug to tomcat, I want to ask (and invite to discuss) shouldn't utf8 be the default character set everywhere? So when installing a package from Debian I can assume that where a character encoding can be set, it's set to utf8.
>
> MySQL would be another example, which to my knowledge uses isoXYZ as default character encoding.

I don't know tomcat, but for MySQL it would definitely break some existing applications (which are broken and do not care about charsets, but that's a different topic).

-- 
Michal Čihař | http://cihar.com | http://blog.cihar.com
Re: default character encoding for everything in debian
On Monday, 10 August 2009 at 14:06 +0200, Giacomo A. Catenazzi wrote:
> But let us concentrate on the first task: having good UTF-8 support in all programs/terminals/etc.

This task should have been completed for etch. Now we could concentrate on removing from the archive programs without proper UTF8 support.

Cheers,
-- 
 .''`.      Josselin Mouette
: :' :
`. `'   “I recommend you to learn English in hope that you in
  `-     future understand things”  -- Jörg Schilling
Re: default character encoding for everything in debian
Josselin Mouette j...@debian.org writes:
> Now we could concentrate on removing from the archive programs without proper UTF8 support.

There are, sadly, some very useful programs with no adequate replacement that don't have UTF-8 support. tf5, for instance.

-- 
Russ Allbery (r...@debian.org) http://www.eyrie.org/~eagle/
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 01:45:40PM +0200, Siggy Brentrup wrote:
> On Mon, Aug 10, 2009 at 13:09 +0200, Thomas Koch wrote:
> > Hi,
> >
> > I've an issue, that I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug to tomcat, I want to ask (and invite to discuss) shouldn't utf8 be the default character set everywhere? So when installing a package from Debian I can assume that where a character encoding can be set, it's set to utf8.
> >
> > MySQL would be another example, which to my knowledge uses isoXYZ as default character encoding.
>
> While utf-8 covers the broadest set of character glyphs possible, it suffers from size as well as performance penalties. Characters are no longer guaranteed to fit in a byte; how do you define strlen(utf8_string), etc.? All these issues have been solved, but not for free.

Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.

> There are a lot of users out there that are not willing to pay the price for increased generality.

These users will need to change their character encoding to something else. But the Debian default should remain UTF-8. Those not willing to pay the flexibility/performance tradeoff are the exception, and will need to customise their environment accordingly.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
Re: default character encoding for everything in debian
On Mon, 10 Aug 2009, Roger Leigh wrote:
> Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.

Rubbish. You know why in Japan and other Asian countries UTF8 is not so common? Because many of their glyphs need 4 (four!) bytes, while for example jis-2022 (AFAIR) is much more compact.

We are not living in an ASCII world anymore.

Best wishes
Norbert

---
Dr. Norbert Preining prein...@logic.at   Vienna University of Technology
Debian Developer prein...@debian.org   Debian TeX Group
gpg DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
---
CHICAGO (n.)
The foul-smelling wind which precedes an underground railway train.
--- Douglas Adams, The Meaning of Liff
Re: default character encoding for everything in debian
On 2009-08-10, Norbert Preining prein...@logic.at wrote:
> On Mon, 10 Aug 2009, Roger Leigh wrote:
> > Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.
>
> Rubbish. You know why in Japan and other Asian countries UTF8 is not so common? Because many of their glyphs need 4 (four!) bytes, while for example jis-2022 (AFAIR) is much more compact.
>
> We are not living in an ASCII world anymore.

Really because of the size? We are not living in a byte-beancounting world anymore. At worst you double the *text* size (we're not talking about images or anything, which are far larger), going from 2 bytes that you need anyway to four.

ISO 2022 also wastes one bit per byte to be 7bit safe. If I read the Wikipedia article correctly, at least the JP escaping only needs to be put into the document once. (Well, or maybe several times switching back and forth if you're embedding latin-encoded words into the text.)

Maybe I'm an ignorant European but I'm not sure that equation still holds. Of course there are certain tradeoffs about latin characters being the privileged few to get a short encoding, but that doesn't make UTF-8 so bad per se as to call it rubbish.

Kind regards,
Philipp Kern
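The size argument in this exchange is easy to check empirically. The following is a minimal sketch, not something posted in the thread: the sample strings and the `encoded_sizes` helper are assumptions chosen for illustration, and the encoding names are the ones Python's standard codecs use for UTF-8, EUC-JP, Shift_JIS and ISO-2022-JP.

```python
# Illustrative sketch only: sample strings and helper name are assumptions.

ENCODINGS = ("utf-8", "euc_jp", "shift_jis", "iso2022_jp")

JAPANESE = "\u65e5\u672c\u8a9e"   # 日本語, three kanji
MIXED = "Debian " + JAPANESE      # Latin text followed by the same kanji


def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` for each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


if __name__ == "__main__":
    for label, text in (("kanji only", JAPANESE), ("mixed", MIXED)):
        print(label)
        for name, size in encoded_sizes(text).items():
            print(f"  {name:<11} {size:>3} bytes")
```

The kanji in this sample sit in the Basic Multilingual Plane, so UTF-8 spends three bytes on each; four-byte sequences are only needed for code points above U+FFFF. For short strings the ISO-2022-JP escape sequences (switching into the kanji set and back to ASCII) outweigh its two-byte characters.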
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 19:53 +0100, Roger Leigh wrote:
> On Mon, Aug 10, 2009 at 01:45:40PM +0200, Siggy Brentrup wrote:
> > While utf-8 covers the broadest set of character glyphs possible, it suffers from size as well as performance penalties. Characters are no longer guaranteed to fit in a byte; how do you define strlen(utf8_string), etc.? All these issues have been solved, but not for free.
>
> Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.

It's not Huffman (just kidding :). Stating the obvious: you're trading time efficiency for space efficiency.

> > There are a lot of users out there that are not willing to pay the price for increased generality.
>
> These users will need to change their character encoding to something else. But the Debian default should remain UTF-8. Those not willing to pay the flexibility/performance tradeoff are the exception, and will need to customise their environment accordingly.

Either my memory is wrong or I seem to have missed some fundamental change in Debian Policy during my 5 years of absence. From those days I seem to remember that Debian supported the use of low-end machines, while they now seem to be deprecated, as I was told in another thread on d-u iirc.

Call me a dinosaur, I have not yet decided what to think about this.

Regards
Siggy
-- 
Please don't Cc: me when replying, I might not see either copy.
bsb-at-psycho-dot-informationsanarchistik-dot-de or:bsb-at-psycho-dot-i21k-dot-de
O ascii ribbon campaign - stop html mail - www.asciiribbon.org
Re: default character encoding for everything in debian
On Mon, 10 Aug 2009, Philipp Kern wrote:
> > > Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.
> [...]
> make UTF-8 bad per se to call it rubbish.

I didn't call utf-8 itself rubbish, I am myself a strong proponent of utf-8, only your quote that it is about as compact as an extended encoding is going to get.

OTOH, I agree that UTF-8 is the way to go in general computing, I have had too much pain with all those local encodings around the world.

Best wishes
Norbert

---
Dr. Norbert Preining prein...@logic.at   Vienna University of Technology
Debian Developer prein...@debian.org   Debian TeX Group
gpg DSA: 0x09C5B094   fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
---
HUTTOFT (n.)
The fibrous algae which grows in the dark, moist environment of trouser turn-ups.
--- Douglas Adams, The Meaning of Liff
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 02:06:44PM +0200, Giacomo A. Catenazzi wrote:
> Thomas Koch wrote:
> > I've an issue, that I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug to tomcat, I want to ask (and invite to discuss) shouldn't utf8 be the default character set everywhere? So when installing a package from Debian I can assume that where a character encoding can be set, it's set to utf8.
> >
> > MySQL would be another example, which to my knowledge uses isoXYZ as default character encoding.
>
> Future debian systems will have a UTF-8 charset as default. Look at debian-policy archives.

For system users, yes, assuming you are talking about the C.UTF-8 proposal. For normal users, UTF-8 has been the default since Lenny.

If having a C.UTF-8 locale always available for system services is required for them to fully support UTF-8, then that needs adding to glibc. For a locale available after /usr is mounted, a simple localedef invocation is all that's needed; for all times, after starting init, it needs the tables compiling into glibc as for the standard C locale.

I've been looking at how to do the latter, but I'm not expert with the 3-level locale tables and other glibc internals, so if anyone who knows the details of glibc locales could provide me with assistance/guidance here, that would be much appreciated. For reference, this is bug #522776.

This would be great to have as a release goal for Squeeze, and (speculatively) a native C UTF-8 locale for Squeeze+1 to give us a default pure UTF-8 system from end-to-end.

> A lot of debian files will be encoded in utf-8 (control, changelog and manpages), and transformed in the needed charset runtime.

I think "will" here implies it's something to be done in the future, but it's a requirement right now, and all but a few exceptions are already converted.

> But for databases there are different issues. I think the best solution is to do it as mediawiki: the UTF-8 data is put in as a binary blob: it is difficult to have database engines and system libraries synchronized, and it is also difficult to implement support for all Unicode characters.

PostgreSQL seems to manage it without problems. Putting text in as a binary blob obviates most uses for having it in a database in the first place. Sorting, indexing and querying require being able to read it!

Note that there are separate client and server (database) encodings for text as well. You may well get recoding between what the user sees and what's actually stored in the database, potentially at several points. Having UTF-8 on the server does not require it on the client (and vice versa).

> But let us concentrate on the first task: having good UTF-8 support in all programs/terminals/etc.

I think that part was already done quite some time ago. Any program that doesn't support UTF-8 is an exception, and should be fixed or removed.

For the specific case of databases, what's being proposed here is making the default UTF-8. Existing databases should not be affected, since they would retain their current encoding. New databases should, however, use UTF-8. If a specific application needs a specific encoding in order to function correctly, then it's that application's responsibility to specify that when creating the database, i.e. overriding the default. If it doesn't do that already, it's already broken, since the encoding is currently unspecified.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
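The two database points in this message, recoding between client and server encodings and the cost of storing text as an opaque blob, can be sketched without any database at all. This is an illustration under stated assumptions: `to_server`, `to_client` and the sample strings are hypothetical names made up here, and no real database driver behaves exactly like these three-line helpers.

```python
# Illustrative sketch only: function names and samples are assumptions,
# not the API of PostgreSQL, MySQL or any driver.


def to_server(client_bytes: bytes, client_encoding: str,
              server_encoding: str = "utf-8") -> bytes:
    """Recode what the client sent into the server's storage encoding."""
    return client_bytes.decode(client_encoding).encode(server_encoding)


def to_client(stored: bytes, client_encoding: str,
              server_encoding: str = "utf-8") -> bytes:
    """Recode stored data into whatever encoding the client asked for."""
    return stored.decode(server_encoding).encode(client_encoding)


# A Latin-1 client writes "café"; the server keeps it as UTF-8.
latin1_input = b"caf\xe9"
stored = to_server(latin1_input, "latin-1")

# Case-insensitive matching works on text, but not on an opaque blob:
# bytes.lower() only touches ASCII letters.
text_match = "CAF\u00c9".lower() == stored.decode("utf-8")
blob_match = "CAF\u00c9".encode("utf-8").lower() == stored

if __name__ == "__main__":
    print("stored bytes:     ", stored)
    print("back to latin-1:  ", to_client(stored, "latin-1"))
    print("text comparison:  ", text_match)
    print("blob comparison:  ", blob_match)
```

The round trip shows that a UTF-8 server does not force UTF-8 on the client, and the failed blob comparison is a small instance of why querying needs the engine to know the encoding.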
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 09:49:34PM +0200, Norbert Preining wrote:
> On Mon, 10 Aug 2009, Philipp Kern wrote:
> > > > Of course there's a penalty for certain operations. But UTF-8 is about as compact as an extended encoding is going to get.
> > [...]
> > make UTF-8 bad per se to call it rubbish.
>
> I didn't call utf-8 itself rubbish, I am myself a strong proponent of utf-8, only your quote that it is about as compact as an extended encoding is going to get.

I should have qualified it with "that is both 8-bit and backward-compatible with ASCII". Other encodings will be more compact, but AFAIK there isn't a more compact UCS encoding, though UTF-16 /might/ be more compact for certain languages, albeit without any 8-bit backward compatibility.

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
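Whether UTF-16 is more compact than UTF-8 depends on the script, which a short sketch can show. The sample words and the `utf8_vs_utf16` helper are assumptions chosen for illustration; `utf-16-le` is used so that no byte-order mark is counted.

```python
# Illustrative sketch only: samples and helper name are assumptions.

SAMPLES = {
    "ascii": "encoding",
    "greek": "\u03ba\u03cc\u03c3\u03bc\u03bf\u03c2",   # κόσμος
    "kanji": "\u6587\u5b57\u30b3\u30fc\u30c9",         # 文字コード
}


def utf8_vs_utf16(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        utf8_size, utf16_size = utf8_vs_utf16(text)
        print(f"{label:<6} utf-8={utf8_size:>2}  utf-16={utf16_size:>2}")
```

ASCII text doubles in UTF-16, two-byte scripts such as Greek come out even, and Basic Multilingual Plane CJK text is where UTF-16 wins, at the cost of ASCII compatibility.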
Re: default character encoding for everything in debian
On Mon, Aug 10, 2009 at 09:42:18PM +0100, Roger Leigh wrote:
> On Mon, Aug 10, 2009 at 09:49:34PM +0200, Norbert Preining wrote:
> > I didn't call utf-8 itself rubbish, I am myself a strong proponent of utf-8, only your quote that it is about as compact as an extended encoding is going to get.
>
> I should have qualified it with "that is both 8-bit and backward-compatible with ASCII". Other encodings will be more compact, but AFAIK there isn't a more compact UCS encoding, though UTF-16 /might/ be more compact for certain languages, albeit without any 8-bit backward compatibility.

Actually, SCSU and BOCU-1 are potentially more compact, assuming the text can be compressed. However, they are not backward-compatible with ASCII; SCSU comes closer than BOCU-1. As a practical matter, nobody of any importance actually uses SCSU or BOCU-1, except for Reuters (with SCSU).

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 713 440 7475 | http://crustytoothpaste.ath.cx/~bmc | My opinion only
OpenPGP: RSA v4 4096b 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
Re: default character encoding for everything in debian
On Mon, 10 Aug 2009 13:45:40 +0200 Siggy Brentrup deb...@psycho.i21k.de wrote:
> On Mon, Aug 10, 2009 at 13:09 +0200, Thomas Koch wrote:
> > Hi,
> >
> > I've an issue, that I forgot to set the character encoding of tomcat to utf-8 after reinstalling a server. Now, before I report a wishlist(?) bug to tomcat, I want to ask (and invite to discuss) shouldn't utf8 be the default character set everywhere? So when installing a package from Debian I can assume that where a character encoding can be set, it's set to utf8.
> >
> > MySQL would be another example, which to my knowledge uses isoXYZ as default character encoding.
>
> While utf-8 covers the broadest set of character glyphs possible, it suffers from size as well as performance penalties. Characters are no longer guaranteed to fit in a byte; how do you define strlen(utf8_string), etc.? All these issues have been solved, but not for free.
>
> There are a lot of users out there that are not willing to pay the price for increased generality.

Don't you mean s/users/programmers/? As a user I don't see what price I pay. I only see advantages in having a consistent encoding. Which, btw., doesn't have to be UTF-8. In an ideal world every programme would adhere to LC_CTYPE. But if the encoding has to be configured then I would also prefer UTF-8 as the default.

Of course, for the programmer there might be a price to pay. And if he's not willing to pay it, he can't be forced, anyway.

Or do you mean the user pays the price, because if the encoding is set to UTF-8 then performance would suffer? In that case, I'd love to see some real life numbers. I doubt the difference would be noticeable.

Cheers,
harry
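What "adhering to LC_CTYPE" might look like in a program can be sketched as follows. This is a minimal illustration, not anything proposed in the thread: the `locale_encoding` helper is a made-up name, and it relies only on `locale.setlocale` and `locale.getpreferredencoding` from Python's standard library.

```python
# Illustrative sketch only: the helper name is an assumption.
import locale


def locale_encoding() -> str:
    """Return the character encoding implied by the user's LC_CTYPE."""
    try:
        # An empty string means "take the setting from the environment"
        # (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep
        # whatever locale is already active instead of failing.
        pass
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    encoding = locale_encoding()
    print("encoding from LC_CTYPE:", encoding)
    # A program that honours the locale would use this encoding, rather
    # than a hard-coded one, when turning text into bytes for the terminal.
    print("caf\u00e9".encode(encoding, errors="replace"))
```

A program written this way needs no encoding option of its own, so the question of a configured default only arises for software that ignores the locale.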
Re: default character encoding for everything in debian
Harald Braumann wrote on Tue 11 Aug 2009 01:33:58 +0200:
> Or do you mean the user pays the price, because if the encoding is set to UTF-8 then performance would suffer? In that case, I'd love to see some real life numbers. I doubt the difference would be noticeable.

Google "utf-8 grep performance loss".

Samuel