Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> characters to stdout should use UTF-8. That's what LC_CTYPE means.

So, cat, grep, etc. are all broken. :)

--
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)

--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110216000107.gl15...@prunille.vinc17.org
Re: OT: Python (was: Make Unicode bugs release critical?)
On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> > When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> > characters to stdout should use UTF-8. That's what LC_CTYPE means.
>
> So, cat, grep, etc. are all broken. :)

How come? cat will, for any valid UTF-8 character on input, print a valid
UTF-8 character on output. For any valid ISO-8859-1 character on input, it
will print a valid ISO-8859-1 character on output.

grep, on the other hand, has to actually understand the encoding -- and it
does. Try this:

$ echo ą|LC_CTYPE=C grep --color=always .
Will be mangled.

$ echo ą|LC_CTYPE=en_US.utf-8 grep --color=always .
Will be handled correctly.

--
1KB // Microsoft corollary to Hanlon's razor:
    // Never attribute to stupidity what can be
    // adequately explained by malice.

Archive: http://lists.debian.org/20110216003451.ga14...@angband.pl
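The difference between a byte-transparent tool such as cat and an encoding-aware one such as grep can be sketched in Python. This is only an illustration and not how either tool is implemented: the helper names `passthrough` and `first_match` are made up for the example, and Latin-1 stands in for "some single-byte encoding" rather than for the C locale specifically.

```python
import re

def passthrough(data: bytes) -> bytes:
    """Copy bytes unchanged, as a byte-transparent filter would."""
    return data

def first_match(data: bytes, encoding: str) -> bytes:
    """Decode with `encoding`, match the regex '.', return the matched bytes."""
    text = data.decode(encoding)
    match = re.match(".", text)
    return match.group(0).encode(encoding)

utf8_input = "ą\n".encode("utf-8")   # two bytes for the letter, then a newline

# Byte-for-byte copying preserves whatever encoding the input had.
copied = passthrough(utf8_input)

# Decoded as UTF-8, '.' matches the whole two-byte character.
whole_char = first_match(utf8_input, "utf-8")

# Decoded with a single-byte encoding, '.' matches only the first byte,
# which splits the character in half.
half_char = first_match(utf8_input, "latin-1")
```

A filter that never interprets its input cannot get the encoding wrong, whereas anything that matches "one character" has to know where characters begin and end.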
Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-16 01:34:51 +0100, Adam Borowski wrote:
> On Wed, Feb 16, 2011 at 01:01:07AM +0100, Vincent Lefevre wrote:
> > On 2011-02-14 16:43:11 +, Ian Jackson wrote:
> > > When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> > > characters to stdout should use UTF-8. That's what LC_CTYPE means.
> >
> > So, cat, grep, etc. are all broken. :)
>
> How come? cat will, for any valid UTF-8 character on input, print a valid
> UTF-8 character on output. For any valid ISO-8859-1 character on input, it
> will print a valid ISO-8859-1 character on output.

I was just commenting on what Ian said. If there is a valid reason for which
cat may not produce UTF-8 in UTF-8 locales, this is also true for perl or
any other software.

--
Vincent Lefèvre

Archive: http://lists.debian.org/20110216004529.gn15...@prunille.vinc17.org
OT: Python (was: Make Unicode bugs release critical?)
Hi,

let's start a Python rant. I love to hate this language. :-)

On Mon, 14 Feb 2011 at 14:14, Jakub Wilk wrote:
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> > unicode pound sign
> [...]
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"' | cat
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> > position 0: ordinal not in range(128)
>
> This is the expected behaviour. Incidentally, it has nothing to do with
> UTF-8. You'll get the same result if you use a locale with a legacy
> encoding.

I see. It is funny to see Python lovers blame others for the bugs in the
language.

~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Both give the same result, a '£' sign, as expected.

> * Ian Jackson ijack...@chiark.greenend.org.uk, 2011-02-14, 12:42:
> > Excellent, I look forward to the removal of python. I always hated that
> > language anyway.

I hate them more. :-)

Regards
   Klaus
--
Klaus Ethgen                            http://www.ethgen.ch/
pub  2048R/D1A4EDE5 2000-02-26 Klaus Ethgen kl...@ethgen.de
Fingerprint: D7 67 71 C4 99 A6 D4 FE  EA 40 30 57 3C 88 26 2B

Archive: http://lists.debian.org/20110214133736.gb6...@ikki.ethgen.ch
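The error in the traceback comes from the codec, not from the character: U+00A3 simply has no ASCII representation. A small sketch in Python 3 syntax (the thread's traceback is from a Python 2 interpreter) shows which codecs can and cannot represent it. The helper `try_encode` is hypothetical, and the sketch says nothing about how the interpreter picks its output codec.

```python
pound = "\u00a3"  # POUND SIGN, the character used in the examples above

def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_result = try_encode(pound, "ascii")      # not representable in ASCII
utf8_result = try_encode(pound, "utf-8")       # two bytes
latin1_result = try_encode(pound, "latin-1")   # one byte
```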
Re: OT: Python (was: Make Unicode bugs release critical?)
On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
>
> Both give the same result, a '£' sign, as expected.

And what's the value in that demonstration? Yes, you can treat UTF-8 like a
bytestream. And the thread was about the problems that can arise from this.

Kind regards
Philipp Kern

Archive: http://lists.debian.org/slrnilidf3.11r.tr...@kelgar.0x539.de
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> let's start a Python rant. I love to hate this language. :-)

Let's not.

Let's not rant about any languages, or tools, or desktop environments.
Let's be constructive on Debian mailing lists, shall we? We have plenty of
side-channels for rants, sarcasm, snide remarks, passive-aggressiveness, and
other forms of anti-social behavior; let's use those instead.

Archive: http://lists.debian.org/1297692931.31960.13.camel@tacticus
Re: OT: Python (was: Make Unicode bugs release critical?)
* Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat

Let me try...

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
stdin: line 1, char 1, byte offset 1: invalid UTF-8 code

But I don't blame Perl for that. It's documented behavior, so I can either
live with that or use another language.

--
Jakub Wilk

Archive: http://lists.debian.org/20110214143302.ga6...@jwilk.net
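The kind of check isutf8 performs can be approximated in a few lines of Python. This is a sketch, not the tool's implementation; `is_utf8` is a made-up helper, and the sample bytes assume that the Perl one-liner emitted the single byte 0xA3, which is consistent with the error reported above.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

latin1_bytes = b"\xa3\n"      # lone 0xA3 byte: assumed output of the one-liner
utf8_bytes = b"\xc2\xa3\n"    # the UTF-8 encoding of U+00A3, then a newline

latin1_ok = is_utf8(latin1_bytes)
utf8_ok = is_utf8(utf8_bytes)
```

A lone 0xA3 is a continuation byte with no lead byte before it, so a strict decoder rejects it at the first position.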
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 at 15:15, Lars Wirzenius wrote:
> On Mon, 2011-02-14 at 14:37 +0100, Klaus Ethgen wrote:
> > let's start a Python rant. I love to hate this language. :-)
>
> Let's not.

Up to here, it is personal preference.

> Let's not rant about any languages, or tools, or desktop environments.
> Let's be constructive on Debian mailing lists, shall we?

You are right. I just couldn't resist when someone was trying to blame
everyone other than the one that has the bug.

Regards
   Klaus

Archive: http://lists.debian.org/20110214143445.gd6...@ikki.ethgen.ch
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, Feb 14, 2011 at 02:02:11PM +, Philipp Kern wrote:
> On 2011-02-14, Klaus Ethgen kl...@ethgen.de wrote:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> >
> > Both give the same result, a '£' sign, as expected.
>
> And what's the value in that demonstration? Yes, you can treat UTF-8 like
> a bytestream. And the thread was about the problems that can arise from
> this.

Er, and tell me where exactly it makes sense to allow one encoding but not
another for a bytestream?

It appears that Python has a nasty bug where it ignores the encoding if
isatty(stdout) returns 0. So let's go fixing or reporting that rather than
arguing about it.

Archive: http://lists.debian.org/20110214143608.ga8...@angband.pl
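One way to make output independent of whether stdout is a terminal is to take the encoding from the locale explicitly and wrap the byte stream yourself. The following is a minimal Python 3 sketch under stated assumptions: `locale_text_writer` is a hypothetical helper, an `io.BytesIO` stands in for a pipe, and the encoding is passed in explicitly so the result does not depend on the machine's locale.

```python
import io
import locale

def locale_text_writer(byte_stream, encoding=None):
    """Wrap a byte stream so text is encoded per the locale (or `encoding`)."""
    if encoding is None:
        # Ask the locale, regardless of whether the stream is a terminal.
        encoding = locale.getpreferredencoding(False)
    return io.TextIOWrapper(byte_stream, encoding=encoding, newline="\n")

pipe = io.BytesIO()                        # stand-in for a non-tty stdout
writer = locale_text_writer(pipe, encoding="utf-8")
writer.write("\u00a3\n")
writer.flush()
written = pipe.getvalue()                  # bytes that reached the "pipe"
```

In a real script, `sys.stdout.buffer` would take the place of the `BytesIO` object, and leaving `encoding` unset would defer to the locale.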
Re: OT: Python (was: Make Unicode bugs release critical?)
Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
>
> Let me try...
>
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> stdin: line 1, char 1, byte offset 1: invalid UTF-8 code

WTF. OK, Perl's out too.

We'll have to write everything in dash :-).

Ian.

Archive: http://lists.debian.org/19801.18743.486394.290...@chiark.greenend.org.uk
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 at 16:24, Ian Jackson wrote:
> Jakub Wilk writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> > * Klaus Ethgen kl...@ethgen.de, 2011-02-14, 14:37:
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > > ~ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | cat
> >
> > Let me try...
> >
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";' | isutf8
> > stdin: line 1, char 1, byte offset 1: invalid UTF-8 code
>
> WTF. OK, Perl's out too.

No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get a
correct utf-8 character you need to print \x{c2a3} and then isutf8 is happy.

> We'll have to write everything in dash :-).

lisp. :-)

But now we are getting completely off topic.

Regards
   Klaus

Archive: http://lists.debian.org/20110214162139.gf6...@ikki.ethgen.ch
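The distinction between a code point and its encoded bytes can be shown with a short Python sketch. The variable names are illustrative only. It shows that U+00A3 encodes to the two bytes C2 A3 in UTF-8, while the number 0xC2A3 read as a code point is a different character with its own three-byte encoding.

```python
code_point = 0x00A3                   # the code point for the pound sign
char = chr(code_point)
encoded = char.encode("utf-8")        # the bytes a UTF-8 stream should carry

# Reading 0xC2A3 as a code point gives a different character entirely,
# and its UTF-8 form is three bytes long.
other_char = chr(0xC2A3)
other_encoded = other_char.encode("utf-8")
```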
Re: OT: Python (was: Make Unicode bugs release critical?)
Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get a
> correct utf-8 character you need to print \x{c2a3} and then isutf8 is happy.

When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
characters to stdout should use UTF-8. That's what LC_CTYPE means.

Ian.

Archive: http://lists.debian.org/19801.23455.536473.211...@chiark.greenend.org.uk
Re: OT: Python (was: Make Unicode bugs release critical?)
On Mon, 14 Feb 2011 16:43:11 +
Ian Jackson ijack...@chiark.greenend.org.uk wrote:
> Klaus Ethgen writes (Re: OT: Python (was: Make Unicode bugs release critical?)):
> > No, it is not. 00a3 is just not a utf-8 character, it is unicode. To get
> > a correct utf-8 character you need to print \x{c2a3} and then isutf8 is
> > happy.
>
> When LC_CTYPE=en_GB.utf-8, programs which attempt to print unicode
> characters to stdout should use UTF-8. That's what LC_CTYPE means.

By the way,

$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|isutf8
$
$ LC_CTYPE=en_GB.utf-8 echo 'puts \x00a3\n'|tclsh|xxd -p
c2a30a0a
$

But RMS told the world not to use Tcl.

Archive: http://lists.debian.org/20110214203601.715df57c.kos...@domain007.com
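The hex dump above can be read back in Python to confirm what it contains. This is a small sketch with illustrative variable names. The interpretation of the two trailing newlines (one from the `\n` in the script, one added by `puts`) is an assumption about the Tcl side and is not checked here.

```python
dump = "c2a30a0a"                      # the xxd -p output quoted above
raw = bytes.fromhex(dump)
text = raw.decode("utf-8")             # succeeds, so the bytes are valid UTF-8

first_char = text[0]
trailing = text[1:]
```

The first two bytes decode to a single character, U+00A3, and the remaining two bytes are line feeds.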