Re: Locales/sort bug
On Fri, 05 Nov 2010 00:36:47 +0100, David Jardine wrote:

> On Thu, Nov 04, 2010 at 10:55:53PM +, Camaleón wrote:

(...)

>> Heck, it's even weirder with this sequence:
>>
>> aph3,z
>> aph3_devel,a
>> aph3,b
>>
>> It gets sorted as:
>>
>> aph3,b
>> aph3_devel,a
>> aph3,z
>>
>> I'm trying to reverse-engineer the logic behind the sort but I can't
>> see it. Maybe it is done randomly? Very curious, indeed.
>
> It just seems to ignore certain characters. Try filtering the output
> through, for example, 's/[_||,]//g' and then you get it in the right
> order.

Yes, the sort documentation and man page warn about that (and advise
avoiding custom locales when using it), but what does sort really do,
and how, when locales are in use? Why rank the comma first and then
give the underscore a higher priority? :-?

Greetings,

-- 
Camaleón

-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/pan.2010.11.05.07.07...@gmail.com
Locales/sort bug
Hi all,

do you think it's a bug in either libc or coreutils (sort)?

$ cat test.csv
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=C sort test.csv # expected
aph3,APP,
aph3,MiB,
aph3_devel,TXT,

$ LC_ALL=pl_PL sort test.csv # why is that?
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

Could anyone give me a hint? I know that this is LC_COLLATE related
(I set LC_ALL as a shortcut), but I don't know whether it is my fault
or an upstream bug. I'd appreciate any comments.

Regards,
Robert

-- 
Archive: http://lists.debian.org/aanlktimst_3jnpkwahyv81c4=a07udqo5unndm47h...@mail.gmail.com
Re: Locales/sort bug
On Thu, 04 Nov 2010 20:29:02 +0100, Rob Gom wrote:

> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=C sort test.csv # expected
> aph3,APP,
> aph3,MiB,
> aph3_devel,TXT,
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug.

I'm also getting that behaviour (locale set to es_ES.UTF-8), so I
understand that my locale setting dictates that the underscore (_)
sorts before the comma (,).

As per the sort man page:

*** WARNING *** The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.

Do you think that is a bug? :-?

Greetings,

-- 
Camaleón

-- 
Archive: http://lists.debian.org/pan.2010.11.04.20.16...@gmail.com
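A minimal sketch of what the man page warning means in practice, using the rows from the original post. In the C locale sort compares raw byte values, and the comma (byte 44) is lower than the underscore (byte 95), so both "aph3," rows come before "aph3_devel". The variable names `rows` and `c_sorted` are illustrative only and do not appear in the thread.

```shell
# Rows from the original post (hypothetical variable name).
rows='aph3,APP,
aph3_devel,TXT,
aph3,MiB,'

# POSIX printf: a leading quote yields the numeric value of the character.
printf 'comma=%d underscore=%d\n' "'," "'_"

# Byte-order sort: the comma (44) sorts before the underscore (95).
c_sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

This only shows the byte-value ordering; it says nothing about how a non-C locale weighs the same characters.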
Re: Locales/sort bug
On 11/04/2010 02:29 PM, Rob Gom wrote:

> Hi all,
> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=C sort test.csv # expected
> aph3,APP,
> aph3,MiB,
> aph3_devel,TXT,
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug. I'd appreciate any comments.

While it *might* be an upstream bug, it's unlikely. (The first thing I
learned in my first CompSci class is that it's not the compiler's fault
that my program doesn't work...)

You just don't know what the Polish ASCII collating sequence is.

-- 
Seek truth from facts.

-- 
Archive: http://lists.debian.org/4cd31538.7060...@cox.net
Re: Locales/sort bug
[cut]

> I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
> understand that my locale setting dictates underscore (_) comes first
> than comma (,) symbol.
>
> As per man sort page:
>
> *** WARNING *** The locale specified by the environment affects sort
> order. Set LC_ALL=C to get the traditional sort order that uses
> native byte values.
>
> Do you think that is a bug? :-?
>
> Greetings,
>
> --
> Camaleón

If so, why do I get the order comma, underscore, comma? Even better:
comma+quote+A, underscore+d, comma+quote+M. I don't get it...

Regards,
Robert

-- 
Archive: http://lists.debian.org/aanlkti=vm8um=jxkzigysytyzd97f39ysynofbhhc...@mail.gmail.com
Re: Locales/sort bug
On 2010-11-04 20:29 +0100, Rob Gom wrote:

> Hi all,
> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=C sort test.csv # expected
> aph3,APP,
> aph3,MiB,
> aph3_devel,TXT,
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,APP,
> aph3_devel,TXT,
> aph3,MiB,
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug. I'd appreciate any comments.

This is covered by the coreutils FAQ:

http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Sven

-- 
Archive: http://lists.debian.org/87lj58n8em@turtle.gmx.de
Re: Locales/sort bug
[cut]

> This is covered by the coreutils FAQ:
>
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
> Sven

Thanks for all the answers.

How can I tell whether the collation is defined correctly? I understand
that LC_COLLATE influences the sort operation, but I am not sure whether
this result is OK. The simplest example that causes the weird behaviour
is:

$ cat test2.csv
,A
_d
,M

$ LC_ALL=pl_PL sort test2.csv # and many other LC_COLLATE variants, other than C/POSIX
,A
_d
,M

To get such behaviour, ',' would have to be defined in the collation
definition as a single entity with the same ordering as '_'. I have no
other explanation for it. Unfortunately, I am not good enough to
understand or verify the collation definition in /usr/share/i18n :)

Regards,
Robert

-- 
Archive: http://lists.debian.org/aanlktik3fubmr0oclochjcosyqwhcahgsjvohtfrj...@mail.gmail.com
Re: Locales/sort bug
One more thing.

If I set LC_COLLATE to C/POSIX, the sorting of special characters looks
fine, but I lose the ordering of Polish characters. If I set LC_COLLATE
to pl_PL.UTF-8, the ordering of Polish characters is fine, but sorting
goes crazy with special characters. Is it possible to retain both
features?

carra...@laptop-rg:/tmp$ cat test2.csv
,A
_d
,M
a
ą
b
ż
ć
z

carra...@laptop-rg:/tmp$ LC_ALL=POSIX sort test2.csv
,A
,M
_d
a
b
z
ą
ć
ż
# above - correct special characters, Polish in wrong order

carra...@laptop-rg:/tmp$ LC_ALL=pl_PL.UTF-8 sort test2.csv
a
,A
ą
b
ć
_d
,M
z
ż
# above - correct Polish characters order, incorrect special characters

Feel free to replace 'correct' with 'expected' in my posts; I'm just
trying to understand what's under the hood.

Regards,
Robert

-- 
Archive: http://lists.debian.org/aanlkti=pu2np+filsqx6vcjj32sqnyajqpopqw33v...@mail.gmail.com
Re: Locales/sort bug
I have a workaround of sorts. When I know the field separator (which
was the case in my original example), I can use it to overcome the
limitation:

$ LC_ALL=pl_PL.UTF-8 sort -k1,1 -t',' test.csv
aph3,APP,
aph3,MiB,
aph3_devel,TXT,
# everything fine

$ LC_ALL=pl_PL.UTF-8 sort test.csv
aph3,APP,
aph3_devel,TXT,
aph3,MiB,
# previous results, unexpected

My conclusion for now would be:
- if you don't know the field separator
-- if there are only ASCII characters: use the POSIX collation
-- if there are other (i18n) characters: I don't have a solution
- if you know the field separator
-- specify it in the sort command

Regards,
Robert

-- 
Archive: http://lists.debian.org/aanlktinloxdzj9hdafvbcq8++j5v0jecf5wkhksy4...@mail.gmail.com
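The workaround above can be sketched as follows. With -t, -k1,1 the sort key is only the text before the first comma, so the separator itself never takes part in the key comparison; with GNU sort, rows whose keys are equal are then ordered by a last-resort comparison of the whole line. The names `rows`, `loc`, `first_fields`, `by_first_field` and the `SORT_LOCALE` variable are assumptions made for this sketch. It defaults to the C locale so it runs anywhere; setting SORT_LOCALE=pl_PL.UTF-8 (if that locale is installed) should reproduce the command from the post.

```shell
# Rows from the original post (hypothetical variable name).
rows='aph3,APP,
aph3_devel,TXT,
aph3,MiB,'

# Assumption: SORT_LOCALE is a made-up knob for this sketch, default C.
loc=${SORT_LOCALE:-C}

# The key that -t, -k1,1 selects: the text before the first comma.
first_fields=$(printf '%s\n' "$rows" | cut -d, -f1)
printf '%s\n' "$first_fields"

# Sort on that key only; the comma is a delimiter, not part of the key.
by_first_field=$(printf '%s\n' "$rows" | LC_ALL=$loc sort -t, -k1,1)
printf '%s\n' "$by_first_field"
```

Further keys (for example -k2,2) can be added in the same way when the second column should decide ties explicitly.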
Re: Locales/sort bug
On Thu, 04 Nov 2010 21:23:27 +0100, Rob Gom wrote:

> [cut]
>
>> I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
>> understand that my locale setting dictates underscore (_) comes
>> first than comma (,) symbol.
>>
>> As per man sort page:
>>
>> *** WARNING *** The locale specified by the environment affects sort
>> order. Set LC_ALL=C to get the traditional sort order that uses
>> native byte values.
>>
>> Do you think that is a bug? :-?
>
> If so, why do I get order comma, underscore, comma? Even better,
> comma+quote+A, underscore+d,comma+quote+M. I don't get it...

Mmm... you're right, I missed the first line :-?

Heck, it's even weirder with this sequence:

aph3,z
aph3_devel,a
aph3,b

It gets sorted as:

aph3,b
aph3_devel,a
aph3,z

I'm trying to reverse-engineer the logic behind the sort but I can't
see it. Maybe it is done randomly? Very curious, indeed.

Greetings,

-- 
Camaleón

-- 
Archive: http://lists.debian.org/pan.2010.11.04.22.55...@gmail.com
Re: Locales/sort bug
On Thu, Nov 04, 2010 at 10:55:53PM +, Camaleón wrote:

> On Thu, 04 Nov 2010 21:23:27 +0100, Rob Gom wrote:
>
>> [cut]
>>
>>> I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
>>> understand that my locale setting dictates underscore (_) comes
>>> first than comma (,) symbol.
>>>
>>> As per man sort page:
>>>
>>> *** WARNING *** The locale specified by the environment affects
>>> sort order. Set LC_ALL=C to get the traditional sort order that
>>> uses native byte values.
>>>
>>> Do you think that is a bug? :-?
>>
>> If so, why do I get order comma, underscore, comma? Even better,
>> comma+quote+A, underscore+d,comma+quote+M. I don't get it...
>
> Mmm... you're right, I missed the first line :-?
>
> Heck, it's even weirder with this sequence:
>
> aph3,z
> aph3_devel,a
> aph3,b
>
> It gets sorted as:
>
> aph3,b
> aph3_devel,a
> aph3,z
>
> I'm trying to reverse-engineer the logic behind the sort but I can't
> see it. Maybe it is done randomly? Very curious, indeed.

It just seems to ignore certain characters. Try filtering the output
through, for example, 's/[_||,]//g' and then you get it in the right
order.

David

-- 
Archive: http://lists.debian.org/20101104233647.ga2...@gennes.augarten
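David's suggestion can be checked with a short sketch: strip underscores and commas from Camaleón's three rows, then sort what is left by plain byte value. The stripped rows come out in the same relative order the locale-aware sort produced for the original rows (the ",b" row, then the "_devel,a" row, then the ",z" row). The variable names are hypothetical, and the character class is simplified to `[_,]`; this imitates the observed result and is not a description of what the locale does internally.

```shell
# The sequence from Camaleón's message (hypothetical variable name).
seq_rows='aph3,z
aph3_devel,a
aph3,b'

# Drop underscores and commas, then sort the remainder by byte value.
stripped_sorted=$(printf '%s\n' "$seq_rows" | sed 's/[_,]//g' | LC_ALL=C sort)
printf '%s\n' "$stripped_sorted"
```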
Re: Locales/sort bug
Camaleón wrote:

> I'm trying to reverse-engineer the logic behind the sort but I can't
> see it. Maybe it is done randomly? Very curious, indeed.

It is dictionary sort ordering as specified by the locale. Case is
folded and punctuation is (mostly) ignored.

Personally, I always set the following in my ~/.bashrc file:

export LANG=en_US.UTF-8
export LC_COLLATE=C

Bob
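As a rough illustration of Bob's description, sort's own -d (dictionary order: only blanks and alphanumerics count) and -f (fold lower case to upper case) options can imitate the effect inside the C locale. On the three-line example from earlier in the thread, this reproduces the order Robert reported for pl_PL, while plain byte order does not. This is only an approximation under that assumption: real locale collation uses the locale's own weight tables rather than these two flags, and the variable names are made up for the sketch.

```shell
# The three-line example from the thread, deliberately shuffled.
rows='_d
,M
,A'

# Plain byte order: both comma rows come first, and 'M' sorts before 'd'.
plain=$(printf '%s\n' "$rows" | LC_ALL=C sort)
printf 'byte order:\n%s\n' "$plain"

# Ignore punctuation (-d) and fold case (-f): the keys become A, D, M.
dict_fold=$(printf '%s\n' "$rows" | LC_ALL=C sort -d -f)
printf 'dictionary order, case folded:\n%s\n' "$dict_fold"
```

Case folding matters here: stripping punctuation alone would still place "M" (byte 77) before "d" (byte 100).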