Re: Locales/sort bug

2010-11-05 Thread Camaleón
On Fri, 05 Nov 2010 00:36:47 +0100, David Jardine wrote:

 On Thu, Nov 04, 2010 at 10:55:53PM +, Camaleón wrote:

(...)

 Heck, it's even weirder with this sequence:
 
 aph3,z
 aph3_devel,a
 aph3,b
 
 It gets sorted as:
 
 aph3,b
 aph3_devel,a
 aph3,z
 
 I'm trying to reverse-engineer the logic behind the sort but I
 can't see it. Maybe it is done randomly? Very curious, indeed.
 
 It just seems to ignore certain characters.  Try filtering the output
 through, for example, 's/[_||,]//g' and then you get it in the right
 order.

Yes, the sort documentation and man page warn about that (and advise 
avoiding custom locales when using it), but what does it really do (and 
how) when a locale is in use? Why rank the comma first and then give 
the underscore a higher priority? :-?

Greetings,

-- 
Camaleón


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/pan.2010.11.05.07.07...@gmail.com



Locales/sort bug

2010-11-04 Thread Rob Gom
Hi all,
do you think it's a bug in either libc or coreutils (sort)?

$ cat test.csv
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=C sort test.csv # expected
aph3,APP,
aph3,MiB,
aph3_devel,TXT,

$ LC_ALL=pl_PL sort test.csv  # why is that?
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

Could anyone give me a hint? I know that this is related to LC_COLLATE
(I set LC_ALL as a shortcut), but I don't know whether it is my fault
or an upstream bug.

I'd appreciate any comments.

Regards,
Robert


Archive: 
http://lists.debian.org/aanlktimst_3jnpkwahyv81c4=a07udqo5unndm47h...@mail.gmail.com



Re: Locales/sort bug

2010-11-04 Thread Camaleón
On Thu, 04 Nov 2010 20:29:02 +0100, Rob Gom wrote:

 do you think it's a bug in either libc or coreutils (sort)?
 
 $ cat test.csv
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,
 
 $ LC_ALL=C sort test.csv # expected
 aph3,APP,
 aph3,MiB,
 aph3_devel,TXT,
 
 $ LC_ALL=pl_PL sort test.csv  # why is that?
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,
 
 $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,
 
 Could anyone give me a hint? I know that this is LC_COLLATE related
 (LC_ALL as shorter version), but don't know whether it is my fault or
 upstream bug.

I'm also getting that behaviour (locale set to es_ES.UTF-8), so I 
take it that my locale setting dictates that the underscore (_) comes 
before the comma (,).

As per the sort man page:

*** WARNING *** The locale specified by the environment affects sort  
order. Set LC_ALL=C to get the traditional sort order that uses native 
byte values.
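
The "native byte values" mentioned in the warning can be checked from the shell. The following is only a sketch, assuming an ASCII-compatible codeset; byte_value is a hypothetical helper name, and it relies on the POSIX printf rule that a leading quote in a numeric argument yields the code of the character that follows.

```shell
# Print the numeric value of a single character.
# byte_value is a hypothetical helper, not part of coreutils.
byte_value() {
  printf '%d\n' "'$1"
}

byte_value ','   # prints 44
byte_value '_'   # prints 95
```

Since 44 is less than 95, a byte-wise comparison (LC_ALL=C) places "aph3," before "aph3_", which is the order the original poster expected.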

Do you think that is a bug? :-?

Greetings,

-- 
Camaleón


Archive: http://lists.debian.org/pan.2010.11.04.20.16...@gmail.com



Re: Locales/sort bug

2010-11-04 Thread Ron Johnson

On 11/04/2010 02:29 PM, Rob Gom wrote:

Hi all,
do you think it's a bug in either libc or coreutils (sort)?

$ cat test.csv
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=C sort test.csv # expected
aph3,APP,
aph3,MiB,
aph3_devel,TXT,

$ LC_ALL=pl_PL sort test.csv  # why is that?
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

$ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
aph3,APP,
aph3_devel,TXT,
aph3,MiB,

Could anyone give me a hint? I know that this is LC_COLLATE related
(LC_ALL as shorter version), but don't know whether it is my fault or
upstream bug.

I'd appreciate any comments.



While it *might* be an upstream bug, it's unlikely.  (The first 
thing I learned in my first CompSci class is that it's not the 
compiler's fault that my program doesn't work...)


You just don't know what the collating sequence of the Polish locale is.

--
Seek truth from facts.



Archive: http://lists.debian.org/4cd31538.7060...@cox.net



Re: Locales/sort bug

2010-11-04 Thread Rob Gom
[cut]

 I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
 understand that my locale setting dictates underscore (_) comes first
 than comma (,) symbol.

 As per man sort page:

 *** WARNING *** The locale specified by the environment affects sort
 order. Set LC_ALL=C to get the traditional sort order that uses native
 byte values.

 Do you think that is a bug? :-?

 Greetings,

 --
 Camaleón

If so, why do I get order comma, underscore, comma? Even better,
comma+quote+A, underscore+d,comma+quote+M. I don't get it...

Regards,
Robert


Archive: 
http://lists.debian.org/aanlkti=vm8um=jxkzigysytyzd97f39ysynofbhhc...@mail.gmail.com



Re: Locales/sort bug

2010-11-04 Thread Sven Joachim
On 2010-11-04 20:29 +0100, Rob Gom wrote:

 Hi all,
 do you think it's a bug in either libc or coreutils (sort)?

 $ cat test.csv
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,

 $ LC_ALL=C sort test.csv # expected
 aph3,APP,
 aph3,MiB,
 aph3_devel,TXT,

 $ LC_ALL=pl_PL sort test.csv  # why is that?
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,

 $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
 aph3,APP,
 aph3_devel,TXT,
 aph3,MiB,

 Could anyone give me a hint? I know that this is LC_COLLATE related
 (LC_ALL as shorter version), but don't know whether it is my fault or
 upstream bug.

 I'd appreciate any comments.

This is covered by the coreutils FAQ:
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Sven


Archive: http://lists.debian.org/87lj58n8em@turtle.gmx.de



Re: Locales/sort bug

2010-11-04 Thread Rob Gom
[cut]

 This is covered by the coreutils FAQ:
 http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

 Sven

Thanks for all the answers.

How can I tell whether the collation is defined correctly? I understand
how LC_COLLATE influences sorting, but I am not sure whether this
behaviour is right.
The simplest example that causes the weird behaviour is:

$ cat test2.csv
,A
_d
,M


$ LC_ALL=pl_PL sort test2.csv # and many other LC_COLLATE variants, other than C/POSIX
,A
_d
,M

To get this behaviour, ',' would have to be defined in the collation
definition as a single entity that is equal in ordering to '_'. I have
no other explanation for it. Unfortunately, I am not good enough to
understand or verify the collation definitions in /usr/share/i18n :)

Regards,
Robert


Archive: 
http://lists.debian.org/aanlktik3fubmr0oclochjcosyqwhcahgsjvohtfrj...@mail.gmail.com



Re: Locales/sort bug

2010-11-04 Thread Rob Gom
One more thing.
If I set LC_COLLATE to C/POSIX, special characters sort fine, but I
lose the Polish character ordering.
If I set LC_COLLATE to pl_PL.UTF-8, the Polish character ordering is
fine, but sorting goes crazy with special characters.
Is it possible to have both?

carra...@laptop-rg:/tmp$ cat test2.csv
,A
_d
,M
a
ą
b
ż
ć
z
carra...@laptop-rg:/tmp$ LC_ALL=POSIX sort test2.csv
,A
,M
_d
a
b
z
ą
ć
ż

# above - correct special characters, Polish in wrong order

carra...@laptop-rg:/tmp$ LC_ALL=pl_PL.UTF-8 sort test2.csv
a
,A
ą
b
ć
_d
,M
z
ż

# above - correct Polish characters order, incorrect special characters

Feel free to replace 'correct' with 'expected' in my posts, I'm just
trying to understand what's under the hood.

Regards,
Robert


Archive: 
http://lists.debian.org/aanlkti=pu2np+filsqx6vcjj32sqnyajqpopqw33v...@mail.gmail.com



Re: Locales/sort bug

2010-11-04 Thread Rob Gom
I have a workaround of sorts.
When I know the field separator (which was the case in my original
example), I can use it to get around the problem:

$ LC_ALL=pl_PL.UTF-8 sort -k1,1 -t',' test.csv
aph3,APP,
aph3,MiB,
aph3_devel,TXT,
# everything fine

$ LC_ALL=pl_PL.UTF-8 sort test.csv
aph3,APP,
aph3_devel,TXT,
aph3,MiB,
# previous results, unexpected

My conclusion for now would be:
- if you don't know the field separator
-- if there are only ASCII characters - use the POSIX collation
-- if there are other characters (i18n) - I don't have a solution
- if you know the field separator
-- specify it in the sort command
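
One plausible reading of why the field-separator workaround helps: with -t',' -k1,1 the comma only delimits fields, so it never takes part in the key comparison. The sketch below prints the keys that sort would compare for the sample data; first_field is a hypothetical helper name.

```shell
# Print the first comma-separated field of each input line, i.e. the
# key that "sort -t',' -k1,1" compares.
# first_field is a hypothetical helper, not part of coreutils.
first_field() {
  cut -d',' -f1
}

printf '%s\n' 'aph3,APP,' 'aph3_devel,TXT,' 'aph3,MiB,' | first_field
# prints:
#   aph3
#   aph3_devel
#   aph3
```

The key "aph3" is a proper prefix of "aph3_devel", so both aph3 lines come first. GNU sort then breaks the tie between equal keys with a last-resort comparison of the whole lines, unless -s is given.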

Regards,
Robert


Archive: 
http://lists.debian.org/aanlktinloxdzj9hdafvbcq8++j5v0jecf5wkhksy4...@mail.gmail.com



Re: Locales/sort bug

2010-11-04 Thread Camaleón
On Thu, 04 Nov 2010 21:23:27 +0100, Rob Gom wrote:

 [cut]

 I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
 understand that my locale setting dictates underscore (_) comes
 first than comma (,) symbol.

 As per man sort page:

 *** WARNING *** The locale specified by the environment affects sort
 order. Set LC_ALL=C to get the traditional sort order that uses native
 byte values.

 Do you think that is a bug? :-?
 
 If so, why do I get order comma, underscore, comma? Even better,
 comma+quote+A, underscore+d,comma+quote+M. I don't get it...

Mmm... you're right, I missed the first line :-?

Heck, it's even weirder with this sequence:

aph3,z
aph3_devel,a
aph3,b

It gets sorted as:

aph3,b
aph3_devel,a
aph3,z

I'm trying to reverse-engineer the logic behind the sort but I can't 
see it. Maybe it is done randomly? Very curious, indeed.

Greetings,

-- 
Camaleón


Archive: http://lists.debian.org/pan.2010.11.04.22.55...@gmail.com



Re: Locales/sort bug

2010-11-04 Thread David Jardine
On Thu, Nov 04, 2010 at 10:55:53PM +, Camaleón wrote:
 On Thu, 04 Nov 2010 21:23:27 +0100, Rob Gom wrote:
 
  [cut]
 
  I'm also getting that behaviour (locale set to es_ES.UTF-8) so I
  understand that my locale setting dictates underscore (_) comes
  first than comma (,) symbol.
 
  As per man sort page:
 
  *** WARNING *** The locale specified by the environment affects sort
  order. Set LC_ALL=C to get the traditional sort order that uses native
  byte values.
 
  Do you think that is a bug? :-?
  
  If so, why do I get order comma, underscore, comma? Even better,
  comma+quote+A, underscore+d,comma+quote+M. I don't get it...
 
 Mmm... you're right, I missed the first line :-?
 
 Heck, it's even weirder with this sequence:
 
 aph3,z
 aph3_devel,a
 aph3,b
 
 It gets sorted as:
 
 aph3,b
 aph3_devel,a
 aph3,z
 
 I'm trying to reverse-engineer the logic behind the sort but I can't 
 see it. Maybe it is done randomly? Very curious, indeed.

It just seems to ignore certain characters.  Try filtering the output 
through, for example, 's/[_||,]//g' and then you get it in the right
order.
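
A minimal sketch of that experiment, using Camaleón's three sample lines. It assumes the simpler character class [_,] is enough for this data, and strip_and_sort is a hypothetical helper name. The sort runs under LC_ALL=C so the result does not depend on the installed locales.

```shell
# Delete underscores and commas, then sort the remainder byte-wise.
# strip_and_sort is a hypothetical helper, not part of coreutils.
strip_and_sort() {
  sed 's/[_,]//g' | LC_ALL=C sort
}

printf '%s\n' 'aph3,z' 'aph3_devel,a' 'aph3,b' | strip_and_sort
# prints:
#   aph3b
#   aph3devela
#   aph3z
```

The stripped lines come out in the same relative order as the locale-aware sort of the original lines (aph3,b then aph3_devel,a then aph3,z).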

David


Archive: http://lists.debian.org/20101104233647.ga2...@gennes.augarten



Re: Locales/sort bug

2010-11-04 Thread Bob Proulx
Camaleón wrote:
 I'm trying to reverse-engineer the logic behind the sort but I can't 
 see it. Maybe it is done randomly? Very curious, indeed.

It is dictionary sort ordering as specified by the locale.  Case is
folded and punctuation is (mostly) ignored.
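
As a rough illustration only (this is not how the C library implements collation), the effect can be approximated in the C locale by sorting on a key that is lower-cased and stripped of everything except ASCII letters and digits. dict_sort is a hypothetical helper name, and the sketch handles ASCII input only.

```shell
# Approximate "dictionary" ordering for ASCII input:
#   1. build a key: lower-case the line, drop all but a-z and 0-9
#   2. sort on that key byte-wise
#   3. drop the key again, leaving the original line
# dict_sort is a hypothetical helper, not part of coreutils.
dict_sort() {
  LC_ALL=C awk '{ key = tolower($0); gsub(/[^a-z0-9]/, "", key); print key "\t" $0 }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
    cut -f2-
}

printf '%s\n' ',M' '_d' ',A' | dict_sort
# prints:
#   ,A
#   _d
#   ,M
```

The keys here are "a", "d" and "m", which reproduces the order reported earlier in the thread for test2.csv under pl_PL.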

Personally I always set the following in my ~/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

Bob

