Re: Sort order bug in GNU sort

Luke Hutchison Thu, 29 Oct 2009 17:49:14 -0700

OK, it turns out that LANG is set to en_US.UTF-8 by default.  With
LANG=C or LC_ALL=C, I get the expected sort order, as shown below.
Where does that suggest the bug lies?  Where is the locale-specific
comparison code?


sampleId-100,0.125
sampleId-1000,1.0
sampleId-1002,0.25
sampleId-1002,0.5
sampleId-1004,0.25
sampleId-1005,0.0625
sampleId-1005,0.125
sampleId-1006,0.125
sampleId-1007,0.125
sampleId-1008,1.0
sampleId-101,0.0625
sampleId-1010,0.0625
sampleId-1010,1.0
sampleId-1011,0.0625
sampleId-1011,0.125
sampleId-1012,0.125
sampleId-1012,0.25
sampleId-1013,1.0
sampleId-1014,1.0
sampleId-1015,1.0
sampleId-1017,1.0
sampleId-1018,0.0625
sampleId-1018,0.125
sampleId-1019,1.0
sampleId-102,0.0625
sampleId-102,0.5
sampleId-1020,1.0
sampleId-1023,1.0
sampleId-1024,0.125
sampleId-978,1.0
sampleId-979,1.0
sampleId-98,1.0
sampleId-980,1.0
sampleId-981,0.0625
sampleId-981,0.25
sampleId-982,1.0
sampleId-984,0.125
sampleId-984,0.5
sampleId-985,0.0625
sampleId-985,0.5
sampleId-986,1.0
sampleId-987,1.0
sampleId-988,1.0
sampleId-99,1.0
sampleId-990,1.0
sampleId-991,0.25
sampleId-992,0.125
sampleId-992,0.25
sampleId-995,0.0625
sampleId-995,0.25
sampleId-996,0.125
sampleId-996,0.25
sampleId-997,0.125
sampleId-997,0.5
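
For anyone wanting to reproduce the difference, here is a minimal sketch
using three of the lines above.  The variable names are only for
illustration.  The C-locale run compares bytes, so ',' (0x2C) sorts
before '0' (0x30).  The second run depends on the collation rules the C
library provides for en_US.UTF-8, and assumes that locale is installed,
so its output may differ from system to system.

```shell
# Three lines from the listing, deliberately fed in the "wrong" order.
input=$(printf '%s\n' \
    'sampleId-1010,0.0625' \
    'sampleId-101,0.0625' \
    'sampleId-1010,1.0')

# Byte-wise comparison: the whole line is the key, compared as raw bytes.
c_sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)
printf 'LC_ALL=C:\n%s\n' "$c_sorted"

# Locale-aware comparison: the order depends on the locale's collation data.
# If en_US.UTF-8 is not installed, sort typically falls back to the C locale.
locale_sorted=$(printf '%s\n' "$input" | LC_ALL=en_US.UTF-8 sort 2>/dev/null)
printf 'LC_ALL=en_US.UTF-8:\n%s\n' "$locale_sorted"
```

Comparing the two outputs side by side shows whether the locale's
collation is what moves "sampleId-101," between the two "sampleId-1010,"
lines.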


2009/10/29 Luke Hutchison <[email protected]>:
> Hi Pádraig,
> As stated, "The following is the output of GNU sort (without any
> switches)" -- i.e. I used the defaults, and did not specify any
> commandline switches.  If as you say, by default the whole line is the
> sort key, and if default sorting is lexicographic order, how are the
> following snippets from the sorted output possibly correct?
>
> sampleId-1010,0.0625
> sampleId-101,0.0625
> sampleId-1010,1.0
>
> sampleId-980,1.0
> sampleId-98,1.0
> sampleId-981,0.0625
>
> sampleId-990,1.0
> sampleId-99,1.0
> sampleId-991,0.25
>
> Based on ASCII encoding (',' < '0' < '1'), I believe these should be:
>
> sampleId-101,0.0625
> sampleId-1010,0.0625
> sampleId-1010,1.0
>
> sampleId-98,1.0
> sampleId-980,1.0
> sampleId-981,0.0625
>
> sampleId-99,1.0
> sampleId-990,1.0
> sampleId-991,0.25
>
> Even if in some weird locale, ',' > '0', or some other weird thing
> were true, the two lines "sampleId-1010,0.0625" and
> "sampleId-1010,1.0" should be grouped together either before or after
> "sampleId-101,0.0625", because they share a common prefix
> "sampleId-1010" -- but they are separated.  Similarly,
> "sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be
> separated by "sampleId-99,1.0", because there is no way in any locale
> that '0' < ',' < '1'.
>
> I was led to think that sorting happened field-wise (not line-wise) by
> default by the man page, which says, "-t , --field-separator=SEP : use
> SEP instead of non-blank to blank transition".  It would be helpful to
> explicitly add to the description of "-k" that "If no key is given,
> the whole line is used as the key".
>
> Thanks,
> Luke
>
>
> 2009/10/29 Pádraig Brady <[email protected]>
>>
>> Luke Hutchison wrote:
>> > Hi,
>> >
>> > The following is the output of GNU sort (without any switches) on an
>> > unsorted file.  Numerous errors (of the same variety) seem present in the
>> > ordering.  I am using coreutils-7.2-4.fc11.x86_64.  Problems are shown in
>> > red.
>>
>> You need to specify the sort command you used.
>> Does this sort your data correctly?
>>
>> sort -t, -k1,1V
>>
>> > Additionally, there probably needs to be a switch added to sort that uses
>> > the entire line as the sort key,
>>
>> It does that by default
>>
>> > not blank-to-non-blank transition
>>
>> Note also the 'b' option.
>>
>> cheers,
>> Pádraig.
>
