OK, it turns out that LANG=en_US.UTF-8 is set by default. With LANG=C or LC_ALL=C, I get the expected sort order, shown below. Where does that suggest the bug lies? Where is the locale-specific comparison code?

sampleId-100,0.125
sampleId-1000,1.0
sampleId-1002,0.25
sampleId-1002,0.5
sampleId-1004,0.25
sampleId-1005,0.0625
sampleId-1005,0.125
sampleId-1006,0.125
sampleId-1007,0.125
sampleId-1008,1.0
sampleId-101,0.0625
sampleId-1010,0.0625
sampleId-1010,1.0
sampleId-1011,0.0625
sampleId-1011,0.125
sampleId-1012,0.125
sampleId-1012,0.25
sampleId-1013,1.0
sampleId-1014,1.0
sampleId-1015,1.0
sampleId-1017,1.0
sampleId-1018,0.0625
sampleId-1018,0.125
sampleId-1019,1.0
sampleId-102,0.0625
sampleId-102,0.5
sampleId-1020,1.0
sampleId-1023,1.0
sampleId-1024,0.125
sampleId-978,1.0
sampleId-979,1.0
sampleId-98,1.0
sampleId-980,1.0
sampleId-981,0.0625
sampleId-981,0.25
sampleId-982,1.0
sampleId-984,0.125
sampleId-984,0.5
sampleId-985,0.0625
sampleId-985,0.5
sampleId-986,1.0
sampleId-987,1.0
sampleId-988,1.0
sampleId-99,1.0
sampleId-990,1.0
sampleId-991,0.25
sampleId-992,0.125
sampleId-992,0.25
sampleId-995,0.0625
sampleId-995,0.25
sampleId-996,0.125
sampleId-996,0.25
sampleId-997,0.125
sampleId-997,0.5

2009/10/29 Luke Hutchison <[email protected]>:
> Hi Pádraig,
>
> As stated, "The following is the output of GNU sort (without any
> switches)" -- i.e. I used the defaults, and did not specify any
> commandline switches. If as you say, by default the whole line is the
> sort key, and if default sorting is lexicographic order, how are the
> following snippets from the sorted output possibly correct?
>
> sampleId-1010,0.0625
> sampleId-101,0.0625
> sampleId-1010,1.0
>
> sampleId-980,1.0
> sampleId-98,1.0
> sampleId-981,0.0625
>
> sampleId-990,1.0
> sampleId-99,1.0
> sampleId-991,0.25
>
> Based on ASCII encoding (',' < '0' < '1'), I believe these should be:
>
> sampleId-101,0.0625
> sampleId-1010,0.0625
> sampleId-1010,1.0
>
> sampleId-98,1.0
> sampleId-980,1.0
> sampleId-981,0.0625
>
> sampleId-99,1.0
> sampleId-990,1.0
> sampleId-991,0.25
>
> Even if in some weird locale, ',' > '0', or some other weird thing
> were true, the two lines "sampleId-1010,0.0625" and
> "sampleId-1010,1.0" should be grouped together either before or after
> "sampleId-101,0.0625", because they share a common prefix
> "sampleId-1010" -- but they are separated. Similarly,
> "sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be
> separated by "sampleId-99,1.0", because there is no way in any locale
> that '0' < ',' < '1'.
>
> I was led to think that sorting happened field-wise (not line-wise) by
> default by the man page, which says, "-t , --field-separator=SEP : use
> SEP instead of non-blank to blank transition". It would be helpful to
> explicitly add to the description of "-k" that "If no key is given,
> the whole line is used as the key".
>
> Thanks,
> Luke
>
>
> 2009/10/29 Pádraig Brady <[email protected]>
>>
>> Luke Hutchison wrote:
>> > Hi,
>> >
>> > The following is the output of GNU sort (without any switches) on an
>> > unsorted file. Numerous errors (of the same variety) seem present in the
>> > ordering. I am using coreutils-7.2-4.fc11.x86_64. Problems are shown in
>> > red.
>>
>> You need to specify the sort command you used.
>> Does this sort your data correctly?
>>
>> sort -t, -k1,1V
>>
>> > Additionally, there probably needs to be a switch added to sort that uses
>> > the entire line as the sort key,
>>
>> It does that by default
>>
>> > not blank-to-non-blank transition
>>
>> Note also the 'b' option.
>>
>> cheers,
>> Pádraig.
>
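
A note on the two orderings discussed in this thread: as far as I know, GNU sort compares lines with strcoll() whenever the locale is not C/POSIX, so the collation rules would come from the locale definition in the C library rather than from sort itself. The Python sketch below is only a rough approximation meant to show why the two orders differ. It does not call the real collation code. The helper names (byte_order, punctuation_blind_key, punctuation_blind_order) are hypothetical, and the idea that the locale skips punctuation on its first comparison pass is an assumption about en_US collation, not a description of the actual algorithm.

```python
import string

# Six lines taken from the snippets quoted in this thread.
lines = [
    "sampleId-1010,0.0625",
    "sampleId-101,0.0625",
    "sampleId-1010,1.0",
    "sampleId-980,1.0",
    "sampleId-98,1.0",
    "sampleId-981,0.0625",
]


def byte_order(items):
    """Compare raw bytes, which is what LC_ALL=C sorting amounts to."""
    return sorted(items, key=lambda s: s.encode("ascii"))


def punctuation_blind_key(s):
    """ASSUMPTION: approximate a locale that ignores punctuation on its
    first comparison pass. The original string is only a tie-breaker."""
    stripped = "".join(ch for ch in s if ch not in string.punctuation)
    return (stripped, s)


def punctuation_blind_order(items):
    """Sort with '-', ',' and '.' removed before comparing."""
    return sorted(items, key=punctuation_blind_key)


if __name__ == "__main__":
    print("byte order (like LC_ALL=C):")
    for line in byte_order(lines):
        print("  " + line)
    print("punctuation ignored (rough locale approximation):")
    for line in punctuation_blind_order(lines):
        print("  " + line)
```

With punctuation removed, "sampleId-1010,0.0625" becomes "sampleId101000625" and "sampleId-101,0.0625" becomes "sampleId10100625", so the first sorts ahead of the second. That is the same order as in the quoted en_US output, while the byte-wise comparison reproduces the order expected from ASCII values.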
