On 10/08/2013 10:18 PM, Gabriel Gaster wrote:
> Hello all,
>
> I have a question about the behavior of sort -n.
>
> The premise of the question I asked on stackoverflow here
> (http://stackoverflow.com/questions/19228968/unix-sort-n-t-gives-unexpected-result)
>
> Evidently, even if a user specifies a field-separator, the entire line is
> still treated as a key. If the entire key is not numeric, then sort -n does
> not throw any errors and seems to not do numeric sort and rather does some
> other sort (the order of which I am unclear on). This strikes me as
> unexpected behavior -- because the caller can think he's going to get numeric
> sort and not get numeric sort.
>
> As far as I can tell, specifying field-separator and calling numeric *should*
> sort numerically _if_ the key is numeric. Furthermore -- and I suppose this
> is the main thing -- if a field-separator is specified, then the key should
> default to each field and not to the entire line. Why else would one specify
> a field-separator if not to use it in this way?
>
> Can someone shed more light into this ? I'm also not sure if there is an
> existing conversation about this, if it's being changed in a later release,
> or if this is a known and long debated issue, or whatnot.
>
> I'm eager to make contributions in this regard, of course. I would mostly
> like to know the current discussion of these things and what the current
> thinking is on sort -n -t','.

The main issue here is that your input is ambiguous with respect to numbers.
When comparing numbers, the thousands separators are ignored (even though
they are misplaced according to your locale's grouping rules).

Also note that while some of the sort functionality is awkward, it's done
like that for backwards and cross compatibility reasons. Also, as you've
noticed, you would need to study the info documentation very carefully to
fully understand what's going on. So we've added the --debug option to help
one figure out what's going on (it probably should have been called
--explain, but anyway...).

So consider the following command, where we specify --debug to annotate the
part of the line being matched as a number. Also, -s is specified to avoid
the last resort sort, to simplify the illustration.

$ sort --debug -s -t, -n t.csv
sort: using ‘en_US.utf8’ sorting rules
12,1.080339685
______________
58,1.49270399401
________________
59,0.00182092924724
___________________
12,13.080339685
_______________

You can see above that the numbers are interpreted as
121... 581... 590... 1213... and sorted accordingly.

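As an aside on the -s option used above, here is a minimal sketch of what the last resort comparison does. The two input lines are made up for illustration (they are not from t.csv), and LC_ALL=C is my choice so that the result does not depend on the local locale. Both lines have the same numeric key (12), so without -s the tie is broken by comparing the whole lines, while with -s tied lines keep their input order.

```shell
# Two hypothetical lines with the same leading number (12);
# only the text after the comma differs.
input=$(printf '12,b\n12,a\n')

# Without -s: equal numeric keys fall through to a last-resort
# comparison of the whole line, so "12,a" sorts before "12,b".
with_last_resort=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# With -s: the last-resort comparison is disabled and tied lines
# stay in input order, so "12,b" remains first.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -n)

printf 'without -s:\n%s\nwith -s:\n%s\n' "$with_last_resort" "$stable"
```

This is why the --debug output is easier to read with -s: every line is ordered only by the key that is underlined.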
If you change to the C locale where there are no thousands separators:

$ LANG=C sort --debug -s -t, -n t.csv
sort: using simple byte comparison
12,13.080339685
__
12,1.080339685
__
58,1.49270399401
__
59,0.00182092924724
__

If for some reason you want to honor locale rules then you might try to add
-k1, but then you're warned about the sort spanning multiple fields:

$ sort --debug -s -t, -n -k1 t.csv
/home/padraig/git/coreutils/src/sort: using ‘en_US.utf8’ sorting rules
/home/padraig/git/coreutils/src/sort: key 1 is numeric and spans multiple fields
12,1.080339685
______________
58,1.49270399401
________________
59,0.00182092924724
___________________
12,13.080339685
_______________

So what you really want is to specify single fields like:

$ sort --debug -s -t, -n -k1,1 -k2,2 t.csv
/home/padraig/git/coreutils/src/sort: using ‘en_US.utf8’ sorting rules
12,1.080339685
__
   ___________
12,13.080339685
__
   ____________
58,1.49270399401
__
   _____________
59,0.00182092924724
__
   ________________
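
To try the comparison without depending on an installed en_US locale, the following sketch recreates the four data lines in a temporary file and runs both the whole-line key and the per-field keys in the C locale. Some assumptions on my part: the input order is a guess (only the relative order of the two "12" lines is implied by the stable C-locale output above), the temporary file stands in for t.csv, and LC_ALL=C is used instead of LANG=C because LC_ALL, when set in the environment, takes precedence over LANG.

```shell
# Recreate the sample data in a temporary file.
# The line order here is an assumption, not taken from the original t.csv.
tmp=$(mktemp)
printf '%s\n' \
  '12,13.080339685' \
  '12,1.080339685' \
  '58,1.49270399401' \
  '59,0.00182092924724' > "$tmp"

# Whole-line numeric key in the C locale: number parsing stops at the
# comma, so only the leading integer is compared; -s keeps the two
# "12" lines in input order.
whole_line=$(LC_ALL=C sort -s -t, -n "$tmp")

# One key per field: compare field 1 numerically, then use field 2
# to break ties between equal first fields.
per_field=$(LC_ALL=C sort -s -t, -n -k1,1 -k2,2 "$tmp")

printf 'whole line:\n%s\n\nper field:\n%s\n' "$whole_line" "$per_field"
rm -f "$tmp"
```

The only difference between the two results should be the order of the two lines starting with 12: the per-field run places 1.080339685 before 13.080339685, whereas the whole-line run leaves them as they appeared in the input.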
