On 10/08/2013 03:18 PM, Gabriel Gaster wrote:
> Hello all,
>
> I have a question about the behavior of sort -n.
>
> The premise of the question I asked on stackoverflow here
> (http://stackoverflow.com/questions/19228968/unix-sort-n-t-gives-unexpected-result)
>

Rather than making us chase a link, you could have posted an actual
example here:

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

By the way, if you use 'sort --debug', you'll learn a lot more about
what sort is actually doing:

$ cat <<\EOF | LC_ALL=C sort -n --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF
sort: using simple byte comparison
12,13.080339685
__
_______________
12,14.1531049905
__
________________
12,26.7613447051
__
________________
12,50.4592437035
__
________________
58,1.49270399401
__
________________
59,0.000192136419373
__
____________________
59,0.00182092924724
__
___________________
59,1.49270399401
__
________________
60,0.00182092924724
__
___________________
60,1.49270399401
__
________________

In the C locale, a numeric sort stops at the first non-numeric
character, and since the C locale does not have thousands separators,
it stops at the comma.

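To see the C-locale rule in isolation, here is a minimal sketch, assuming a POSIX shell and GNU sort. The three input lines and the function name c_locale_numeric_sort are made up for illustration and are not from the original question:

```shell
# Hypothetical helper: numeric sort of stdin with the locale pinned to C.
# In the C locale, -n reads only the leading digits of each line, so the keys
# below are 12, 12 and 5; the text after the comma plays no part in the
# numeric comparison.
c_locale_numeric_sort() {
    LC_ALL=C sort -n
}

# Hypothetical input. Lines whose numeric keys tie (both 12) fall back to a
# byte-by-byte comparison of the whole line, so "12,10" lands before "12,9".
printf '%s\n' '12,9' '12,10' '5,100' | c_locale_numeric_sort
```

The tie-break is the part that tends to surprise people: the second column only looks numerically sorted when its byte order happens to agree with its numeric order.
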
$ cat <<\EOF | sort -n --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF
sort: using ‘en_US.UTF-8’ sorting rules
58,1.49270399401
________________
________________
59,0.000192136419373
____________________
____________________
59,0.00182092924724
___________________
___________________
59,1.49270399401
________________
________________
60,0.00182092924724
___________________
___________________
60,1.49270399401
________________
________________
12,13.080339685
_______________
_______________
12,14.1531049905
________________
________________
12,26.7613447051
________________
________________
12,50.4592437035
________________
________________

In the en_US.UTF-8 locale, thousands separators exist, so the numeric
parser treats the comma as part of the number and keeps going until the
first character that is truly non-numeric (yeah, you aren't really using
comma as a thousands separator, but such is life).

And finally, look what happens when you explicitly tell sort to stop
looking at the boundary of the first field, rather than relying on the
implied -k1, which starts at the first field and keeps reading until a
non-numeric character:

$ cat <<\EOF | sort -n -k1,1 --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF
sort: using ‘en_US.UTF-8’ sorting rules
12,13.080339685
__
_______________
12,14.1531049905
__
________________
12,26.7613447051
__
________________
12,50.4592437035
__
________________
58,1.49270399401
__
________________
59,0.000192136419373
__
____________________
59,0.00182092924724
__
___________________
59,1.49270399401
__
________________
60,0.00182092924724
__
___________________
60,1.49270399401
__
________________

>
> Can someone shed more light into this ?
> I'm also not sure if there is an
> existing conversation about this,

Yes, it's a FAQ:

https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

and sort is doing what POSIX requires given your particular machine's
locale definitions, which in turn describe how collation and numeric
parsing behave in that locale. Except for the C locale, different
vendors have tended to have different rules, even for locales that are
otherwise named the same.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
