On Thu, Oct 29, 2009 at 5:51 PM, Eric Blake <[email protected]> wrote:
> [please don't top-post on technical lists]

Sorry about the lack of mailing list etiquette; the sort manpage doesn't make it clear that [email protected] is a mailing list...

> Well, that looks correct to me, if your current locale specifies that
> punctuation is ignored during collation (that is, you are getting: 101000
> < 101006 < 101010, after ignoring , and .).
>
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
> Try 'LC_ALL=C sort' to see the difference.

I don't know why punctuation is not treated as a space in the en_US locale, or for that matter why the decision was made to ignore spaces in en_US. I would love to see the background thinking that went into that decision; the sorted order "San Juan, Santa Clara, San Teodoro" doesn't make intuitive sense to me. I note that the Wikipedia page on Collation says that sorting is done both ways (with or without spaces) but that ignoring spaces is supposedly more common.

Anyway, thanks for explaining, and sorry that I didn't see the explanation in the FAQ.

Given that (according to the FAQ) "This one question arises almost more often than any other", and given the inconvenience of changing locales in a script just so sort will work right, wouldn't it make sense to add an optional switch that effectively sets LC_ALL=C for the sort?

I now note the warning in the man page:

"*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values."

Before hitting this, I had no idea the locale would affect non-accented characters. Could the manpage please be extended to give a simple example comparing the sort order in the en_US locale with the C locale, to make this much clearer?

Thanks,
Luke
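
For illustration, the kind of side-by-side comparison requested above might look like the sketch below. The three input values are made up (chosen so that they reduce to 101000, 101006 and 101010 once punctuation is dropped), the helper name `sample` is hypothetical, and the first command assumes an en_US.UTF-8 locale is installed. The en_US result depends on the system's locale definitions, so only the C-locale result is shown.

```shell
# Hypothetical sample input mixing digits and punctuation.
sample() {
  printf '%s\n' '10.10.10' '101,006' '10.10.00'
}

# Locale-aware collation; assumes en_US.UTF-8 is installed.
# Punctuation may be ignored here, depending on the locale definition.
sample | LC_ALL=en_US.UTF-8 sort

# Traditional ordering by native byte values.
sample | LC_ALL=C sort
# Prints:
#   10.10.00
#   10.10.10
#   101,006
```

In the C locale, "." (byte 0x2E) sorts before "1" (byte 0x31), so both dotted values come ahead of "101,006". A locale that ignores punctuation would instead be expected to place "101,006" between them, as in the quoted explanation.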
