tag 12295 + notabug
close 12295
stop

more info below...
On 08/28/2012 04:24 PM, Lubos Kaspar wrote:
> Dear GNU Coreutils Developers,
>
> I may have found a bug in GNU sort 5.97 as ported into RHEL 5.8:
>
> : $ cat /etc/redhat-release; uname -sr
> : Red Hat Enterprise Linux Client release 5.8 (Tikanga)
> : Linux 2.6.18-308.1.1.el5
>
> : $ sort --version
> : sort (GNU coreutils) 5.97
> : Copyright (C) 2006 Free Software Foundation, Inc.
> : This is free software. You may redistribute copies of it under the terms of
> : the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
> : There is NO WARRANTY, to the extent permitted by law.
> :
> : Written by Mike Haertel and Paul Eggert.
>
> : $ man sort|grep bug
> : Report bugs to <bug-coreutils@gnu.org>.
>
> It comes using LANG=cs_CZ.iso88592 (and verified also for
> LANG=cs_CZ.utf8 and e.g. for LANG=de_DE.iso88591 or for
> simple LANG=en_US, too) even when using only US-ASCII characters.
>
> Let me give you a very simple example when sorting some surnames
> concatanated by a minus (hyphen) with related first name initials
> ('Novak' is the most common Czech surname and 'Novakova' is
> a modified form used for women):
>
> : $ cat x #content origin in reverse order than wanted
> : Novakova-V
> : Novak-P
> : Novak-L
> : Novak-J
>
> : $ LANG= sort x #sort it without LANG setting (expected result)
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
>
> : $ LANG=C sort x #sort it with LANG=C setting (expected result)
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
>
> : $ LANG=cs_CZ.iso88592 sort x #sort it using usual locale (odd result)
> : Novak-J
> : Novak-L
> : Novakova-V
> : Novak-P
>
> The same results can be obtained e.g. for using dot as a separator
> instead of minus (hyphen). No matter using -d and/or -f and/or -s, too.
>
> Of course, it could be quite easily 'workarounded' in this case, e.g.:
>
> : $ LANG=cs_CZ.iso88592 sort -t- -k1,1 -k2,2 x
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
>
> but it is probably impossible to do it commonly.
>
> Unfortunately it is also generally impossible to use LANG= or LANG=C
> as some sets of data require proper sorting respecting local traditions
> (e.g. to rank 'ch' between 'h' and 'i', not between 'cg' and 'ci',
> consonants with carons after those without carons etc.) which should
> work just using LANG=cs_CZ.
>
> If it is not a bug it would be very kind of you to send me some
> explanation and an advice how to use 'sort' to get regular results.
> In such a case please accept my deep apologies for disturbing you.
>
> Thank you very much for your attention and understanding.

Thanks for the detailed report.
However this just seems like a case of:
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Quoting from there:

  "Most of the language specific locales have tables that specify
  the sort behavior to ignore punctuation and to fold case.
  This is counter intuitive to most long time computer users!"

This minimal reproducer shows the same behaviour in the en_US locale:

  $ printf '%s\n' xo-V x-P x-L x-J | LC_ALL=en_US sort

Yes this is daft default behavior, and your workaround
seems like the best option for now.
Perhaps in future we will be able to support
more fine grained control over the sorting order.

cheers,
Pádraig.
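
To make the FAQ's point concrete, here is a rough sketch that imitates "ignore punctuation and fold case" by sorting on a stripped, lowercased key. It is only an approximation for plain ASCII input and is not how the locale's collation tables actually work. The function name emulate_locale_sort is made up for this illustration, and everything runs under LC_ALL=C so the result should not depend on which locales are installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): approximates an ordering that
# ignores punctuation and folds case. Illustration only, ASCII input assumed.
emulate_locale_sort() {
  # 1. Decorate: prefix each line with a key made of its lowercased
  #    letters and digits only, separated from the line by a tab.
  # 2. Sort bytewise on that key alone.
  # 3. Undecorate: drop the key and keep the original line.
  LC_ALL=C awk '{ key = tolower($0); gsub(/[^a-z0-9]/, "", key); print key "\t" $0 }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
    cut -f2-
}

# The data from the report, in its original order.
printf '%s\n' Novakova-V Novak-P Novak-L Novak-J | emulate_locale_sort
# Prints:
#   Novak-J
#   Novak-L
#   Novakova-V
#   Novak-P
```

With the hyphen dropped, the keys compared are novakj, novakl, novakovav and novakp, so "Novakova-V" lands before "Novak-P" simply because 'o' sorts before 'p'. That matches the odd result in the report, and it is why splitting on the hyphen with -t- -k1,1 -k2,2 restores the expected order.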