On 02/17/2011 01:46 PM, Bob Harris wrote:
> Howdy,
>
> (note: I know I should give you version information with this, but (1) I
> am not sure that this message will be read by anyone, and (2) I think
> the problem probably transcends versions. If I get a response and the
> actual version is important, I will take the time to find it.)

Thanks for the report, and you are correct that your issue transcends
versions. However, if you use coreutils 8.6 or newer (the latest is
8.10), then the new --debug option would have helped you.

> I have a file of genomic short sequence info in which it so happens that
> two of my sort key values are similar. The two keys are
>   HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
>   HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
> As you can see, these are identical if one removes the colons.

Which sounds like exactly what sort does when you are sorting in the
en_US.UTF-8 locale.

> I have tried several different options but none seem to work. -d seems
> to be the default, and it has the behavior indicated above. -n fails
> completely. -g also fails. Reading the man page, I don't see any other
> options to control the comparison function.

Then you missed this part (in the sort man page, which is in turn
generated from 'sort --help'):

  *** WARNING *** The locale specified by the environment affects sort
  order. Set LC_ALL=C to get the traditional sort order that uses native
  byte values.

> I understand *why* -d considers these two keys equal. What I don't
> understand is why there is no option that says "order them
> lexicographically".

That option is your set of locale-specific environment variables. Why
it's not an explicit option is due to historical accident (that's the
way POSIX specified it). Maybe GNU sort should add a
--collate-locale=... option as an extension that overrides LC_ALL, but
that seems a bit like bloat, and doesn't buy much over using the
standardized means of choosing collation sequencing.

> Is there a hidden sort option that will do what I need?

Yep - try 'LC_ALL=C sort ...' to see the difference.

> I'm pretty sure I'm not the first person to run into this problem.

You're not.
It's a FAQ:
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

--
Eric Blake   [email protected]    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
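
For anyone who wants to see the locale effect described above, here is a minimal sketch using the two keys from the report. It assumes GNU sort from coreutils 8.6 or newer (for --debug) and that the en_US.UTF-8 locale is installed; where that locale is missing, the first command will most likely just behave like the byte-order one. The variable names are only illustrative.

```shell
# The two keys from the report; they are identical once the colons are removed.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# Locale-aware collation (assumes en_US.UTF-8 is available on this system).
# Punctuation such as ':' tends to carry little or no weight here.
printf '%s\n' "$keys" | LC_ALL=en_US.UTF-8 sort

# --debug (coreutils 8.6+) underlines the part of each line used as the key
# and reports on stderr which collation rules are in effect.
printf '%s\n' "$keys" | LC_ALL=C sort --debug

# Traditional byte-value comparison: every character counts, so '1' (0x31)
# sorts before ':' (0x3A) and the ':5:21:' key comes first.
c_sorted=$(printf '%s\n' "$keys" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

Setting LC_ALL on the command line like this affects only that one sort invocation, so the rest of the shell session keeps its usual locale.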
