On 12/08/2015 02:26 PM, Terry Farrah wrote:
> I have a tab-separated file that I think is already sorted on the first 3
> columns. Here is a 2-line sample in a file named foo:
>
> chr10 60379 60380 10:60380-60380 T/T
> chr10 60379 60380 10:60380-60380 G/T
>
> I try checking it with
>
> sort -s -k1,1V -k2,2n -k3,3n -c foo
>
> but the check fails:
>
> sort: foo:2: disorder: chr10 60379 60380 10:60380-60380 G/T
>
> If I sort it using the above key specification, it swaps the order of the
> lines:
>
> sort -s -k1,1V -k2,2n -k3,3n foo
>
> chr10 60379 60380 10:60380-60380 G/T
> chr10 60379 60380 10:60380-60380 T/T

Doesn't reproduce for me with Fedora's coreutils-8.23-11.fc22.x86_64:

$ printf 'chr10\t60379\t60380\t10:60380-60380\tT/T\nchr10\t60379\t60380\t10:60380-60380\tG/T\n' | sort -s -k1,1V -k2,2n -k3,3n
chr10 60379 60380 10:60380-60380 T/T
chr10 60379 60380 10:60380-60380 G/T

> $ sort -s -k1,1V -k2,2n -k3,3n --debug foo
> sort: using ‘en_US.UTF-8’ sorting rules
> sort: leading blanks are significant in key 1; consider also specifying 'b'
> chr10>60379>60380>10:60380-60380>G/T

Awesome! Most bug reports fail to provide this important piece of
information. You may want to follow the advice there of adding 'b' (as
in -k1b,1V); but as far as I can tell, it shouldn't be affecting the
behavior you are seeing (since your sample file didn't have leading
whitespace).

> $ sort --version
> sort (GNU coreutils) 8.22
> $ more /etc/*-release
> ::::::::::::::
> /etc/oracle-release
> ::::::::::::::
> Oracle Linux Server release 7.1
> $ uname -r
> 3.8.13-68.1.2.el7uek.x86_64

I suspect that the most likely culprit is a downstream vendor bug (it is
not the first time that vendor I18N patches have caused sort to
misbehave, where upstream is just fine). For example,
https://bugzilla.redhat.com/show_bug.cgi?id=1148347 says that some
builds of RHEL 7 coreutils 8.22 had a broken I18N patch that calls
strcoll() on too much of the subject line. That would certainly explain
why your build seems affected, if the suffix 'G/T' vs. 'T/T' is being
treated as significant, especially since you proved you are using
en_US.UTF-8 (and not LC_ALL=C).

But that's about all I can point to; at this point, you'll have to take
it up with Oracle.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
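
One way to narrow this down, sketched here as an untested suggestion rather than a confirmed diagnosis: repeat the check with LC_ALL=C. The assumption is that a vendor I18N patch would only be exercised in a multibyte locale such as en_US.UTF-8, so a check that passes under LC_ALL=C but fails under the default locale would point at the locale-specific code path. The variable names and the temporary file are illustrative only.

```shell
# Recreate the two-line sample from the report (tab-separated).
sample=$(printf 'chr10\t60379\t60380\t10:60380-60380\tT/T\nchr10\t60379\t60380\t10:60380-60380\tG/T')
tmp=$(mktemp)
printf '%s\n' "$sample" > "$tmp"

# All three keys compare equal on both lines, and -s disables the
# last-resort whole-line comparison, so the input order should be kept.
if LC_ALL=C sort -s -k1,1V -k2,2n -k3,3n -c "$tmp"; then
    c_check=ok
else
    c_check=disorder
fi
c_sorted=$(LC_ALL=C sort -s -k1,1V -k2,2n -k3,3n "$tmp")

# Same check under whatever locale the environment already uses.
if sort -s -k1,1V -k2,2n -k3,3n -c "$tmp" 2>/dev/null; then
    locale_check=ok
else
    locale_check=disorder
fi

echo "LC_ALL=C check:       $c_check"
echo "current locale check: $locale_check"
rm -f "$tmp"
```

If the two results differ on the affected machine, that is worth including in the report to the vendor.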
