On 12/12/15 22:53, Holger Klene wrote:
> Hello!
>
> Given a text file "sort.bug.txt" with find output like this:
>
> 07. Feb 2015 15:57 ./mess.jpg
> 05. Mär 2015 13:30 ./mess.jpg
>
> Basically two columns: a date and a filename.
>
> I want sort to discard the duplicate lines for the same file, using -u to keep
> only the first and -k to skip over the date column:
>
>> sort sort.bug.txt -u -s -k 1.20 --debug

Note the -s is implicit with -u.
Ideally the above should just work, and it does on Fedora/RHEL/SUSE
with the i18n patch applied. Details on that patch are at
http://www.pixelbeat.org/docs/coreutils_i18n/

> sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
> sort: führende Leerzeichen sind signifikant in Schlüssel 1: Sie sollten daher
> wahrscheinlich auch „b“ angeben
> 05. Mär 2015 13:30 ./mess.jpg
>                   ___________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> As the underlines in debug mode show, the key's start position depends on
> whether the month name contains pure ASCII or the German umlaut ä.
>
> There's a hint coming up to apply option -b, as this one-character offset
> could possibly be overcome thanks to the separating whitespace between the
> columns.
>
>> sort sort.bug.txt -u -s -k 1.20 -b --debug
>
> sort: es werden die Sortierregeln für »de_DE.UTF-8“ verwendet
> 05. Mär 2015 13:30 ./mess.jpg
>                    __________
> 07. Feb 2015 15:57 ./mess.jpg
>                    __________
>
> In fact, it does correct the underlines, but still -u gives both lines,
> though I want it to discard the second line. You can add more lines for the
> same file, but sort insists on keeping exactly two: one with umlaut and the
> other without.

That's a bug in --debug, because its implementation was split from the
actual processing done during the sort (for performance reasons).
Therefore we'll need to fix --debug to show what's actually being done,
which is...

-b is applied _before_ the -k offsets are determined,
and so is ineffective in your case. That is confirmed with:

$ ltrace -e strcoll sort sort.bug.txt -u -k 1.20b
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
05. Mär 2015 13:30 ./mess.jpg
sort->strcoll("./mess.jpg", " ./mess.jpg") = 15
07. Feb 2015 15:57 ./mess.jpg

Perhaps it would be better in your case
to operate directly on the fifth field?

$ sort sort.bug.txt -u -k5b,5 --debug
sort: using ‘en_IE.utf8’ sorting rules
07. Feb 2015 15:57 ./mess.jpg
                   __________

thanks,
Pádraig
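
For anyone wanting to reproduce the difference between the two key styles, here is a minimal sketch. It assumes a POSIX shell and GNU sort; the temporary file and the variable names are only illustrative. LC_ALL=C is forced as an assumption so that the 1.20 offset counts bytes on every system, which mimics the behaviour reported above for sort without the i18n patch.

```shell
#!/bin/sh
# Recreate the two sample lines from the report in a temporary file
# (stand-in for sort.bug.txt).
tmp=$(mktemp)
printf '%s\n' \
    '07. Feb 2015 15:57 ./mess.jpg' \
    '05. Mär 2015 13:30 ./mess.jpg' > "$tmp"

# Offset-based key: position 20 is counted in bytes here, so the
# two-byte "ä" shifts the key start by one and the keys differ.
by_offset=$(LC_ALL=C sort -u -k 1.20 "$tmp")

# Field-based key: compare only field 5, ignoring its leading blank,
# so both lines share the key "./mess.jpg".
by_field=$(LC_ALL=C sort -u -k 5b,5 "$tmp")

printf 'offset key (-k 1.20):\n%s\n\n' "$by_offset"
printf 'field key (-k 5b,5):\n%s\n' "$by_field"

rm -f "$tmp"
```

With the offset key both lines should survive -u, while the field key should leave a single line per file name, independent of how many bytes the month name occupies.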