On 02/13/2013 04:45 PM, Assaf Gordon wrote:
On 02/12/2013 01:31 AM, Assaf Gordon wrote:
I'd like to offer a proof-of-concept patch for adding sort-like "--key" support for the 'uniq' program, as discussed here:
http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00211.html
and in several other threads.

One more update with two changes:

1. re-arranged "src/uniq_sort_common.h" to have the functions in the same order as in "src/sort.c", making "diff src/uniq_sort_common.h src/sort.c" much easier to view (and making it easy to see that the functions were not modified at all).

2. when specifying an explicit field separator and using "-c", report the counts as plain numbers, without the space padding that right-aligns them, followed by the separator. This might be controversial, but I always needed that :)
(used to wrap every "uniq -c" with "sed 's/^ *// ; s/ /\t/'" )

==
## Existing:
$ printf "a\tx\na\tx\nb\ty\n" | uniq -c
      2 a       x
      1 b       y

## New:
$ printf "a\tx\na\tx\nb\ty\n" | ./src/uniq -t $'\t' -c
2       a       x
1       b       y
==

Also, I'm wondering what exactly is the effect of the following statement (from http://lists.gnu.org/archive/html/bug-coreutils/2006-06/msg00217.html ):
"This point was addressed in IEEE Std 1003.1-2001/Cor 1-2002, item XCU/TC1/D6/40, and it's why the current Posix spec says that the behavior of uniq depends on LC_COLLATE."
And whether sort's keycompare functions fulfill this requirement, and whether the current 'uniq' tests check this situation? Otherwise my changes are not backwards-compatible.
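
For reference, the "sed" wrapper mentioned under change 2 can be sketched as a small shell function that works with an unpatched uniq. This is only an illustration: the function name uniq_c_tsv is hypothetical, and the tab is produced with printf because "\t" in a sed replacement is not portable across sed implementations.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): run "uniq -c" on stdin and
# rewrite the count column as "<count><TAB><line>".
uniq_c_tsv() {
    tab=$(printf '\t')
    # 1st expression: strip the padding in front of the count.
    # 2nd expression: turn the first space (the one uniq prints right
    # after the count) into a tab.
    uniq -c | sed "s/^ *//; s/ /$tab/"
}

# Same input as the example above: two identical lines, then a third.
printf 'a\tx\na\tx\nb\ty\n' | uniq_c_tsv
```

On the sample input this yields the same two lines as the "New" output shown above, with the count separated from the line by a tab.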
Sort's keycompare handles that. The above was just in relation to a perf improvement to just byte compare rather than convert before comparison. We still may be able to do something more efficient along these lines when considering multibyte.

A related possibility for the non multibyte case is that the -k option order doesn't matter to uniq I think, so there might be perf/cache benefits to always processing the keys in numerical rather than specified order.

cheers,
Pádraig.
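
The remark about -k order can be illustrated with a small sketch: uniq only needs to know whether two lines agree on every key, so the yes/no answer is the same whichever key is checked first (only the point at which a mismatch is found changes). The helper below is hypothetical and makes simplifying assumptions: tab-separated fields, keys given as plain field numbers, and byte-wise string comparison rather than sort's keycompare or LC_COLLATE rules.

```shell
#!/bin/sh
# Hypothetical helper: keys_equal FIELDS LINE1 LINE2
# Succeeds (exit 0) if the two tab-separated lines agree on every field
# number listed in FIELDS (comma-separated), fails (exit 1) otherwise.
keys_equal() {
    printf '%s\n%s\n' "$2" "$3" | awk -F '\t' -v keys="$1" '
        NR == 1 { split($0, first, FS); next }   # remember fields of line 1
        {
            n = split(keys, k, ",")
            for (i = 1; i <= n; i++)
                # Appending "" forces a string comparison, not a numeric one.
                if ((first[k[i]] "") != ($(k[i]) ""))
                    exit 1                        # first mismatch decides
        }'
}

l1=$(printf 'a\tx\t1')
l2=$(printf 'a\ty\t1')

# The lines differ only in field 2, so any key list containing 2 reports
# "different" and any list without it reports "same", in either order.
for keys in 1,3 3,1 1,2 2,1; do
    if keys_equal "$keys" "$l1" "$l2"; then
        echo "$keys: same"
    else
        echo "$keys: different"
    fi
done
```

This is unlike sort, where the first key that differs determines the ordering, so the specified key order has to be respected there.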
