On Wed, Aug 11, 2021 at 1:43 PM Kaz Kylheku (Coreutils) <962-396-1...@kylheku.com> wrote: > > On 2021-08-11 05:03, Peng Yu wrote: > > On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist <edqu...@cs.wisc.edu> > > wrote: > >> (With just a bit more work, you can do all your sorting in a single > >> awk > >> process too (without piping out to sort), but i think you'll still be > >> disappointed with the performance compared to a single sort command.) > > > > Yes, this involves many calls of the coreuils' sort, which is not > > No, not this last remark, which is about "in a single awk process".
I know there is one awk process. I don't understand why you mentioned it. > > efficient. Would it make sense to add an option in sort so that sort > > can sort a partially sorted input in one shot. > > IF you're willing to use GNU Coreutils instead of Unix, you probably > have I don't think using awk is efficient. I am program a number awk programs for simple transforming the input and tested it, in general, it is slower than the equivalent python code, let along C code. You can talk about doing most of the work in awk below. I don't think that make sense. Having coreutils' sort be able to do a partial sort is a more reasonable solution. > GNU Awk also. GNU Awk has a sorting function using which a solution > could be cobbed together. Maybe something like: > > > function dump_delete_data() > { > n = asorti(data, idx); > for (i = 1; i <= n; i++) > print data[idx[i]]; > delete data > serial = 0 > } > > BEGIN { serial = 0 } > $1 != prev_1 { dump_delete_data() } > NF >= 2 { prev_1 = $1 > data[$2 "." serial++] = $0 > next } > 1 { dump_delete_data() > print } > END { dump_delete_data() } > > > The asorti function has some features behind it to sort in various ways; > you have to look into that. It involves manipulating a > PROCINFO["sorted_in"] > value. > > It's possible to use a custom comparison function. > > For more info, see GNU Awk documentation, the Gawk mailing list or > the comp.lang.awk newsgroup. > > The purpose of the serial variable in my above code so that we get > two entries in data[] if in a given group, there are identical $2 > values. > > For instance if $2 is "foo", then the key we use is actually "foo.3" if > the current value of serial is 3. The sorting is then done on these > suffixed keys, which works okay for lexicographic sorting. > > It is not a stable sort, though! Because foo.123 will be sorted before > foo.23, even though the 123 serial value comes later. If we padded the > integer with enough leading zeros for the largest possible group, it > would then be stable: foo.00023 would come before foo.00123: > > data[sprintf("%s.%08X", $2, serial++)] = $0 > > kind of thing. > > If you don't care about reproducing duplicates, you can remove this > logic > entirely. > > How the overall program works is that data[] is an array indexed on the > second column values (plus serial suffixes). The value of each index > value is the entire record, $0. > > asorti sorts the $2 indices, throwing away the $0 values, which > is why we direct it into a secondary array called idx, preserving > the data array. The idx array ends up indexed on integer values 1 to N, > where N is the chunk size. If we iterate over these values, idx[i] > gives us the $2 column values (with serial suffix) in sorted order. > We can then use that as the key into data[] to get the corresponding > records in sorted order. > > Cheers ... > -- Regards, Peng