Pádraig Brady writes:
Thanks a lot for working on this.

The idea is sound. I also biased towards using memchr() and strstr()
with an update to cut(1) I'm working on, as the platform specific
optimizations for those in glibc are a significant win.

One of your observations confused me though.
getc() and putchar() should also avoid function calls,
and degenerate via inlining and macros to directly manipulate stdio mem,
periodically calling __uflow() and __overflow() respectively.

Compiling uniq(1) on Fedora 43 here with default (-O2) options I see:

$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ ltrace -c uniq as.in
eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
97.95   57.971874          55   1048575 memcmp
 2.05    1.211896          75     16129 __uflow
...

So function calls happen only once per 4KiB rather than per byte.

Fair correction! I disassembled the hot loop on aarch64 and found that __uflow is indeed called only about once per buffer (every 8K or so here), so the way I phrased it was definitely wrong.

That said, even in the non-underflow case the overhead is still significant. On aarch64 this happens for every byte:

  403d2c: strb w21, [x19], #1     ; store byte to linebuffer at p, then p++
  403d30: cmp  w25, w22           ; compare delimiter with byte
  403d34: b.eq 403dd0             ; if delimiter found, exit loop
  403d38: ldp  x2, x3, [x20, #8]  ; load _IO_read_ptr and _IO_read_end from FILE
  403d3c: cmp  x2, x3             ; compare read_ptr with read_end
  403d40: b.cs 403d90             ; if read_ptr >= read_end, uflow
  403d44: add  x3, x2, #1         ; compute read_ptr + 1
  403d48: str  x3, [x20, #8]      ; store incremented read_ptr back to FILE
  403d4c: ldrb w21, [x2]          ; load byte from old read_ptr
  403d54: mov  w22, w21           ; copy byte for delimiter compare next iteration
  403d54: cmp  x19, x23           ; compare p with end
  403d58: b.ne 403d2c             ; if space remains, loop back

So it's mostly the per-byte FILE pointer bookkeeping in the fast path, rather than __uflow itself, that's doing the damage. In my profiles this accounts for about 90% of CPU time in the uniq benchmarks.

So how about for v2 I change the commit message from:

For example, with a 200 MB file with 5M lines, there are 200 million getc_unlocked calls, and those clearly dominate perf profiles with 91% of uniq runtime and 78% of comm runtime.

To:

For example, with a 200 MB file with 5M lines, the getc_unlocked loop body executes 200 million times, performing per-byte FILE pointer bookkeeping on each iteration, and this overhead clearly dominates profiles at 91% of uniq runtime and 78% of comm runtime.

Open to other suggestions :-)
