Pádraig Brady writes:
Thanks a lot for working on this.
The idea is sound. I was also biased towards using memchr() and strstr()
in an update to cut(1) I'm working on, as the platform-specific
optimizations for those in glibc are a significant win.
One of your observations confused me though.
getc() and putchar() should also avoid function calls:
they degenerate via inlining and macros into direct manipulation of the
stdio buffers, only periodically calling __uflow() and __overflow() respectively.
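For reference, the fast path those macros reduce to is roughly the following
(a paraphrase of the glibc getc_unlocked fast path, using glibc-internal FILE
field names and a made-up function name, not the exact source):

#include <stdio.h>

static inline int
getc_unlocked_sketch (FILE *fp)
{
  if (fp->_IO_read_ptr >= fp->_IO_read_end)
    return __uflow (fp);                           /* buffer empty: real call */
  return *(unsigned char *) fp->_IO_read_ptr++;    /* common case: no call */
}

so the function call is only taken when the stdio buffer is exhausted.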
Compiling uniq(1) on Fedora 43 here with default (-O2) options I see:
$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ ltrace -c uniq as.in
eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae,eeeaae
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 97.95   57.971874          55   1048575 memcmp
  2.05    1.211896          75     16129 __uflow
...
So function calls happen only once per 4KiB rather than per byte.
Fair correction! I disassembled the hot loop on aarch64 and found that __uflow
is indeed only called about once every 8 KiB, so the way I phrased it was
definitely wrong.
That said, even in the non-underflow case the overhead is still substantial.
On aarch64, this runs for every input byte:
403d2c: strb w21, [x19], #1 ; store byte to linebuffer at p, then p++
403d30: cmp w25, w22 ; compare delimiter with byte
403d34: b.eq 403dd0 ; if delimiter found, exit loop
403d38: ldp x2, x3, [x20, #8] ; load _IO_read_ptr and _IO_read_end from FILE
403d3c: cmp x2, x3 ; compare read_ptr with read_end
403d40: b.cs 403d90 ; if read_ptr >= read_end, uflow
403d44: add x3, x2, #1 ; compute read_ptr + 1
403d48: str x3, [x20, #8] ; store incremented read_ptr back to FILE
403d4c: ldrb w21, [x2] ; load byte from old read_ptr
403d50: mov w22, w21 ; copy byte for delimiter compare next iteration
403d54: cmp x19, x23 ; compare p with end
403d58: b.ne 403d2c ; if space remains, loop back
So it's mostly the per-byte FILE pointer bookkeeping in the fast path, rather
than __uflow itself, that's doing the damage. In my profiles this loop accounts
for ~90% of CPU time in the uniq benchmarks.
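For contrast, the kind of block-at-a-time reader being switched to looks
roughly like this (a minimal sketch assuming an fread()+memchr() scheme with
made-up names, not the actual patch; error handling and growing LINE for
overlong records are omitted):

#include <stdio.h>
#include <string.h>

/* Copy the next DELIM-terminated record from FP into LINE (at most SIZE
   bytes, delimiter included).  Returns the record length, or 0 at EOF.
   Static state just keeps the sketch short.  */
static size_t
next_record (FILE *fp, int delim, char *line, size_t size)
{
  static char blk[65536];          /* block buffer */
  static size_t fill, pos;         /* bytes buffered / current offset */
  size_t len = 0;

  for (;;)
    {
      if (pos == fill)
        {
          /* stdio bookkeeping once per block, not once per byte.  */
          fill = fread (blk, 1, sizeof blk, fp);
          pos = 0;
          if (fill == 0)
            return len;            /* EOF or error: return what we have */
        }

      /* Let memchr() do the per-byte scanning.  */
      char *nl = memchr (blk + pos, delim, fill - pos);
      size_t avail = nl ? (size_t) (nl - blk) - pos + 1 : fill - pos;
      size_t n = avail <= size - len ? avail : size - len;

      memcpy (line + len, blk + pos, n);
      len += n;
      pos += n;

      if ((nl && n == avail) || len == size)
        return len;                /* found the delimiter, or LINE is full */
    }
}

That moves the FILE pointer updates and the __uflow branch out of the per-byte
path, and leaves the byte scanning to glibc's optimized memchr().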
So how about for v2 I change the commit message from:
For example, with a 200 MB file with 5M lines, there are 200 million
getc_unlocked calls, and those clearly dominate perf profiles with 91% of
uniq runtime and 78% of comm runtime.
To:
For example, with a 200 MB file with 5M lines, the getc_unlocked loop body
executes 200 million times, performing per-byte FILE pointer bookkeeping on
each iteration, and this overhead clearly dominates profiles at 91% of uniq
runtime and 78% of comm runtime.
Open to other suggestions :-)