On Thu, 18 Jun 2026 20:53:37 +0200 Johannes Berg <[email protected]> wrote:
> (hah, just found this window open from this morning ...)
>
> On Thu, 2026-06-18 at 09:39 +0300, Andy Shevchenko wrote:
> > On Wed, Jun 17, 2026 at 10:30:56PM +0100, David Laight wrote:
> > > On Wed, 17 Jun 2026 14:56:09 +0200
> > > Johannes Berg <[email protected]> wrote:
> > > > On Wed, 2026-06-17 at 13:12 +0200, Andy Shevchenko wrote:
> > > > > Convert size_add() to take variadic argument, so we can simplify users
> > > > > with using a macro only once.
> > > >
> > > > > +#define __size_add3(addend1, addend2, addend3, addend4, ...)
> > > > > \
> > > > > +	__size_add(__size_add2(addend1, addend2, addend3), addend4)
> > > > > +#define __size_add4(addend1, addend2, addend3, addend4, addend5,
> > > > > ...) \
> > > > > +	__size_add(__size_add3(addend1, addend2, addend3, addend4),
> > > > > addend5)
> > > >
> > > > I guess it's not going to really matter, but it would generate fewer
> > > > calls to have something more like
> > > >
> > > > #define __size_add3(a1, a2, a3, a4) \
> > > >         size_add(size_add(a1, a2), size_add(a3, a4))
> > > > #define __size_add4(a1, a2, a3, a4, a5) \
> > > >         size_add(size_add(a1, a2), size_add(a3, a4, a5))
> > > >
> > > > as a binary tree, rather than only cutting one off every time. Not sure
> > > > that results in hugely different code though - maybe fewer overflow
> > > > checks?
> >
> > Good question. I'm also thinking that one-by-one may expand in too much of
> > preprocessor code (haven't checked myself).
>
> No. I was confused, and managed to confuse you too perhaps, sorry!
>
> We have to have the same number of operations (__size_add calls)
> regardless, since you have to add it all up: 1 + 2 + 3 + 4 + 5 has a
> fixed number of + signs regardless of how you parenthesise it.
>
> I guess actual CPU execution would have a better data dependency tree if
> we balance it,

Absolutely.
Intel Haswell onwards and zen1-4 can execute 4 independent add/sub/and/or (etc)
every clock.
zen5 wins with 6 arithmetic ops or 4 cmov (and 2 alu) per clock.

> but ... if our hotpath depends on size_add() we've lost already.

I've no idea what the compiler generates, but a cmovc to copy in ~0 when the
add sets carry stands a good chance of being pretty near the best.
What you don't want is a conditional jump.
The add, cmov pair will take two clocks, but the pairs are independent
of each other (the carry flag isn't a limitation).
The cpu should be able to execute two add and two cmov every clock.
So with 4 values the 'tree' version is 4 clocks

The other problem with ((a + b) + c) + d is that execution can't start
until both a and b are available; with (a + b) + (c + d) it is much more
likely that one of the adds can be executed early.

Trying to guess the performance of modern cpu is non-trivial.

David

>
> johannes
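
For reference, a minimal userspace sketch of the two shapes being compared
for four values. sat_add() and the two sat_add4_*() helpers are hypothetical
stand-ins, not the macros from the patch; they assume the GCC/Clang
__builtin_add_overflow() builtin and saturate to SIZE_MAX on overflow.
Whether the compiler really emits add + cmovc for this has not been checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for size_add(): saturate to SIZE_MAX on overflow. */
static inline size_t sat_add(size_t a, size_t b)
{
	size_t sum;

	if (__builtin_add_overflow(a, b, &sum))
		return SIZE_MAX;
	return sum;
}

/* Linear chain: each add depends on the result of the previous one. */
static inline size_t sat_add4_chain(size_t a, size_t b, size_t c, size_t d)
{
	return sat_add(sat_add(sat_add(a, b), c), d);
}

/* Balanced tree: (a + b) and (c + d) have no dependency on each other. */
static inline size_t sat_add4_tree(size_t a, size_t b, size_t c, size_t d)
{
	return sat_add(sat_add(a, b), sat_add(c, d));
}
```

Both forms use three sat_add() calls and return the same value (saturating
addition of unsigned values does not depend on the grouping); only the
dependency chain differs, three adds deep for the chain and two for the tree.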

