David Rowley <dgrowle...@gmail.com> writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs. Maybe someone else can explain
> better than I can. In more simple terms, generally, the higher the
> "insn per cycle", the better. Also, the lower all of the idle and
> branch miss percentages are, that's generally also better. However,
> you'll notice that the patched version has more frontend and backend
> stalls. I assume this is due to performing more instructions per cycle
> from improved branch prediction causing memory and instruction stalls
> to occur more frequently; effectively (I think) it's just hitting the
> next bottleneck(s) - memory and instruction decoding. At least, modern
> CPUs should be able to out-pace RAM in many workloads, so perhaps it's
> not that surprising that "backend cycles idle" has gone up due to such
> a large increase in instructions per cycle due to improved branch
> prediction.
Thanks for the answer; that's another area that deserves exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> series or 14th series, for example, or even any intel from the past 5
> years would be better than nothing.

I have two kinds of CPUs: a) an Intel Xeon Processor (Icelake) on my
ECS, and b) an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on my Mac. My
ECS reports "<not supported> branch-misses", probably because it runs
under virtualization software, and the Mac doesn't support perf yet :(

--
Best Regards
Andy Fan