On Thu, Nov 15, 2018 at 20:13:38 -0500, Emilio G. Cota wrote:
> I'll generate now some more perf numbers that we could include in the
> commit logs.

SPEC numbers are a net perf decrease, unfortunately:

  [ASCII bar chart: "Softmmu speedup for SPEC06int (test set)";
   y-axis from 0.75 to 1.1; one bar with error bars per benchmark
   plus the geomean; x-axis labels truncated in the original]

  png: https://imgur.com/aO39gyP

Turns out that the additional instructions are the problem, despite the
much lower icache miss rate. For instance, here are some numbers for
h264ref running on the not-so-recent Xeon E5-2643 (i.e. Sandy Bridge):

- Before:
 1,137,737,512,668      instructions           #    2.02  insns per cycle
   563,574,505,040      cycles
     5,663,616,681      L1-icache-load-misses

     164.091239774 seconds time elapsed

- After:
 1,216,600,582,476      instructions           #    2.06  insns per cycle
   591,888,969,223      cycles
     3,082,426,508      L1-icache-load-misses

     172.232292897 seconds time elapsed

It's possible that newer machines with larger reorder buffers will be
able to take better advantage of the higher instruction locality, hiding
the latency of having to execute more instructions. I'll test on Skylake
tomorrow.

Thanks,

		E.
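
For reference, the relative changes implied by the h264ref counters above
can be derived with a few lines of Python. This is only a sketch: the
dictionary and function names are illustrative and not part of perf or
QEMU, and the values are copied verbatim from the perf output in this
message.

```python
# Counter values copied from the perf output above
# (h264ref on a Xeon E5-2643). Names below are illustrative only.
before = {
    "instructions": 1_137_737_512_668,
    "cycles": 563_574_505_040,
    "L1-icache-load-misses": 5_663_616_681,
    "seconds": 164.091239774,
}
after = {
    "instructions": 1_216_600_582_476,
    "cycles": 591_888_969_223,
    "L1-icache-load-misses": 3_082_426_508,
    "seconds": 172.232292897,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def ipc(counters):
    """Instructions retired per cycle."""
    return counters["instructions"] / counters["cycles"]


def icache_mpki(counters):
    """L1 icache load misses per thousand instructions."""
    return counters["L1-icache-load-misses"] / counters["instructions"] * 1000.0


# Percent change for every counter present in the "before" run.
changes = {name: pct_change(before[name], after[name]) for name in before}

for name, delta in changes.items():
    print(f"{name:>22}: {delta:+.2f}%")
print(f"{'IPC':>22}: {ipc(before):.2f} -> {ipc(after):.2f}")
print(f"{'icache MPKI':>22}: {icache_mpki(before):.2f} -> {icache_mpki(after):.2f}")
```

With these inputs the script reports roughly 7% more instructions, 5%
more cycles and 5% more elapsed time, against about 46% fewer L1 icache
load misses, which is consistent with the extra instructions outweighing
the locality gain on this machine.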