On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
>
>                                                O2              O3             PGO
> cycles                              2,581,832,749   8,638,401,568   9,394,200,585
>                                           (1.07s)         (3.49s)         (3.80s)
> instructions                       12,609,600,094  11,827,675,782  12,036,010,638
> branches                            2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                               0.00%           7.94%           8.84%
> cache-misses                            3,012,613       3,055,722       3,076,316
> L1-icache-load-misses                  11,416,391      12,112,703      11,896,077
> icache_tag.stalls                       1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit                        6,856          21,756          22,600
> itlb_misses.walk_completed                 14,430           4,454          15,084
> baclears.any                              131,573         140,355         131,644
> int_misc.clear_resteer_cycles           2,545,915     586,578,125     679,021,993
> machine_clears.count                       22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles        6,985,838      12,929,675       8,405,493
> frontend_retired.any_dsb_miss          28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                  1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                       11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                         207,881,687     216,734,007     212,003,064
>
> The data above shows:
> o O3/PGO lead to a *2.3x/2.6x* performance drop compared with O2,
>   respectively.
> o O3/PGO reduced instructions by 6.2% and 4.5%. I think this is attributable
>   to aggressive inlining.
> o O3/PGO introduced very bad branch prediction. I will explain it later.
> o Code built with O3 has higher iTLB misses but much lower sTLB misses. This
>   is beyond my expectation.
> o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>   I don't know why. (subcategory MC is not measured yet)

The MCs are caused by memory-ordering conflicts and are attributable to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal. The
derived figures quoted above can also be recomputed from the raw counters;
see the two sketches below the quote.
> o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
> o idq.mite_uops of O3/PGO increased by only ~4%/2%, while idq.dsb_uops grew
>   to ~2.4x/2.6x the O2 value. The DSB hits well, so frontend fetching and
>   decoding is not a problem for O3/PGO.
> o Other events are mainly affected by the bad branch misprediction.
> --
> Cheers,
> Changbin Du
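For anyone re-deriving the figures quoted above, a minimal Python sketch over
the raw counters from the table (plain arithmetic, not part of the
measurement):

  # Raw counter values copied from the perf table above.
  cycles = {"O2": 2_581_832_749, "O3": 8_638_401_568, "PGO": 9_394_200_585}
  insns  = {"O2": 12_609_600_094, "O3": 11_827_675_782, "PGO": 12_036_010_638}
  mcs    = {"O2": 22_235, "O3": 39_671, "PGO": 37_307}

  for cfg in ("O3", "PGO"):
      slowdown = cycles[cfg] / cycles["O2"] - 1   # extra cycles vs. O2
      insn_cut = 1 - insns[cfg] / insns["O2"]     # instruction reduction
      mc_extra = mcs[cfg] / mcs["O2"] - 1         # extra machine clears
      print(f"{cfg}: +{slowdown:.1%} cycles, -{insn_cut:.1%} instructions, "
            f"+{mc_extra:.1%} machine clears")

This prints +234.6%/+263.9% cycles (the 2.3x/2.6x above), 6.2%/4.5% fewer
instructions, and +78.4%/+67.8% machine clears.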
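To put a number on "the DSB hits well": the DSB coverage can be estimated
from the two idq counters in the table (a rough sketch; this ignores
microcode-sequencer uops, which were not collected):

  dsb  = {"O2": 11_149_445_952, "O3": 26_438_051_062, "PGO": 28_622_657_650}
  mite = {"O2": 207_881_687, "O3": 216_734_007, "PGO": 212_003_064}

  for cfg in ("O2", "O3", "PGO"):
      # Share of delivered uops that came from the DSB rather than MITE.
      coverage = dsb[cfg] / (dsb[cfg] + mite[cfg])
      print(f"{cfg}: {coverage:.1%} uops from DSB")

All three builds sit above 98% (O2 98.2%, O3 99.2%, PGO 99.3%), which is why
the frontend can be ruled out as the bottleneck here.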
