HaohaiWen wrote:

> I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not
> 5cy - so do you know why the intel scheduler models are underestimating the
> throughput?

The SKX schedule model reports the correct latency, uop count, and throughput for each individual instruction.

vcvtps2pd: https://uops.info/html-instr/VCVTPS2PD_ZMM_YMM.html#SKX
```
Instruction                              Lat  TP           Uops   Ports
VCVTPS2PD (ZMM, YMM) AVX512EVEX          7    1.00 / 1.09  2 / 2  1*p05+1*p5
```

vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX
```
Instruction                              Lat  TP           Uops   Ports
VEXTRACTF64X4 (YMM, ZMM, I8) AVX512EVEX  3    1.00 / 1.00  1 / 1  1*p5
```

There are 5 uops in total: 3 for p5 and 2 for p05. I guess mca assumed those 3\*p5 and 2\*p05 uops can run in parallel, which would give the 3cy it reports. The nanoBench result shows that the 2\*p05 uops did indeed go to p0. It looks like there are some dependencies between them, so in practice they can't run fully in parallel. I don't know how uiCA analyzed it.

https://github.com/llvm/llvm-project/pull/76278
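
For reference, the port-pressure arithmetic behind the 3cy figure can be sketched as follows. This is only an illustration of the reasoning above, not how llvm-mca is implemented. The port sets come from the uops.info entries quoted above; the sequence of two VCVTPS2PD plus one VEXTRACTF64X4 is an assumption inferred from the 5-uop total, and `port_bound` is a hypothetical helper name.

```python
from itertools import product

# Allowed ports per uop, from the uops.info SKX entries quoted above.
VCVTPS2PD = [("p0", "p5"), ("p5",)]  # 1*p05 + 1*p5
VEXTRACTF64X4 = [("p5",)]            # 1*p5


def port_bound(uops):
    """Best-case cycles per iteration if uops were fully independent.

    Tries every port assignment and returns the smallest possible
    load on the busiest port. Dependencies are ignored on purpose.
    """
    best = None
    for choice in product(*uops):
        load = {}
        for port in choice:
            load[port] = load.get(port, 0) + 1
        worst = max(load.values())
        best = worst if best is None else min(best, worst)
    return best


# Assumption: two VCVTPS2PD and one VEXTRACTF64X4 per iteration,
# which matches the 3*p5 + 2*p05 = 5 uops mentioned above.
sequence = VCVTPS2PD * 2 + VEXTRACTF64X4
print(len(sequence), port_bound(sequence))  # prints "5 3"
```

With both p05 uops placed on p0, p5 still carries 3 uops, so the dependency-free bound is 3 cycles. Any gap between that and a measured 5cy has to come from something this simple model leaves out, such as the dependencies suspected above.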