HaohaiWen wrote:

> I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 
> 5cy - so do you know why the intel scheduler models are underestimating the 
> throughput?

The SKX schedule model reports the correct latency, uop count and throughput
for each instruction taken individually.
vcvtps2pd: https://uops.info/html-instr/VCVTPS2PD_ZMM_YMM.html#SKX
```
Instruction                          Lat  TP           Uops   Ports
VCVTPS2PD (ZMM, YMM)    AVX512EVEX   7    1.00 / 1.09  2 / 2  1*p05+1*p5
```
vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX
```
Instruction                                  Lat  TP           Uops   Ports
VEXTRACTF64X4 (YMM, ZMM, I8)    AVX512EVEX   3    1.00 / 1.00  1 / 1  1*p5
```

There are 5 uops in total: 3 that can only issue on p5 and 2 that can issue
on p0 or p5. I guess mca assumes those 3\*p5 and 2\*p05 uops can run in
parallel. The nanoBench result shows that the 2\*p05 uops did indeed go to p0,
so it looks like there are some dependencies that prevent them from running
fully in parallel. I don't know how uiCA analyzed it.
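
As a rough illustration of where a 3cy figure can come from, here is a minimal
Python sketch of a port-pressure lower bound. It is not how llvm-mca is
implemented. It assumes the sequence is two VCVTPS2PD plus one VEXTRACTF64X4
(inferred from the 5-uop total above, not stated explicitly), that uops are
distributed ideally across their allowed ports, and that there are no
dependencies between them. The helper name `port_bound` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def port_bound(uops):
    """Lower bound on cycles per iteration from port pressure alone.

    `uops` is a list of frozensets, each naming the ports a uop may use.
    For every subset of ports, the uops confined to that subset need at
    least (count / subset size) cycles; the bound is the maximum over all
    subsets. Dependencies and front-end limits are ignored.
    """
    ports = sorted(set().union(*uops))
    best = Fraction(0)
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            allowed = set(subset)
            confined = sum(1 for u in uops if u <= allowed)
            best = max(best, Fraction(confined, len(allowed)))
    return best


P5 = frozenset({"p5"})          # uop that can only issue on port 5
P05 = frozenset({"p0", "p5"})   # uop that can issue on port 0 or port 5

# Port usage taken from the uops.info tables above.
vcvtps2pd = [P05, P5]
vextractf64x4 = [P5]

# Assumed sequence: two conversions plus one extract (5 uops in total).
sequence = vcvtps2pd * 2 + vextractf64x4

print(port_bound(sequence))  # 3
```

Under these assumptions p5 is the bottleneck with 3 uops, which matches the 3cy
estimate. Any extra cycles in the measurement would have to come from effects
this bound ignores, such as dependencies.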

https://github.com/llvm/llvm-project/pull/76278
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
