HaohaiWen wrote:

There's cross iteration true dependency in previous experiment.
```
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
The second cvt and first cvt of the next iteration need to wait for finish of 
vextract64x4. Therefore its cost is 5.
In real scenario, value of zmm0 should be reset to fpext new input.
```
vmovaps zmm0, zmm3
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
This breaks the dependency and now cost is 3.
```
# ./nanoBench.sh -init "xor zmm0, zmm0" -asm "vmovaps zmm0, zmm3; vcvtps2pd 
zmm2, ymm0; vextractf64x4 ymm0, zmm0, 1; vcvtps2pd zmm1, ymm0" -config 
configs/cfg_SkylakeX_common.txt -unroll 1000 -loop 1000 -warm_up_count 10 -cpu 0
Note: Hyper-threading is enabled; it can be disabled with "sudo ./disable-HT.sh"
CORE_CYCLES: 3.00
INST_RETIRED: 4.00
IDQ.MITE_UOPS: 6.46
IDQ.DSB_UOPS: -0.45
IDQ.MS_UOPS: 0.01
LSD.UOPS: 0.00
UOPS_ISSUED: 6.01
UOPS_EXECUTED: 5.01
UOPS_RETIRED.RETIRE_SLOTS: 6.01
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.01
UOPS_DISPATCHED_PORT.PORT_7: 0.00
```




https://github.com/llvm/llvm-project/pull/76278
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to