ThomasRaoux wrote:

@AlexMaclean I compared the runs in ncu. There is no difference in 
occupancy and the arithmetic usage is roughly the same, but I see some large 
stalls on an `SR_CgaCtaId` read in a loop, which comes from the extra 
global_smem copy:
<img width="3026" height="900" alt="image" 
src="https://github.com/user-attachments/assets/33962f55-1132-49cf-8442-33970055647c"
 />

This seems to be the main reason for the significant slowdown here. It looks 
like a legitimate problem in what ptxas generates, and I don't think it can be 
worked around from the user's side.
Can we go back to doing CSE for the global_smem move, as this seems to help 
code quality?

<img width="1887" height="450" alt="image" 
src="https://github.com/user-attachments/assets/dabc5330-eb75-4dad-a13b-ce4b59f08ac0"
 />

<img width="1887" height="450" alt="image" 
src="https://github.com/user-attachments/assets/fcb4dc15-9883-4a56-a8d3-b689bb057cfb"
 />
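
For anyone following along who is less familiar with the term, here is a toy sketch of what CSE does to a duplicated address move. It is only an illustration: the instruction names (`mov_addr`, `add`, `store`), the register names, and the `local_cse` helper are all made up for this example, and it is neither PTX syntax nor the pass LLVM actually runs.

```python
# Toy local CSE over a straight-line list of (dest, op, args) tuples.
# Assumption: each register is defined once (SSA-like) and the ops listed
# in PURE_OPS have no side effects. All names here are hypothetical.
PURE_OPS = {"mov_addr", "add"}


def local_cse(instrs):
    """Drop repeated pure computations and reuse the first result."""
    available = {}  # (op, args) -> register that already holds the value
    replaced = {}   # eliminated register -> surviving register
    out = []
    for dest, op, args in instrs:
        # Rewrite operands that pointed at an eliminated duplicate.
        args = tuple(replaced.get(a, a) for a in args)
        key = (op, args)
        if op in PURE_OPS and key in available:
            # Same computation seen before: reuse it, emit nothing.
            replaced[dest] = available[key]
            continue
        if op in PURE_OPS:
            available[key] = dest
        out.append((dest, op, args))
    return out


before = [
    ("r1", "mov_addr", ("global_smem",)),
    ("r2", "add", ("r1", "16")),
    ("r3", "mov_addr", ("global_smem",)),  # duplicate of r1
    ("r4", "add", ("r3", "16")),           # duplicate of r2 once r3 -> r1
    (None, "store", ("r2", "r4")),
]
after = local_cse(before)

for instr in after:
    print(instr)
```

With the duplicate move folded away, only one copy of the address computation is left for later stages to deal with.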


https://github.com/llvm/llvm-project/pull/145581
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
