ThomasRaoux wrote:

@AlexMaclean I compared the runs in ncu: there is no difference in occupancy and arithmetic usage is roughly the same, but I see large stalls on an `SR_CgaCtaId` read in a loop that comes from the extra global_smem copy:

<img width="3026" height="900" alt="image" src="https://github.com/user-attachments/assets/33962f55-1132-49cf-8442-33970055647c" />

This seems to be the main reason for the significant slowdown here. Judging from what ptxas generates, it looks like a legitimate problem, and I don't think it can be worked around from the user's point of view. Can we go back to doing CSE for the global_smem move, since this seems to help code quality?

<img width="1887" height="450" alt="image" src="https://github.com/user-attachments/assets/dabc5330-eb75-4dad-a13b-ce4b59f08ac0" />
<img width="1887" height="450" alt="image" src="https://github.com/user-attachments/assets/fcb4dc15-9883-4a56-a8d3-b689bb057cfb" />

https://github.com/llvm/llvm-project/pull/145581
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits