ThomasRaoux wrote:

> The problem seems to be that we're now reusing the `MOV_B64_i` instruction to
> move the address of the global into a register. This instruction is marked as
> `isAsCheapAsAMove = true` so we no longer bother to do CSE on it. This
> doesn't necessarily seem like a problem or incorrect so I'm hesitant to "fix"
> it by re-introducing a non-cheap mov instruction for global-addresses. We've
> perturbed PTX a little bit and that can sometimes cause both regressions and
> improvements.
>
> @ThomasRaoux have you experimented with using maxnreg or --maxrregcount to
> help PTXAS out here? If this kernel doesn't have a register target, this
> might be the sort of thing that could change the compiler's guess about what
> it should be.
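For readers following the thread, the effect described in the quote can be illustrated with a toy model. This is only a sketch under stated assumptions, not LLVM's actual MachineCSE logic: the `Instr` class, the `toy_cse` function, the `skip_cheap` switch, and the register names are all hypothetical. Only `MOV_B64_i` and the idea of a "cheap as a move" flag come from the message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str
    opcode: str
    operands: tuple
    cheap_as_move: bool = False  # hypothetical stand-in for isAsCheapAsAMove


def toy_cse(instrs, skip_cheap=True):
    """Toy CSE: drop instructions that recompute an already-available value.

    When skip_cheap is True, instructions flagged cheap_as_move are never
    eliminated, mimicking the behaviour described in the quoted message.
    """
    available = {}  # (opcode, operands) -> register already holding the value
    renames = {}    # eliminated destination -> surviving register
    out = []
    for ins in instrs:
        # Rewrite operands that referred to an eliminated instruction.
        operands = tuple(renames.get(op, op) for op in ins.operands)
        key = (ins.opcode, operands)
        if key in available and not (skip_cheap and ins.cheap_as_move):
            renames[ins.dest] = available[key]
            continue
        available.setdefault(key, ins.dest)
        out.append(Instr(ins.dest, ins.opcode, operands, ins.cheap_as_move))
    return out


# Hypothetical instruction stream: the global's address is materialized twice.
program = [
    Instr("%rd1", "MOV_B64_i", ("global_var",), cheap_as_move=True),
    Instr("%rd2", "add.s64", ("%rd1", "8")),
    Instr("%rd3", "MOV_B64_i", ("global_var",), cheap_as_move=True),
    Instr("%rd4", "add.s64", ("%rd3", "16")),
]

with_heuristic = toy_cse(program, skip_cheap=True)
without_heuristic = toy_cse(program, skip_cheap=False)

print(len(with_heuristic), "instructions when cheap moves are exempt from CSE")
print(len(without_heuristic), "instructions when cheap moves are CSE'd")
```

In this toy model the duplicate address move survives when cheap instructions are exempt, so any merging has to be done by the downstream assembler. Whether ptxas does that is exactly the open question in this thread.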
Looking at the SASS, it doesn't use extra registers, but I do see extra arithmetic in the loop. I need to take an ncu trace to understand why it makes a significant difference, but it might just be the extra arithmetic and worse scheduling. If ptxas doesn't treat this move as a no-op, CSEing it would be nice. I'll check whether I can find a workaround; otherwise I'm not sure how to unblock this, as the performance drop will block our LLVM upgrade.

https://github.com/llvm/llvm-project/pull/145581
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits