steffenlarsen added a comment.

Sorry for the late response. Looks like you have handled the issues and more in your patch. Thank you for fixing my blunders. 😄
================
Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:727
TARGET_BUILTIN(__bmma_m8n8k128_ld_c, "vi*iC*UiIi", "", AND(SM_75,PTX63))
TARGET_BUILTIN(__bmma_m8n8k128_mma_xor_popc_b1, "vi*iC*iC*iC*Ii", "", AND(SM_75,PTX63))
TARGET_BUILTIN(__bmma_m8n8k128_st_c_i32, "vi*iC*UiIi", "", AND(SM_75,PTX63))
----------------
tra wrote:
> Bummer. mma.h in CUDA-11.3 still does not compile for Ampere.
>
> We appear to be missing the new `__bmma_m8n8k128_mma_and_popc_b1` builtin for
> the `.and` variant of 1-bit `mma` introduced in PTX 7.1 and not included in
> this patch.
>
> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
>
> Do you, by any chance, have an upcoming patch for PTX 7.1, too. :-)

Haha, I didn't think of that one. Sadly we don't have any plans to work on extending support for PTX 7.1+ in the next couple of months, but it looks like your new patch (D105384) takes care of it anyway. 😆

================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.cu:781-786
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 1);
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.row.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 0);
----------------
tra wrote:
> tra wrote:
> > This looks rather odd. We're calling a `tf32` builtin but expect to see
> > an `f32` load intrinsic. Is that expected?
>
> Never mind. I think I understand what's going on now.
> The CUDA headers use __mma_tf32 builtins. `A` and `B` operate on opaque integer
> types. `C` and `D` operate on floats.
> However, on the PTX front we have `wmma.load.{a,b}...tf32` but
> `wmma.load.c...f32`.
> > I guess it does make sense to keep the LLVM intrinsic names close to the
> > instructions they produce.

Yeah, it was definitely confusing to write. I think the current state is the best solution, as it prioritizes consistency within the sub-projects. I'm not a big fan of the inconsistency, but if we want to follow CUDA's example I suppose we're stuck with it.

================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.py:74
+  make_ldst_ops(["m8n8k4"], ["a", "b", "c", "d"], ["f64"])
+
+  make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"])
+
+  make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"])
----------------
tra wrote:
> This does not seem to match the generated `builtins-nvptx-mma.cu`, which does
> have `__mma_tf32_m16n16k8_ld_c`.
> If I regenerate the test I see a somewhat different set of tests, possibly
> related to the oddity I've pointed out in the generated test changes in this
> patch.

You are absolutely right. That's a mistake on my part. Looks like you've got it under control in your patch. Thanks!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104847/new/

https://reviews.llvm.org/D104847

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
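[Editorial aside] The tf32/f32 naming asymmetry discussed in this thread can be summarized in a short sketch. The builtins are all spelled with `tf32` (matching the CUDA `mma.h` API), while the LLVM intrinsic names follow the PTX instructions: `tf32` only for the `a`/`b` fragments, plain `f32` for `c`/`d`. The helper below is hypothetical (it is not part of `builtins-nvptx-mma.py`); it only illustrates how a builtin like `__mma_tf32_m16n16k8_ld_c` ends up matched against an `...load.c.col.stride.f32` intrinsic in the generated test.

```python
def expected_load_intrinsic(geom, frag, layout="col"):
    """Hypothetical helper: build the llvm.nvvm WMMA load intrinsic name
    that a tf32 __mma builtin is expected to lower to.

    Per the discussion above: A and B carry opaque-integer tf32 fragments,
    so their load intrinsics are named "tf32"; C and D operate on plain
    floats, so their load intrinsics are named "f32".
    """
    elt = "tf32" if frag in ("a", "b") else "f32"
    return "llvm.nvvm.wmma.{}.load.{}.{}.stride.{}".format(geom, frag, layout, elt)


# The __mma_tf32_m16n16k8_ld_c builtin is checked against an f32 intrinsic:
print(expected_load_intrinsic("m16n16k8", "c"))
# whereas the A fragment keeps the tf32 spelling:
print(expected_load_intrinsic("m16n16k8", "a", "row"))
```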