steffenlarsen added a comment.

Sorry for the late response. It looks like you have handled these issues, and 
more, in your patch. Thank you for fixing my blunders. 😄



================
Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:727
 TARGET_BUILTIN(__bmma_m8n8k128_ld_c, "vi*iC*UiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_mma_xor_popc_b1, "vi*iC*iC*iC*Ii", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_st_c_i32, "vi*iC*UiIi", "", AND(SM_75,PTX63))
----------------
tra wrote:
> Bummer. mma.h in CUDA-11.3 still does not compile for Ampere.
> 
> We appear to be missing the new `__bmma_m8n8k128_mma_and_popc_b1` builtin for 
> the `.and` variant of 1-bit `mma` introduced in PTX 7.1 and not included in 
> this patch.
> 
> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
> 
> Do you, by any chance, have an upcoming patch for PTX 7.1, too? :-) 
Haha, I didn't think of that one. Sadly we don't have any plans to work on 
extending support for PTX 7.1+ in the next couple of months, but it looks like 
your new patch (D105384) takes care of it anyway. 😆 


================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.cu:781-786
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 1);
+  // CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.row.stride.f32
+  // expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.*}},(ptx70{{.*}}}}
+  __mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 0);
----------------
tra wrote:
> tra wrote:
> > This looks rather odd. We're calling a `tf32` builtin, but expect to see 
> > an `f32` load intrinsic. Is that expected? 
> > 
> > 
> Never mind. I think I understand what's going on now.
> CUDA headers use  __mma_tf32 builtins. `A` and `B` operate on opaque integer 
> types. `C` and `D` operate on floats.
> However, on the PTX front we have `wmma.load.{a,b}...tf32` but 
> `wmma.load.c...f32`.
> 
> I guess it does make sense to keep LLVM intrinsic names close to the 
> instructions they produce.
> 
> 
> 
Yeah, it was definitely confusing to write. I think the current state is the 
best solution, as it prioritizes consistency within each sub-project. I'm not a 
big fan of the inconsistency, but if we want to follow CUDA's example I suppose 
we're stuck with it.
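For future readers, the asymmetry can be sketched roughly like this (a 
hypothetical helper just to illustrate the naming rule; the builtin and 
intrinsic names are taken from the CHECK lines above, everything else is 
illustrative):

```python
# Illustrative sketch of the tf32/f32 naming asymmetry for the
# m16n16k8 WMMA loads (not the actual clang/LLVM name mangling).

def intrinsic_elt_type(frag):
    # A/B fragments carry the opaque tf32 data, so the LLVM intrinsic
    # keeps the "tf32" suffix; C/D are plain f32 accumulators, so the
    # intrinsic says "f32" even though the clang builtin is still
    # spelled __mma_tf32_*.
    return "tf32" if frag in ("a", "b") else "f32"

def builtin_name(frag):
    # Clang builtins are uniformly tagged with the shape's tf32 name.
    return "__mma_tf32_m16n16k8_ld_" + frag

def intrinsic_name(frag, layout="col"):
    return ("llvm.nvvm.wmma.m16n16k8.load.%s.%s.stride.%s"
            % (frag, layout, intrinsic_elt_type(frag)))
```

So `builtin_name("c")` and `intrinsic_name("c")` disagree on the element type 
by design, matching the PTX `wmma.load.c...f32` instruction.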


================
Comment at: clang/test/CodeGen/builtins-nvptx-mma.py:74
+          make_ldst_ops(["m8n8k4"], ["a", "b", "c", "d"], ["f64"]) +
+          make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"]) +
+          make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"]))
----------------
tra wrote:
> This does not seem to match the generated `builtins-nvptx-mma.cu` which does 
> have `__mma_tf32_m16n16k8_ld_c`
> If I regenerate the test I see a somewhat different set of tests, possibly 
> related to the oddity I've pointed out in the generated test changes in this 
> patch.
You are absolutely right. That's a mistake on my part. Looks like you've got it 
under control in your patch. Thanks!
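For what it's worth, the corrected call sites amount to the cross product 
below (`make_ldst_ops` is sketched here as a plain cross product; the real 
generator in builtins-nvptx-mma.py builds full builtin names and op objects):

```python
# Simplified stand-in for the generator's make_ldst_ops: expand every
# (geometry, fragment, type) combination into an op descriptor string.
def make_ldst_ops(geometries, frags, types):
    return ["%s:%s:%s" % (g, f, t)
            for g in geometries for f in frags for t in types]

# With the corrected call sites, the m16n16k8 shape yields tf32 ops for
# the a/b fragments and f32 ops for the c/d accumulators:
ops = (make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"]) +
       make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"]))
```

This is what makes the generated test contain `__mma_tf32_m16n16k8_ld_c` with 
an `f32` load intrinsic, consistent with the earlier comment thread.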


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104847/new/

https://reviews.llvm.org/D104847

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
