[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-05 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

> I think we would be better off teaching an IR optimizer pass to recognize the 
> divide pattern and remap it to the load from the new location, rather than 
> forcing the complexity into every frontend

That's fair. I would've argued that this version should've been the builtin and 
the grid size be the computed one but it's definitely not ideal to have 
multiple versions of this. I'll try to find a place to do this peephole 
optimization. 

https://github.com/llvm/llvm-project/pull/83927
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-05 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm edited https://github.com/llvm/llvm-project/pull/83927
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-05 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm commented:

I think we would be better off teaching an IR optimizer pass to recognize the 
divide pattern and remap it to the load from the location, rather than forcing 
the complexity into every frontend 

https://github.com/llvm/llvm-project/pull/83927
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-04 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 updated 
https://github.com/llvm/llvm-project/pull/83927

>From 56059fdb5a0e22f8c7dcce6642899fdccf77a55b Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Mon, 4 Mar 2024 17:27:28 -0600
Subject: [PATCH] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin

Summary:
The AMDGPU traget was originally designed with OpenCL in mind. The first
verisions only provided the grid size, which is the total numver of
threads in the execution context. In order to get the number of "blocks"
in the CUDA sense you then had to divide by the number of threads in the
current work group.

The switch from COV4 to COV5 changed the way these arguments are encoded
and added a new offset for the "block" size. This patch introduces code
to access this directly instead. The name was chosen at `num_workgroups`
as the OpenCL standard doesn't seem to have a good name for this concept
and calling them "blocks" is just CUDA (even though they're the same
thing).

This patch also provides support for the old COV4 format by doing the
divide of the grid and workgroup sizes. This is so we can switch over to
this in the OpenMP runtime even though it's not the officially supported
version anymore. I tested this using my libc utilities on both versions
and it functioned as expected.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def |   4 +
 clang/lib/CodeGen/CGBuiltin.cpp  |  99 +--
 clang/test/CodeGen/amdgpu-abi-version.c  | 174 +--
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl  |  18 ++
 4 files changed, 264 insertions(+), 31 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..f2f1fc1abbda92 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,36 +17816,39 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction ) 
{
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
+/// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
+///   clang during compilation of user code.
+Value *getAMDGPUABIVersion(CodeGenFunction ) {
+  StringRef Name = "__oclc_ABI_version";
+  auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
+  if (!ABIVersionC)
+ABIVersionC = new llvm::GlobalVariable(
+CGF.CGM.getModule(), CGF.Int32Ty, false,
+llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
+llvm::GlobalVariable::NotThreadLocal,
+CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
+
+  // This load will be eliminated by the IPSCCP because it is constant
+  // weak_odr without externally_initialized. Either changing it to weak or
+  // adding externally_initialized will keep the load.
+  return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+   CGF.CGM.getIntAlign());
+}
+
 /// Emit code based on Code Object ABI version.
 /// COV_4: Emit code to use dispatch ptr
 /// COV_5+   : Emit code to use implicitarg ptr
 /// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
 ///and use its value for COV_4 or COV_5+ approach. It is used for
 ///compiling device libraries in an ABI-agnostic way.
-///
-/// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
-///   clang during compilation of user code.
 Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , unsigned Index) {
+  assert(Index < 3 && "Invalid dimension argument");
   llvm::LoadInst *LD;
 
   auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
 
   if (Cov == CodeObjectVersionKind::COV_None) {
-StringRef Name = "__oclc_ABI_version";
-auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
-if (!ABIVersionC)
-  ABIVersionC = new llvm::GlobalVariable(
-  CGF.CGM.getModule(), CGF.Int32Ty, false,
-  llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
-  llvm::GlobalVariable::NotThreadLocal,
-  CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
-
-// This load will be eliminated by the IPSCCP because it is constant
-// weak_odr without externally_initialized. Either changing it 

[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-04 Thread via cfe-commits

github-actions[bot] wrote:




:warning: C/C++ code formatter, clang-format found issues in your code. 
:warning:



You can test this locally with the following command:


``bash
git-clang-format --diff 32e2294b8abba6b70356aa37b65acf155506d457 
b7090a51b8322cc1c05f5a05894fef5a56dcbcf7 -- clang/lib/CodeGen/CGBuiltin.cpp 
clang/test/CodeGen/amdgpu-abi-version.c
``





View the diff from clang-format here.


``diff
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index fac1921d0f..f2f1fc1abb 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17819,20 +17819,20 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction ) 
{
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 ///   clang during compilation of user code.
 Value *getAMDGPUABIVersion(CodeGenFunction ) {
-StringRef Name = "__oclc_ABI_version";
-auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
-if (!ABIVersionC)
-  ABIVersionC = new llvm::GlobalVariable(
-  CGF.CGM.getModule(), CGF.Int32Ty, false,
-  llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
-  llvm::GlobalVariable::NotThreadLocal,
-  CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
-
-// This load will be eliminated by the IPSCCP because it is constant
-// weak_odr without externally_initialized. Either changing it to weak or
-// adding externally_initialized will keep the load.
-return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
- CGF.CGM.getIntAlign());
+  StringRef Name = "__oclc_ABI_version";
+  auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
+  if (!ABIVersionC)
+ABIVersionC = new llvm::GlobalVariable(
+CGF.CGM.getModule(), CGF.Int32Ty, false,
+llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
+llvm::GlobalVariable::NotThreadLocal,
+CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
+
+  // This load will be eliminated by the IPSCCP because it is constant
+  // weak_odr without externally_initialized. Either changing it to weak or
+  // adding externally_initialized will keep the load.
+  return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+   CGF.CGM.getIntAlign());
 }
 
 /// Emit code based on Code Object ABI version.

``




https://github.com/llvm/llvm-project/pull/83927
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-04 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)


Changes

Summary:
The AMDGPU traget was originally designed with OpenCL in mind. The first
verisions only provided the grid size, which is the total numver of
threads in the execution context. In order to get the number of "blocks"
in the CUDA sense you then had to divide by the number of threads in the
current work group.

The switch from COV4 to COV5 changed the way these arguments are encoded
and added a new offset for the "block" size. This patch introduces code
to access this directly instead. The name was chosen at `num_workgroups`
as the OpenCL standard doesn't seem to have a good name for this concept
and calling them "blocks" is just CUDA (even though they're the same
thing).

This patch also provides support for the old COV4 format by doing the
divide of the grid and workgroup sizes. This is so we can switch over to
this in the OpenMP runtime even though it's not the officially supported
version anymore. I tested this using my libc utilities on both versions
and it functioned as expected.


---

Patch is 20.70 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/83927.diff


4 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4) 
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+79-16) 
- (modified) clang/test/CodeGen/amdgpu-abi-version.c (+161-13) 
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+18) 


``diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..fac1921d0f8a4a 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,22 +17816,9 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction ) {
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
-/// Emit code based on Code Object ABI version.
-/// COV_4: Emit code to use dispatch ptr
-/// COV_5+   : Emit code to use implicitarg ptr
-/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
-///and use its value for COV_4 or COV_5+ approach. It is used for
-///compiling device libraries in an ABI-agnostic way.
-///
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 ///   clang during compilation of user code.
-Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , unsigned Index) {
-  llvm::LoadInst *LD;
-
-  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
-
-  if (Cov == CodeObjectVersionKind::COV_None) {
+Value *getAMDGPUABIVersion(CodeGenFunction ) {
 StringRef Name = "__oclc_ABI_version";
 auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
 if (!ABIVersionC)
@@ -17844,8 +17831,24 @@ Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , 
unsigned Index) {
 // This load will be eliminated by the IPSCCP because it is constant
 // weak_odr without externally_initialized. Either changing it to weak or
 // adding externally_initialized will keep the load.
-Value *ABIVersion = CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
-  CGF.CGM.getIntAlign());
+return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+ CGF.CGM.getIntAlign());
+}
+
+/// Emit code based on Code Object ABI version.
+/// COV_4: Emit code to use dispatch ptr
+/// COV_5+   : Emit code to use implicitarg ptr
+/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
+///and use its value for COV_4 or COV_5+ approach. It is used for
+///compiling device libraries in an ABI-agnostic way.
+Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , unsigned Index) {
+  assert(Index < 3 && "Invalid dimension argument");
+  llvm::LoadInst *LD;
+
+  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
+
+  if (Cov == CodeObjectVersionKind::COV_None) {
+Value *ABIVersion = getAMDGPUABIVersion(CGF);
 
 Value *IsCOV5 = CGF.Builder.CreateICmpSGE(
 ABIVersion,
@@ -17901,6 +17904,58 @@ Value *EmitAMDGPUGridSize(CodeGenFunction , 
unsigned Index) {
   

[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)

2024-03-04 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 created 
https://github.com/llvm/llvm-project/pull/83927

Summary:
The AMDGPU traget was originally designed with OpenCL in mind. The first
verisions only provided the grid size, which is the total numver of
threads in the execution context. In order to get the number of "blocks"
in the CUDA sense you then had to divide by the number of threads in the
current work group.

The switch from COV4 to COV5 changed the way these arguments are encoded
and added a new offset for the "block" size. This patch introduces code
to access this directly instead. The name was chosen at `num_workgroups`
as the OpenCL standard doesn't seem to have a good name for this concept
and calling them "blocks" is just CUDA (even though they're the same
thing).

This patch also provides support for the old COV4 format by doing the
divide of the grid and workgroup sizes. This is so we can switch over to
this in the OpenMP runtime even though it's not the officially supported
version anymore. I tested this using my libc utilities on both versions
and it functioned as expected.


>From b7090a51b8322cc1c05f5a05894fef5a56dcbcf7 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Mon, 4 Mar 2024 17:27:28 -0600
Subject: [PATCH] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin

Summary:
The AMDGPU traget was originally designed with OpenCL in mind. The first
verisions only provided the grid size, which is the total numver of
threads in the execution context. In order to get the number of "blocks"
in the CUDA sense you then had to divide by the number of threads in the
current work group.

The switch from COV4 to COV5 changed the way these arguments are encoded
and added a new offset for the "block" size. This patch introduces code
to access this directly instead. The name was chosen at `num_workgroups`
as the OpenCL standard doesn't seem to have a good name for this concept
and calling them "blocks" is just CUDA (even though they're the same
thing).

This patch also provides support for the old COV4 format by doing the
divide of the grid and workgroup sizes. This is so we can switch over to
this in the OpenMP runtime even though it's not the officially supported
version anymore. I tested this using my libc utilities on both versions
and it functioned as expected.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def |   4 +
 clang/lib/CodeGen/CGBuiltin.cpp  |  95 --
 clang/test/CodeGen/amdgpu-abi-version.c  | 174 +--
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl  |  18 ++
 4 files changed, 262 insertions(+), 29 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..fac1921d0f8a4a 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,22 +17816,9 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction ) {
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
-/// Emit code based on Code Object ABI version.
-/// COV_4: Emit code to use dispatch ptr
-/// COV_5+   : Emit code to use implicitarg ptr
-/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
-///and use its value for COV_4 or COV_5+ approach. It is used for
-///compiling device libraries in an ABI-agnostic way.
-///
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 ///   clang during compilation of user code.
-Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , unsigned Index) {
-  llvm::LoadInst *LD;
-
-  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
-
-  if (Cov == CodeObjectVersionKind::COV_None) {
+Value *getAMDGPUABIVersion(CodeGenFunction ) {
 StringRef Name = "__oclc_ABI_version";
 auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
 if (!ABIVersionC)
@@ -17844,8 +17831,24 @@ Value *EmitAMDGPUWorkGroupSize(CodeGenFunction , 
unsigned Index) {
 // This load will be eliminated by the IPSCCP because it is constant
 // weak_odr without externally_initialized. Either changing it to weak or
 // adding externally_initialized will keep the load.
-Value *ABIVersion = CGF.Builder.CreateAlignedLoad(CGF.Int32Ty,