[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
jhuber6 wrote:

> I think we would be better off teaching an IR optimizer pass to recognize the
> divide pattern and remap it to the load from the new location, rather than
> forcing the complexity into every frontend

That's fair. I would've argued that this version should've been the builtin and the grid size been the computed one, but it's definitely not ideal to have multiple versions of this. I'll try to find a place to do this peephole optimization.

https://github.com/llvm/llvm-project/pull/83927

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
https://github.com/arsenm edited https://github.com/llvm/llvm-project/pull/83927
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
https://github.com/arsenm commented:

I think we would be better off teaching an IR optimizer pass to recognize the divide pattern and remap it to the load from the new location, rather than forcing the complexity into every frontend

https://github.com/llvm/llvm-project/pull/83927
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
https://github.com/jhuber6 updated https://github.com/llvm/llvm-project/pull/83927

>From 56059fdb5a0e22f8c7dcce6642899fdccf77a55b Mon Sep 17 00:00:00 2001
From: Joseph Huber
Date: Mon, 4 Mar 2024 17:27:28 -0600
Subject: [PATCH] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin

Summary:
The AMDGPU target was originally designed with OpenCL in mind. The first
versions only provided the grid size, which is the total number of threads in
the execution context. In order to get the number of "blocks" in the CUDA sense
you then had to divide by the number of threads in the current work group. The
switch from COV4 to COV5 changed the way these arguments are encoded and added
a new offset for the "block" size. This patch introduces code to access this
directly instead. The name was chosen as `num_workgroups` since the OpenCL
standard doesn't seem to have a good name for this concept and calling them
"blocks" is just CUDA (even though they're the same thing).

This patch also provides support for the old COV4 format by doing the divide of
the grid and workgroup sizes. This is so we can switch over to this in the
OpenMP runtime even though it's not the officially supported version anymore. I
tested this using my libc utilities on both versions and it functioned as
expected.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def |   4 +
 clang/lib/CodeGen/CGBuiltin.cpp              |  99 +--
 clang/test/CodeGen/amdgpu-abi-version.c      | 174 +--
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl  |  18 ++
 4 files changed, 264 insertions(+), 31 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..f2f1fc1abbda92 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,36 +17816,39 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction &CGF) {
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
+/// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
+/// clang during compilation of user code.
+Value *getAMDGPUABIVersion(CodeGenFunction &CGF) {
+  StringRef Name = "__oclc_ABI_version";
+  auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
+  if (!ABIVersionC)
+    ABIVersionC = new llvm::GlobalVariable(
+        CGF.CGM.getModule(), CGF.Int32Ty, false,
+        llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
+        llvm::GlobalVariable::NotThreadLocal,
+        CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
+
+  // This load will be eliminated by the IPSCCP because it is constant
+  // weak_odr without externally_initialized. Either changing it to weak or
+  // adding externally_initialized will keep the load.
+  return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+                                       CGF.CGM.getIntAlign());
+}
+
 /// Emit code based on Code Object ABI version.
 /// COV_4: Emit code to use dispatch ptr
 /// COV_5+ : Emit code to use implicitarg ptr
 /// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
 ///            and use its value for COV_4 or COV_5+ approach. It is used for
 ///            compiling device libraries in an ABI-agnostic way.
-///
-/// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
-/// clang during compilation of user code.
 Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
+  assert(Index < 3 && "Invalid dimension argument");
   llvm::LoadInst *LD;
 
   auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
 
   if (Cov == CodeObjectVersionKind::COV_None) {
-    StringRef Name = "__oclc_ABI_version";
-    auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
-    if (!ABIVersionC)
-      ABIVersionC = new llvm::GlobalVariable(
-          CGF.CGM.getModule(), CGF.Int32Ty, false,
-          llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
-          llvm::GlobalVariable::NotThreadLocal,
-          CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
-
-    // This load will be eliminated by the IPSCCP because it is constant
-    // weak_odr without externally_initialized. Either changing it
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
github-actions[bot] wrote:

:warning: C/C++ code formatter, clang-format found issues in your code. :warning:

You can test this locally with the following command:

```bash
git-clang-format --diff 32e2294b8abba6b70356aa37b65acf155506d457 b7090a51b8322cc1c05f5a05894fef5a56dcbcf7 -- clang/lib/CodeGen/CGBuiltin.cpp clang/test/CodeGen/amdgpu-abi-version.c
```

View the diff from clang-format here.

```diff
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index fac1921d0f..f2f1fc1abb 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17819,20 +17819,20 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction &CGF) {
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 /// clang during compilation of user code.
 Value *getAMDGPUABIVersion(CodeGenFunction &CGF) {
-    StringRef Name = "__oclc_ABI_version";
-    auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
-    if (!ABIVersionC)
-      ABIVersionC = new llvm::GlobalVariable(
-          CGF.CGM.getModule(), CGF.Int32Ty, false,
-          llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
-          llvm::GlobalVariable::NotThreadLocal,
-          CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
-
-    // This load will be eliminated by the IPSCCP because it is constant
-    // weak_odr without externally_initialized. Either changing it to weak or
-    // adding externally_initialized will keep the load.
-    return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
-                                         CGF.CGM.getIntAlign());
+  StringRef Name = "__oclc_ABI_version";
+  auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
+  if (!ABIVersionC)
+    ABIVersionC = new llvm::GlobalVariable(
+        CGF.CGM.getModule(), CGF.Int32Ty, false,
+        llvm::GlobalValue::ExternalLinkage, nullptr, Name, nullptr,
+        llvm::GlobalVariable::NotThreadLocal,
+        CGF.CGM.getContext().getTargetAddressSpace(LangAS::opencl_constant));
+
+  // This load will be eliminated by the IPSCCP because it is constant
+  // weak_odr without externally_initialized. Either changing it to weak or
+  // adding externally_initialized will keep the load.
+  return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+                                       CGF.CGM.getIntAlign());
 }
 
 /// Emit code based on Code Object ABI version.
```

https://github.com/llvm/llvm-project/pull/83927
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
llvmbot wrote:

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Summary:
The AMDGPU target was originally designed with OpenCL in mind. The first versions only provided the grid size, which is the total number of threads in the execution context. In order to get the number of "blocks" in the CUDA sense you then had to divide by the number of threads in the current work group. The switch from COV4 to COV5 changed the way these arguments are encoded and added a new offset for the "block" size. This patch introduces code to access this directly instead. The name was chosen as `num_workgroups` since the OpenCL standard doesn't seem to have a good name for this concept and calling them "blocks" is just CUDA (even though they're the same thing).

This patch also provides support for the old COV4 format by doing the divide of the grid and workgroup sizes. This is so we can switch over to this in the OpenMP runtime even though it's not the officially supported version anymore. I tested this using my libc utilities on both versions and it functioned as expected.
---

Patch is 20.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/83927.diff

4 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+4)
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+79-16)
- (modified) clang/test/CodeGen/amdgpu-abi-version.c (+161-13)
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+18)

```diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..fac1921d0f8a4a 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,22 +17816,9 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction &CGF) {
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
-/// Emit code based on Code Object ABI version.
-/// COV_4: Emit code to use dispatch ptr
-/// COV_5+ : Emit code to use implicitarg ptr
-/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
-///            and use its value for COV_4 or COV_5+ approach. It is used for
-///            compiling device libraries in an ABI-agnostic way.
-///
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 /// clang during compilation of user code.
-Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
-  llvm::LoadInst *LD;
-
-  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
-
-  if (Cov == CodeObjectVersionKind::COV_None) {
+Value *getAMDGPUABIVersion(CodeGenFunction &CGF) {
     StringRef Name = "__oclc_ABI_version";
     auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
     if (!ABIVersionC)
@@ -17844,8 +17831,24 @@ Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
     // This load will be eliminated by the IPSCCP because it is constant
     // weak_odr without externally_initialized. Either changing it to weak or
     // adding externally_initialized will keep the load.
-    Value *ABIVersion = CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
-                                                      CGF.CGM.getIntAlign());
+    return CGF.Builder.CreateAlignedLoad(CGF.Int32Ty, ABIVersionC,
+                                         CGF.CGM.getIntAlign());
+}
+
+/// Emit code based on Code Object ABI version.
+/// COV_4: Emit code to use dispatch ptr
+/// COV_5+ : Emit code to use implicitarg ptr
+/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
+///            and use its value for COV_4 or COV_5+ approach. It is used for
+///            compiling device libraries in an ABI-agnostic way.
+Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
+  assert(Index < 3 && "Invalid dimension argument");
+  llvm::LoadInst *LD;
+
+  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
+
+  if (Cov == CodeObjectVersionKind::COV_None) {
+    Value *ABIVersion = getAMDGPUABIVersion(CGF);
     Value *IsCOV5 = CGF.Builder.CreateICmpSGE(
         ABIVersion,
@@ -17901,6 +17904,58 @@ Value *EmitAMDGPUGridSize(CodeGenFunction &CGF, unsigned Index) {
```
[clang] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin (PR #83927)
https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/83927

Summary:
The AMDGPU target was originally designed with OpenCL in mind. The first versions only provided the grid size, which is the total number of threads in the execution context. In order to get the number of "blocks" in the CUDA sense you then had to divide by the number of threads in the current work group. The switch from COV4 to COV5 changed the way these arguments are encoded and added a new offset for the "block" size. This patch introduces code to access this directly instead. The name was chosen as `num_workgroups` since the OpenCL standard doesn't seem to have a good name for this concept and calling them "blocks" is just CUDA (even though they're the same thing).

This patch also provides support for the old COV4 format by doing the divide of the grid and workgroup sizes. This is so we can switch over to this in the OpenMP runtime even though it's not the officially supported version anymore. I tested this using my libc utilities on both versions and it functioned as expected.

>From b7090a51b8322cc1c05f5a05894fef5a56dcbcf7 Mon Sep 17 00:00:00 2001
From: Joseph Huber
Date: Mon, 4 Mar 2024 17:27:28 -0600
Subject: [PATCH] [AMDGPU] Introduce 'amdgpu_num_workgroups_{xyz}' builtin

Summary:
The AMDGPU target was originally designed with OpenCL in mind. The first
versions only provided the grid size, which is the total number of threads in
the execution context. In order to get the number of "blocks" in the CUDA sense
you then had to divide by the number of threads in the current work group. The
switch from COV4 to COV5 changed the way these arguments are encoded and added
a new offset for the "block" size. This patch introduces code to access this
directly instead. The name was chosen as `num_workgroups` since the OpenCL
standard doesn't seem to have a good name for this concept and calling them
"blocks" is just CUDA (even though they're the same thing).

This patch also provides support for the old COV4 format by doing the divide of
the grid and workgroup sizes. This is so we can switch over to this in the
OpenMP runtime even though it's not the officially supported version anymore. I
tested this using my libc utilities on both versions and it functioned as
expected.

---
 clang/include/clang/Basic/BuiltinsAMDGPU.def |   4 +
 clang/lib/CodeGen/CGBuiltin.cpp              |  95 --
 clang/test/CodeGen/amdgpu-abi-version.c      | 174 +--
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl  |  18 ++
 4 files changed, 262 insertions(+), 29 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 213311b96df74f..43f3f500bf8056 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -42,6 +42,10 @@ BUILTIN(__builtin_amdgcn_workgroup_size_x, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_y, "Us", "nc")
 BUILTIN(__builtin_amdgcn_workgroup_size_z, "Us", "nc")
 
+BUILTIN(__builtin_amdgcn_num_workgroups_x, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_y, "Ui", "nc")
+BUILTIN(__builtin_amdgcn_num_workgroups_z, "Ui", "nc")
+
 BUILTIN(__builtin_amdgcn_grid_size_x, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_y, "Ui", "nc")
 BUILTIN(__builtin_amdgcn_grid_size_z, "Ui", "nc")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 9ee51ca7142c77..fac1921d0f8a4a 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -17816,22 +17816,9 @@ Value *EmitAMDGPUImplicitArgPtr(CodeGenFunction &CGF) {
   return Call;
 }
 
-// \p Index is 0, 1, and 2 for x, y, and z dimension, respectively.
-/// Emit code based on Code Object ABI version.
-/// COV_4: Emit code to use dispatch ptr
-/// COV_5+ : Emit code to use implicitarg ptr
-/// COV_NONE : Emit code to load a global variable "__oclc_ABI_version"
-///            and use its value for COV_4 or COV_5+ approach. It is used for
-///            compiling device libraries in an ABI-agnostic way.
-///
 /// Note: "__oclc_ABI_version" is supposed to be emitted and intialized by
 /// clang during compilation of user code.
-Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
-  llvm::LoadInst *LD;
-
-  auto Cov = CGF.getTarget().getTargetOpts().CodeObjectVersion;
-
-  if (Cov == CodeObjectVersionKind::COV_None) {
+Value *getAMDGPUABIVersion(CodeGenFunction &CGF) {
     StringRef Name = "__oclc_ABI_version";
     auto *ABIVersionC = CGF.CGM.getModule().getNamedGlobal(Name);
     if (!ABIVersionC)
@@ -17844,8 +17831,24 @@ Value *EmitAMDGPUWorkGroupSize(CodeGenFunction &CGF, unsigned Index) {
     // This load will be eliminated by the IPSCCP because it is constant
    // weak_odr without externally_initialized. Either changing it to weak or
    // adding externally_initialized will keep the load.
-    Value *ABIVersion = CGF.Builder.CreateAlignedLoad(CGF.Int32Ty,