[clang] [llvm] [CUDA] Mark CUDA-12.4 as supported and introduce ptx 8.4. (PR #91516)
https://github.com/Artem-B closed https://github.com/llvm/llvm-project/pull/91516 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [CUDA] Mark CUDA-12.4 as supported and introduce ptx 8.4. (PR #91516)
https://github.com/Artem-B created https://github.com/llvm/llvm-project/pull/91516 None >From 6bb4800a5ed7c5f2ffeaded874d72f7624539122 Mon Sep 17 00:00:00 2001 From: Artem Belevich Date: Wed, 8 May 2024 11:07:34 -0700 Subject: [PATCH] [CUDA] Mark CUDA-12.4 as supported and introduce ptx 8.4. --- clang/docs/ReleaseNotes.rst | 1 + clang/include/clang/Basic/BuiltinsNVPTX.def | 5 - clang/include/clang/Basic/Cuda.h| 3 ++- clang/lib/Basic/Cuda.cpp| 5 +++-- clang/lib/Driver/ToolChains/Cuda.cpp| 3 +++ llvm/lib/Target/NVPTX/NVPTX.td | 2 +- 6 files changed, 14 insertions(+), 5 deletions(-) diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst index 0f9728c00e648..a3c8e4141ca54 100644 --- a/clang/docs/ReleaseNotes.rst +++ b/clang/docs/ReleaseNotes.rst @@ -798,6 +798,7 @@ CUDA/HIP Language Changes CUDA Support +- Clang now supports CUDA SDK up to 12.4 AIX Support ^^^ diff --git a/clang/include/clang/Basic/BuiltinsNVPTX.def b/clang/include/clang/Basic/BuiltinsNVPTX.def index 8d3c5e69d55cf..9e243d740ed7a 100644 --- a/clang/include/clang/Basic/BuiltinsNVPTX.def +++ b/clang/include/clang/Basic/BuiltinsNVPTX.def @@ -61,7 +61,9 @@ #pragma push_macro("PTX81") #pragma push_macro("PTX82") #pragma push_macro("PTX83") -#define PTX83 "ptx83" +#pragma push_macro("PTX84") +#define PTX84 "ptx84" +#define PTX83 "ptx83|" PTX84 #define PTX82 "ptx82|" PTX83 #define PTX81 "ptx81|" PTX82 #define PTX80 "ptx80|" PTX81 @@ -1091,3 +1093,4 @@ TARGET_BUILTIN(__nvvm_getctarank_shared_cluster, "iv*3", "", AND(SM_90,PTX78)) #pragma pop_macro("PTX81") #pragma pop_macro("PTX82") #pragma pop_macro("PTX83") +#pragma pop_macro("PTX84") diff --git a/clang/include/clang/Basic/Cuda.h b/clang/include/clang/Basic/Cuda.h index ba0e4465a0f5a..2d67c4181d129 100644 --- a/clang/include/clang/Basic/Cuda.h +++ b/clang/include/clang/Basic/Cuda.h @@ -41,9 +41,10 @@ enum class CudaVersion { CUDA_121, CUDA_122, CUDA_123, + CUDA_124, FULLY_SUPPORTED = CUDA_123, PARTIALLY_SUPPORTED = - CUDA_123, // Partially 
supported. Proceed with a warning. + CUDA_124, // Partially supported. Proceed with a warning. NEW = 1, // Too new. Issue a warning, but allow using it. }; const char *CudaVersionToString(CudaVersion V); diff --git a/clang/lib/Basic/Cuda.cpp b/clang/lib/Basic/Cuda.cpp index 113483db5729b..e8ce15eb0decb 100644 --- a/clang/lib/Basic/Cuda.cpp +++ b/clang/lib/Basic/Cuda.cpp @@ -14,7 +14,7 @@ struct CudaVersionMapEntry { }; #define CUDA_ENTRY(major, minor) \ { \ -#major "." #minor, CudaVersion::CUDA_##major##minor, \ +#major "." #minor, CudaVersion::CUDA_##major##minor, \ llvm::VersionTuple(major, minor) \ } @@ -41,6 +41,7 @@ static const CudaVersionMapEntry CudaNameVersionMap[] = { CUDA_ENTRY(12, 1), CUDA_ENTRY(12, 2), CUDA_ENTRY(12, 3), +CUDA_ENTRY(12, 4), {"", CudaVersion::NEW, llvm::VersionTuple(std::numeric_limits::max())}, {"unknown", CudaVersion::UNKNOWN, {}} // End of list tombstone. }; @@ -241,7 +242,7 @@ CudaVersion MaxVersionForCudaArch(CudaArch A) { } } -bool CudaFeatureEnabled(llvm::VersionTuple Version, CudaFeature Feature) { +bool CudaFeatureEnabled(llvm::VersionTuple Version, CudaFeature Feature) { return CudaFeatureEnabled(ToCudaVersion(Version), Feature); } diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp index 6634e6d818b33..d5f93c9c830fa 100644 --- a/clang/lib/Driver/ToolChains/Cuda.cpp +++ b/clang/lib/Driver/ToolChains/Cuda.cpp @@ -82,6 +82,8 @@ CudaVersion getCudaVersion(uint32_t raw_version) { return CudaVersion::CUDA_122; if (raw_version < 12040) return CudaVersion::CUDA_123; + if (raw_version < 12050) +return CudaVersion::CUDA_124; return CudaVersion::NEW; } @@ -688,6 +690,7 @@ void NVPTX::getNVPTXTargetFeatures(const Driver , const llvm::Triple , case CudaVersion::CUDA_##CUDA_VER: \ PtxFeature = "+ptx" #PTX_VER; \ break; +CASE_CUDA_VERSION(124, 84); CASE_CUDA_VERSION(123, 83); CASE_CUDA_VERSION(122, 82); CASE_CUDA_VERSION(121, 81); diff --git a/llvm/lib/Target/NVPTX/NVPTX.td 
b/llvm/lib/Target/NVPTX/NVPTX.td index 6aa98543e5e22..05457c71cd392 100644 --- a/llvm/lib/Target/NVPTX/NVPTX.td +++ b/llvm/lib/Target/NVPTX/NVPTX.td @@ -41,7 +41,7 @@ foreach sm = [20, 21, 30, 32, 35, 37, 50, 52, 53, def SM90a: FeatureSM<"90a", 901>; foreach version = [32, 40, 41, 42, 43, 50, 60, 61, 62, 63, 64, 65, - 70, 71, 72, 73, 74, 75, 76, 77, 78, 80, 81, 82, 83] in + 70, 71, 72, 73, 74, 75, 76, 77,
[clang] [CUDA][HIP] Fix record layout on Windows (PR #87651)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/87651
[clang] [CUDA] Rename SM_32 to SM_32_ to work around AIX headers (PR #88779)
@@ -86,7 +88,7 @@ static const CudaArchToStringMap arch_names[] = { // clang-format off {CudaArch::UNUSED, "", ""}, SM2(20, "compute_20"), SM2(21, "compute_20"), // Fermi -SM(30), SM(32), SM(35), SM(37), // Kepler +SM(30), SM3(32, "compute_32"), SM(35), SM(37), // Kepler Artem-B wrote: Nit: we don't really need SM3 here. For a one-off we could just use `{CudaArch::SM_32_, "sm_32", "compute_32"}` https://github.com/llvm/llvm-project/pull/88779
[clang] [CUDA] Rename SM_32 to SM_32_ to work around AIX headers (PR #88779)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/88779
[clang] [clang] Fix name conflict with `sys/mac.h` on AIX (PR #88644)
@@ -50,6 +50,10 @@ const char *CudaVersionToString(CudaVersion V); // Input is "Major.Minor" CudaVersion CudaStringToVersion(const llvm::Twine ); +// We have a name conflict with sys/mac.h on AIX +#ifdef SM_32 +#undef SM_32 +#endif Artem-B wrote: SGTM. Thank you for taking care of this issue. On a side note, do we know if there's a way to file a bug against AIX? They should not be setting macros with names that could conceivably be defined by a user; in theory, such macros should normally be double-underscore-prefixed. https://github.com/llvm/llvm-project/pull/88644
[clang] [clang] Fix name conflict with `sys/mac.h` on AIX (PR #88644)
@@ -50,6 +50,10 @@ const char *CudaVersionToString(CudaVersion V); // Input is "Major.Minor" CudaVersion CudaStringToVersion(const llvm::Twine ); +// We have a name conflict with sys/mac.h on AIX +#ifdef SM_32 +#undef SM_32 +#endif Artem-B wrote: Deprecating and removing support for old GPUs needs to be done, but it's not going to happen here and now, so we still need a better short-term fix. Undefining a macro set by external headers is not it. https://github.com/llvm/llvm-project/pull/88644
[clang] [clang] Fix name conflict with `sys/mac.h` on AIX (PR #88644)
@@ -50,6 +50,10 @@ const char *CudaVersionToString(CudaVersion V); // Input is "Major.Minor" CudaVersion CudaStringToVersion(const llvm::Twine ); +// We have a name conflict with sys/mac.h on AIX +#ifdef SM_32 +#undef SM_32 +#endif Artem-B wrote: > We could always just make all of these lower case instead? That would be odd; LLVM style wants them CamelCased. This enum is rarely used, so renaming the values to something more CUDA/NVPTX-specific, e.g. `NVSM_32`, would be best, IMO. Or we could rename only `SM_32`; the constant is rather inconsequential and is used in only a few places. Renaming it to `_SM_32`, with a comment that AIX headers define `SM_32`, would suffice. https://github.com/llvm/llvm-project/pull/88644
[clang] [clang] Fix name conflict with `sys/mac.h` on AIX (PR #88644)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/88644
[clang] [clang] Fix name conflict with `sys/mac.h` on AIX (PR #88644)
@@ -50,6 +50,10 @@ const char *CudaVersionToString(CudaVersion V); // Input is "Major.Minor" CudaVersion CudaStringToVersion(const llvm::Twine ); +// We have a name conflict with sys/mac.h on AIX +#ifdef SM_32 +#undef SM_32 +#endif Artem-B wrote: Ugh. What could possibly go wrong if someone who needed the original definition of SM_32 ends up transitively including this header and losing the macro definition? A better way to handle it as a workaround would be to push the macro definition, undef it, and then pop it back at the end of the header. Even better would be to add prefixes to the macros and/or the enum here to disambiguate them. https://github.com/llvm/llvm-project/pull/88644
[clang] [clang] Introduce `SemaCUDA` (PR #88559)
https://github.com/Artem-B approved this pull request. LGTM. The changes appear to be mechanical in nature, so `check clang` tests should be sufficient to verify we've re-connected things correctly. https://github.com/llvm/llvm-project/pull/88559
[clang] [Offload] Do not pass `-fcf-protection=` for offloading (PR #88402)
@@ -6867,8 +6867,14 @@ void Clang::ConstructJob(Compilation , const JobAction , CmdArgs.push_back("-nogpulib"); if (Arg *A = Args.getLastArg(options::OPT_fcf_protection_EQ)) { -CmdArgs.push_back( -Args.MakeArgString(Twine("-fcf-protection=") + A->getValue())); +// Do not pass this argument to the offloading device if the target does not +// support it. +// TODO: We need a better way to detect incompatible options for offloading. +if (JA.getOffloadingDeviceKind() == Action::OFK_None || +(!TC.getTriple().isAMDGPU() && !TC.getTriple().isNVPTX() && + !TC.getTriple().isSPIRV())) Artem-B wrote: +1. We have grown too many offloading cases all over the place over time. It was fine when there was only CUDA/NVPTX, was sort of OK when AMDGPU got added, now it gets to be a bit too much. https://github.com/llvm/llvm-project/pull/88402 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Offload] Do not pass `-fcf-protection=` for offloading (PR #88402)
https://github.com/Artem-B commented: LGTM in principle. https://github.com/llvm/llvm-project/pull/88402
[clang] [Offload] Do not pass `-fcf-protection=` for offloading (PR #88402)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/88402
[clang] [Offload] Do not pass `-fcf-protection=` for offloading (PR #88402)
@@ -6867,8 +6867,14 @@ void Clang::ConstructJob(Compilation , const JobAction , CmdArgs.push_back("-nogpulib"); if (Arg *A = Args.getLastArg(options::OPT_fcf_protection_EQ)) { -CmdArgs.push_back( -Args.MakeArgString(Twine("-fcf-protection=") + A->getValue())); +// Do not pass this argument to the offloading device if the target does not +// support it. +// TODO: We need a better way to detect incompatible options for offloading. +if (JA.getOffloadingDeviceKind() == Action::OFK_None || +(!TC.getTriple().isAMDGPU() && !TC.getTriple().isNVPTX() && Artem-B wrote: Nit: I'd collapse negations into one: ``` !(TC.getTriple().isAMDGPU() || TC.getTriple().isNVPTX() || TC.getTriple().isSPIRV()) ``` https://github.com/llvm/llvm-project/pull/88402 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [Offload] Change unregister library to use `atexit` instead of destructor (PR #86830)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/86830
[clang] [llvm] [Offload] Change unregister library to use `atexit` instead of destructor (PR #86830)
@@ -186,57 +186,62 @@ GlobalVariable *createBinDesc(Module , ArrayRef> Bufs, ".omp_offloading.descriptor" + Suffix); } -void createRegisterFunction(Module , GlobalVariable *BinDesc, -StringRef Suffix) { +Function *createUnregisterFunction(Module , GlobalVariable *BinDesc, + StringRef Suffix) { LLVMContext = M.getContext(); auto *FuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false); - auto *Func = Function::Create(FuncTy, GlobalValue::InternalLinkage, -".omp_offloading.descriptor_reg" + Suffix, ); + auto *Func = + Function::Create(FuncTy, GlobalValue::InternalLinkage, + ".omp_offloading.descriptor_unreg" + Suffix, ); Func->setSection(".text.startup"); - // Get __tgt_register_lib function declaration. - auto *RegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), - /*isVarArg*/ false); - FunctionCallee RegFuncC = - M.getOrInsertFunction("__tgt_register_lib", RegFuncTy); + // Get __tgt_unregister_lib function declaration. + auto *UnRegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), +/*isVarArg*/ false); + FunctionCallee UnRegFuncC = + M.getOrInsertFunction("__tgt_unregister_lib", UnRegFuncTy); // Construct function body IRBuilder<> Builder(BasicBlock::Create(C, "entry", Func)); - Builder.CreateCall(RegFuncC, BinDesc); + Builder.CreateCall(UnRegFuncC, BinDesc); Builder.CreateRetVoid(); - // Add this function to constructors. - // Set priority to 1 so that __tgt_register_lib is executed AFTER - // __tgt_register_requires (we want to know what requirements have been - // asked for before we load a libomptarget plugin so that by the time the - // plugin is loaded it can report how many devices there are which can - // satisfy these requirements). 
- appendToGlobalCtors(M, Func, /*Priority*/ 1); + return Func; } -void createUnregisterFunction(Module , GlobalVariable *BinDesc, - StringRef Suffix) { +void createRegisterFunction(Module , GlobalVariable *BinDesc, +StringRef Suffix) { LLVMContext = M.getContext(); auto *FuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false); - auto *Func = - Function::Create(FuncTy, GlobalValue::InternalLinkage, - ".omp_offloading.descriptor_unreg" + Suffix, ); + auto *Func = Function::Create(FuncTy, GlobalValue::InternalLinkage, +".omp_offloading.descriptor_reg" + Suffix, ); Func->setSection(".text.startup"); - // Get __tgt_unregister_lib function declaration. - auto *UnRegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), -/*isVarArg*/ false); - FunctionCallee UnRegFuncC = - M.getOrInsertFunction("__tgt_unregister_lib", UnRegFuncTy); + // Get __tgt_register_lib function declaration. + auto *RegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), + /*isVarArg*/ false); + FunctionCallee RegFuncC = + M.getOrInsertFunction("__tgt_register_lib", RegFuncTy); + + auto *AtExitTy = FunctionType::get( + Type::getInt32Ty(C), PointerType::getUnqual(C), /*isVarArg=*/false); + FunctionCallee AtExit = M.getOrInsertFunction("atexit", AtExitTy); + + Function *UnregFunc = createUnregisterFunction(M, BinDesc, Suffix); // Construct function body IRBuilder<> Builder(BasicBlock::Create(C, "entry", Func)); - Builder.CreateCall(UnRegFuncC, BinDesc); + + // Register the destructors with 'atexit', This is expected by the CUDA Artem-B wrote: Typo. `,` -> `.` https://github.com/llvm/llvm-project/pull/86830 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [Offload] Change unregister library to use `atexit` instead of destructor (PR #86830)
@@ -186,57 +186,62 @@ GlobalVariable *createBinDesc(Module , ArrayRef> Bufs, ".omp_offloading.descriptor" + Suffix); } -void createRegisterFunction(Module , GlobalVariable *BinDesc, -StringRef Suffix) { +Function *createUnregisterFunction(Module , GlobalVariable *BinDesc, + StringRef Suffix) { LLVMContext = M.getContext(); auto *FuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false); - auto *Func = Function::Create(FuncTy, GlobalValue::InternalLinkage, -".omp_offloading.descriptor_reg" + Suffix, ); + auto *Func = + Function::Create(FuncTy, GlobalValue::InternalLinkage, + ".omp_offloading.descriptor_unreg" + Suffix, ); Func->setSection(".text.startup"); - // Get __tgt_register_lib function declaration. - auto *RegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), - /*isVarArg*/ false); - FunctionCallee RegFuncC = - M.getOrInsertFunction("__tgt_register_lib", RegFuncTy); + // Get __tgt_unregister_lib function declaration. + auto *UnRegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), +/*isVarArg*/ false); + FunctionCallee UnRegFuncC = + M.getOrInsertFunction("__tgt_unregister_lib", UnRegFuncTy); // Construct function body IRBuilder<> Builder(BasicBlock::Create(C, "entry", Func)); - Builder.CreateCall(RegFuncC, BinDesc); + Builder.CreateCall(UnRegFuncC, BinDesc); Builder.CreateRetVoid(); - // Add this function to constructors. - // Set priority to 1 so that __tgt_register_lib is executed AFTER - // __tgt_register_requires (we want to know what requirements have been - // asked for before we load a libomptarget plugin so that by the time the - // plugin is loaded it can report how many devices there are which can - // satisfy these requirements). 
- appendToGlobalCtors(M, Func, /*Priority*/ 1); + return Func; } -void createUnregisterFunction(Module , GlobalVariable *BinDesc, - StringRef Suffix) { +void createRegisterFunction(Module , GlobalVariable *BinDesc, +StringRef Suffix) { LLVMContext = M.getContext(); auto *FuncTy = FunctionType::get(Type::getVoidTy(C), /*isVarArg*/ false); - auto *Func = - Function::Create(FuncTy, GlobalValue::InternalLinkage, - ".omp_offloading.descriptor_unreg" + Suffix, ); + auto *Func = Function::Create(FuncTy, GlobalValue::InternalLinkage, +".omp_offloading.descriptor_reg" + Suffix, ); Func->setSection(".text.startup"); - // Get __tgt_unregister_lib function declaration. - auto *UnRegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), -/*isVarArg*/ false); - FunctionCallee UnRegFuncC = - M.getOrInsertFunction("__tgt_unregister_lib", UnRegFuncTy); + // Get __tgt_register_lib function declaration. + auto *RegFuncTy = FunctionType::get(Type::getVoidTy(C), getBinDescPtrTy(M), + /*isVarArg*/ false); + FunctionCallee RegFuncC = + M.getOrInsertFunction("__tgt_register_lib", RegFuncTy); + + auto *AtExitTy = FunctionType::get( + Type::getInt32Ty(C), PointerType::getUnqual(C), /*isVarArg=*/false); + FunctionCallee AtExit = M.getOrInsertFunction("atexit", AtExitTy); + + Function *UnregFunc = createUnregisterFunction(M, BinDesc, Suffix); // Construct function body IRBuilder<> Builder(BasicBlock::Create(C, "entry", Func)); - Builder.CreateCall(UnRegFuncC, BinDesc); + + // Register the destructors with 'atexit', This is expected by the CUDA Artem-B wrote: > This is expected by the CUDA runtime I'd add a reference to clang/lib/CodeGen/CGCUDANV.cpp which provides some history why we switched to `atexit`. https://github.com/llvm/llvm-project/pull/86830 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [Offload] Change unregister library to use `atexit` instead of destructor (PR #86830)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/86830
[clang] [HIP][NFC] Refactor managed var codegen (PR #85976)
https://github.com/Artem-B approved this pull request. LGTM, sans the "NFC" part in the description. https://github.com/llvm/llvm-project/pull/85976
[clang] [HIP][NFC] Refactor managed var codegen (PR #85976)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/85976
[clang] [HIP][NFC] Refactor managed var codegen (PR #85976)
@@ -1160,9 +1152,8 @@ void CGNVCUDARuntime::createOffloadingEntries() { // Returns module constructor to be added. llvm::Function *CGNVCUDARuntime::finalizeModule() { + transformManagedVars(); Artem-B wrote: This does not look like "NFC" as we now perform the transform for the host compilation, too. I assume we do have existing tests covering generation of the variables. https://github.com/llvm/llvm-project/pull/85976
[clang] [llvm] [InstCombine] Canonicalize `(sitofp x)` -> `(uitofp x)` if `x >= 0` (PR #82404)
Artem-B wrote: We happen to have a back-end where we do not have conversion instructions between unsigned int and FP, so this patch complicates things. Would it make sense to enable this canonicalization only if the target wants it? https://github.com/llvm/llvm-project/pull/82404
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
@@ -2863,3 +2863,18 @@ void tools::addOutlineAtomicsArgs(const Driver , const ToolChain , CmdArgs.push_back("+outline-atomics"); } } + +void tools::addOffloadCompressArgs(const llvm::opt::ArgList , + llvm::opt::ArgStringList ) { + if (TCArgs.hasFlag(options::OPT_offload_compress, + options::OPT_no_offload_compress, false)) +CmdArgs.push_back("-compress"); + if (TCArgs.hasArg(options::OPT_v)) +CmdArgs.push_back("-verbose"); + if (auto *Arg = + TCArgs.getLastArg(options::OPT_offload_compression_level_EQ)) { +std::string CompressionLevelArg = +std::string("-compression-level=") + Arg->getValue(); +CmdArgs.push_back(TCArgs.MakeArgString(CompressionLevelArg)); Artem-B wrote: This may be collapsed to just ``` CmdArgs.push_back(TCArgs.MakeArgString("-compression-level=" + Arg->getValue()))`. ``` Maybe with a `Twine` or `StringRef` wrapping the string literal. https://github.com/llvm/llvm-project/pull/83605 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
https://github.com/Artem-B approved this pull request. LGTM. https://github.com/llvm/llvm-project/pull/83605
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/83605
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
https://github.com/Artem-B approved this pull request. LGTM overall, with docs/comment nits. https://github.com/llvm/llvm-project/pull/84367
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
@@ -503,18 +503,20 @@ void NVPTX::Assembler::ConstructJob(Compilation , const JobAction , Exec, CmdArgs, Inputs, Output)); } -static bool shouldIncludePTX(const ArgList , const char *gpu_arch) { - bool includePTX = true; - for (Arg *A : Args) { -if (!(A->getOption().matches(options::OPT_cuda_include_ptx_EQ) || - A->getOption().matches(options::OPT_no_cuda_include_ptx_EQ))) - continue; +static bool shouldIncludePTX(const ArgList , StringRef InputArch) { + // The new driver does not include PTX by default. + bool includePTX = !Args.hasFlag(options::OPT_offload_new_driver, Artem-B wrote: I'd add a comment on why we're making this decision based on the new vs old driver. https://github.com/llvm/llvm-project/pull/84367 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
Artem-B wrote: > > > Should I make `shouldIncludePTX` default to `false` for the new driver? > > > > > > Yes, I think that's a better default. > > Done, now requires `--cuda-include-ptx=`. This may be worth adding to the release notes. https://github.com/llvm/llvm-project/pull/84367
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
Artem-B wrote: > Should I make `shouldIncludePTX` default to `false` for the new driver? Yes, I think that's a better default. https://github.com/llvm/llvm-project/pull/84367
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
@@ -4625,7 +4625,15 @@ Action *Driver::BuildOffloadingActions(Compilation , DDeps.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind); OffloadAction::DeviceDependences DDep; DDep.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind); + + // Compiling CUDA in non-RDC mode uses the PTX output if available. Artem-B wrote: Do we still respect `--cuda-include-ptx=...` ? https://github.com/llvm/llvm-project/pull/84367 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [CUDA] Include PTX in non-RDC mode using the new driver (PR #84367)
@@ -4625,7 +4625,15 @@ Action *Driver::BuildOffloadingActions(Compilation , DDeps.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind); OffloadAction::DeviceDependences DDep; DDep.add(*A, *TCAndArch->first, TCAndArch->second.data(), Kind); + + // Compiling CUDA in non-RDC mode uses the PTX output if available. + for (Action *Input : A->getInputs()) +if (Kind == Action::OFK_Cuda && A->getType() == types::TY_Object && +!Args.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc, Artem-B wrote: I'm not quite sure why we would need to include PTX for RDC compilation. In retrospect, including PTX by default with all compilations turned out to be a wrong default choice. It's just a waste of space for most of the users, and it allows problems to go unnoticed for longer than they should (e.g. something was compiled for a wrong GPU). Switching to the new driver is a good point to make a better choice. I would argue that we should not be including PTX by default or, if we do deem that it may be useful, only add it for the most recent chosen GPU variant, to provide some forward compatibility, not for all of them. https://github.com/llvm/llvm-project/pull/84367 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang][CUDA] Disable float128 diagnostics for device compilation (PR #83918)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/83918
[clang] [clang][CUDA] Disable float128 diagnostics for device compilation (PR #83918)
@@ -0,0 +1,9 @@ +// CPU-side compilation on x86 (no errors expected). +// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -aux-triple nvptx64 -x cuda -fsyntax-only -verify %s + +// GPU-side compilation on x86 (no errors expected) +// RUN: %clang_cc1 -triple nvptx64 -aux-triple x86_64-unknown-linux-gnu -fcuda-is-device -x cuda -fsyntax-only -verify %s Artem-B wrote: I'd add a test verifying that we do emit diagnostics if fp128 is used in the GPU code. It would probably need to be done somewhere in the codegen tests as it will not fire in the syntax-only checks. https://github.com/llvm/llvm-project/pull/83918
[clang] [clang][CUDA] Disable float128 diagnostics for device compilation (PR #83918)
@@ -4877,7 +4877,9 @@ void Sema::AddModeAttr(Decl *D, const AttributeCommonInfo , NewElemTy = Context.getRealTypeForBitwidth(DestWidth, ExplicitType); if (NewElemTy.isNull()) { -Diag(AttrLoc, diag::err_machine_mode) << 1 /*Unsupported*/ << Name; +// Only emit diagnostic on host for 128-bit mode attribute Artem-B wrote: OK. As long as you're sure that the remaining diag covers all possible uses of fp128 on the GPU, it should be fine. https://github.com/llvm/llvm-project/pull/83918 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [CUDA] Correctly set CUDA default architecture (PR #84017)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/84017
[clang] [clang][CUDA] Disable float128 diagnostics for device compilation (PR #83918)
@@ -4877,7 +4877,9 @@ void Sema::AddModeAttr(Decl *D, const AttributeCommonInfo , NewElemTy = Context.getRealTypeForBitwidth(DestWidth, ExplicitType); if (NewElemTy.isNull()) { -Diag(AttrLoc, diag::err_machine_mode) << 1 /*Unsupported*/ << Name; +// Only emit diagnostic on host for 128-bit mode attribute Artem-B wrote: > This is going to error out like this: > > ``` > error: 'a' requires 128 bit size '__float128' type support, but target > 'nvptx64-nvidia-cuda' does not support it > ``` Something does not add up. How would we get `target 'nvptx64-nvidia-cuda'` if the diag below only fires if we're compiling for the host? https://github.com/llvm/llvm-project/pull/83918 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [clang][CUDA] Disable float128 diagnostics for device compilation (PR #83918)
@@ -4877,7 +4877,9 @@ void Sema::AddModeAttr(Decl *D, const AttributeCommonInfo , NewElemTy = Context.getRealTypeForBitwidth(DestWidth, ExplicitType); if (NewElemTy.isNull()) { -Diag(AttrLoc, diag::err_machine_mode) << 1 /*Unsupported*/ << Name; +// Only emit diagnostic on host for 128-bit mode attribute Artem-B wrote: What do you expect to see if __float128 is used from a GPU function. Can you check on a toy example. ``` __attribute__((device)) __float128 f(__float128 a, float b) { __float128 c = b + 1.0; return a + c; } ``` https://github.com/llvm/llvm-project/pull/83918 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/83605
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
@@ -906,6 +906,16 @@ CreateFileHandler(MemoryBuffer , } OffloadBundlerConfig::OffloadBundlerConfig() { + if (llvm::compression::zstd::isAvailable()) { +CompressionFormat = llvm::compression::Format::Zstd; +// Use a high zstd compress level by default for better size reduction. Artem-B wrote: Also, I've just discovered that zstd already has https://github.com/facebook/zstd/blob/b293d2ebc3a5d29309390a70b3e7861b6f5133ec/lib/zstd.h#L394 ``` ZSTD_c_enableLongDistanceMatching=160, /* Enable long distance matching. * This parameter is designed to improve compression ratio * for large inputs, by finding large matches at long distance. * It increases memory usage and window size. * Note: enabling this parameter increases default ZSTD_c_windowLog to 128 MB * except when expressly set to a different value. * Note: will be enabled by default if ZSTD_c_windowLog >= 128 MB and * compression strategy >= ZSTD_btopt (== compression level 16+) */ ``` This sounds like something we could use here. https://github.com/llvm/llvm-project/pull/83605 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] fix host-used external kernel (PR #83870)
@@ -24,6 +24,7 @@ // NEG-NOT: @__clang_gpu_used_external = {{.*}} @_Z7kernel2v // NEG-NOT: @__clang_gpu_used_external = {{.*}} @_Z7kernel3v +// XEG-NOT: @__clang_gpu_used_external = {{.*}} @_Z7kernel5v Artem-B wrote: Did you mean `NEG-NOT` ? https://github.com/llvm/llvm-project/pull/83870 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] fix host-used external kernel (PR #83870)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/83870
[clang] [HIP] fix host-used external kernel (PR #83870)
https://github.com/Artem-B approved this pull request. LGTM in principle, but I'd run it by someone with more familiarity with linking quirks. @MaskRay PTAL, when you get a chance. https://github.com/llvm/llvm-project/pull/83870
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
@@ -906,6 +906,16 @@ CreateFileHandler(MemoryBuffer , } OffloadBundlerConfig::OffloadBundlerConfig() { + if (llvm::compression::zstd::isAvailable()) { +CompressionFormat = llvm::compression::Format::Zstd; +// Use a high zstd compress level by default for better size reduction.

Artem-B wrote: I'd add more details here. While higher compression levels usually do improve the compression ratio, in a typical use case it's an incremental improvement. Here, we do it to achieve a dramatic increase in compression ratio by exploiting the fact that we carry multiple sets of very similar large bitcode blobs, and that we need a compression level high enough to fit one complete blob into the compression window. At least that's the theory. Should we print a warning (or just document it?) when the compression level ends up being below what we'd expect? Considering that good compression starts at zstd-20, I suspect that the compression ratio will drop back to ~2.5x if the binary for one GPU doubles in size and no longer fits. On top of that, compression time will also increase, a lot. That will be a rather unpleasant surprise for whoever runs into it. ZSTD's current compression parameters are set this way: https://github.com/facebook/zstd/blob/dev/lib/compress/clevels.h#L47

```
{ 23, 24, 22, 7, 3,256, ZSTD_btultra2},  /* level 19 */
{ 25, 25, 23, 7, 3,256, ZSTD_btultra2},  /* level 20 */
```

The first three numbers are log2 of (largest match distance, fully searched segment, dispatch table). 2^25 = 32MB, which happens to be about the size of the single GPU binary in your example. I'm pretty sure this explains why `zstd-20` works so well on it, while zstd-19 does not. It will work well for smaller binaries, but I'm pretty sure it will regress for a slightly larger binary. I think it may be worth experimenting with fine-tuning the compression settings: instead of blindly setting `zstd-20`, consider the size of the binary we need to deal with, and adjust only windowLog/chainLog appropriately. 
Or we could set the default to a lower compression level + a large windowLog. This should still give us most of the compression benefits for the binaries that fit into the window, but would avoid the performance cliff if the binary is too large. I may be overcomplicating it, too. If someone does run into the problem, they now have a way to work around it by tweaking the compression level. https://github.com/llvm/llvm-project/pull/83605
[clang] [llvm] [HIP] add --offload-compression-level= option (PR #83605)
@@ -942,20 +942,28 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer , Input.getBuffer().size()); llvm::compression::Format CompressionFormat; + int Level; - if (llvm::compression::zstd::isAvailable()) + if (llvm::compression::zstd::isAvailable()) { CompressionFormat = llvm::compression::Format::Zstd; - else if (llvm::compression::zlib::isAvailable()) +// Use a high zstd compress level by default for better size reduction. +const int DefaultZstdLevel = 20; Artem-B wrote: > compiling kernels to bitcode for 6 GPU takes 30s. compression with zstd level > 20 takes 2s. This looks acceptable for me. > unless zstd can be parallelized. zstd does support multithreaded compression, but enabling it would run into the same issue we had with enabling multi-threaded compilation -- it will interfere with the build system's idea of resource usage. https://github.com/llvm/llvm-project/pull/83605 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [HIP] change compress level (PR #83605)
@@ -942,20 +942,28 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer , Input.getBuffer().size()); llvm::compression::Format CompressionFormat; + int Level; - if (llvm::compression::zstd::isAvailable()) + if (llvm::compression::zstd::isAvailable()) { CompressionFormat = llvm::compression::Format::Zstd; - else if (llvm::compression::zlib::isAvailable()) +// Use a high zstd compress level by default for better size reduction. +const int DefaultZstdLevel = 20; Artem-B wrote: What's the default compression level for zstd? It would be great if we could override the compression level. I'm somewhat reluctant to impose max compression level on everyone by default, without any way out, if it turns out to be a problem. @MaskRay WDYT? Max compression level may be fine. If we produce enough stuff for compression to take long, compilation time itself will likely dwarf the compression time. For the small TUs, even slow compression may be fine. @yxsamliu how long the compilation w/o compression takes in your benchmarks? https://github.com/llvm/llvm-project/pull/83605 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] fix host min/max in header (PR #82956)
Artem-B wrote: > Probably I need to define those functions with mixed args by default to avoid > regressions. Are there any other regressions? Can hipCUB be fixed instead? While their use case is probably benign, I'd rather fix the user code than propagate CUDA bugs into HIP. https://github.com/llvm/llvm-project/pull/82956
[clang] [HIP] fix host min/max in header (PR #82956)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/82956
[clang] [HIP] fix host min/max in header (PR #82956)
@@ -1306,15 +1306,73 @@ float min(float __x, float __y) { return __builtin_fminf(__x, __y); } __DEVICE__ double min(double __x, double __y) { return __builtin_fmin(__x, __y); } -#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) -__host__ inline static int min(int __arg1, int __arg2) { - return __arg1 < __arg2 ? __arg1 : __arg2; +// Define host min/max functions. +#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) && \ +!defined(__HIP_NO_HOST_MIN_MAX_IN_GLOBAL_NAMESPACE__) + +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#define DEFINE_MIN_MAX_FUNCTIONS(ret_type, type1, type2) \ + inline ret_type min(const type1 __a, const type2 __b) { \ +return (__a < __b) ? __a : __b; \ + } \ + inline ret_type max(const type1 __a, const type2 __b) { \ +return (__a > __b) ? __a : __b; \ + } + +// Define min and max functions for same type comparisons +DEFINE_MIN_MAX_FUNCTIONS(int, int, int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, unsigned int, unsigned int) +DEFINE_MIN_MAX_FUNCTIONS(long, long, long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, unsigned long, unsigned long) +DEFINE_MIN_MAX_FUNCTIONS(long long, long long, long long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, unsigned long long, + unsigned long long) + +// CUDA defines host min/max functions with mixed signed/unsgined integer +// parameters where signed integers are casted to unsigned integers. However, +// this may not be users' intention. Therefore do not define them by default +// unless users specify -D__HIP_DEFINE_MIXED_HOST_MIN_MAX__. Artem-B wrote: Nit: signed integers are implicitly promoted to unsigned ones due to the integer promotion rules. Cast would imply intentional cast and we're not doing that. I'd rephrase it a bit along the lines of: The routines below will perform unsigned comparison, which may produce invalid results if a signed integer was passed unintentionally. 
We do not want it to happen silently, and do not provide these overloads by default. However, for compatibility with CUDA, we allow them if explicitly requested by the user by defining `__HIP_DEFINE_MIXED_HOST_MIN_MAX__`. https://github.com/llvm/llvm-project/pull/82956
[clang] [HIP] fix host min/max in header (PR #82956)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/82956
[clang] [HIP] fix host min/max in header (PR #82956)
@@ -1306,15 +1306,68 @@ float min(float __x, float __y) { return __builtin_fminf(__x, __y); } __DEVICE__ double min(double __x, double __y) { return __builtin_fmin(__x, __y); } -#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) -__host__ inline static int min(int __arg1, int __arg2) { - return __arg1 < __arg2 ? __arg1 : __arg2; +// Define host min/max functions. +#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) && \ +!defined(__HIP_NO_HOST_MIN_MAX_IN_GLOBAL_NAMESPACE__) + +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#define DEFINE_MIN_MAX_FUNCTIONS(ret_type, type1, type2) \ + static inline ret_type min(const type1 __a, const type2 __b) { \ +return (__a < __b) ? __a : __b; \ + } \ + static inline ret_type max(const type1 __a, const type2 __b) { \ +return (__a > __b) ? __a : __b; \ + } + +// Define min and max functions for same type comparisons +DEFINE_MIN_MAX_FUNCTIONS(int, int, int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, unsigned int, unsigned int) +DEFINE_MIN_MAX_FUNCTIONS(long, long, long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, unsigned long, unsigned long) +DEFINE_MIN_MAX_FUNCTIONS(long long, long long, long long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, unsigned long long, + unsigned long long) + +// Define min and max functions for all mixed type comparisons +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, int, unsigned int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, unsigned int, int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, long, unsigned long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, unsigned long, long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, long long, unsigned long long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, unsigned long long, long long) Artem-B wrote: Not everything CUDA does is the right model to follow. This may be one of the cases where we should improve things, if we can, instead of just copying the broken behavior. 
Not adding problematic things is easier than removing them later, when they are used, intentionally or not. The fact that HIP currently does not have those functions suggests that there is probably no existing HIP code depending on them. Existing CUDA code which may need those functions will need some amount of porting to HIP anyway, so fixing the source code could be done as part of the porting effort. We could put those mixed min/max functions under some preprocessor guard, which would keep them disabled by default. If someone desperately needs them, they would have to specify `-DPLEASE_ENABLE_BROKEN_MINMAX_ON_MIXED_SIGNED_UNSIGNED_TYPES`. https://github.com/llvm/llvm-project/pull/82956
[clang] [HIP] fix host min/max in header (PR #82956)
@@ -1306,15 +1306,68 @@ float min(float __x, float __y) { return __builtin_fminf(__x, __y); } __DEVICE__ double min(double __x, double __y) { return __builtin_fmin(__x, __y); } -#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) -__host__ inline static int min(int __arg1, int __arg2) { - return __arg1 < __arg2 ? __arg1 : __arg2; +// Define host min/max functions. +#if !defined(__HIPCC_RTC__) && !defined(__OPENMP_AMDGCN__) && \ +!defined(__HIP_NO_HOST_MIN_MAX_IN_GLOBAL_NAMESPACE__) + +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#pragma push_macro("DEFINE_MIN_MAX_FUNCTIONS") +#define DEFINE_MIN_MAX_FUNCTIONS(ret_type, type1, type2) \ + static inline ret_type min(const type1 __a, const type2 __b) { \ +return (__a < __b) ? __a : __b; \ + } \ + static inline ret_type max(const type1 __a, const type2 __b) { \ +return (__a > __b) ? __a : __b; \ + } + +// Define min and max functions for same type comparisons +DEFINE_MIN_MAX_FUNCTIONS(int, int, int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, unsigned int, unsigned int) +DEFINE_MIN_MAX_FUNCTIONS(long, long, long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, unsigned long, unsigned long) +DEFINE_MIN_MAX_FUNCTIONS(long long, long long, long long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, unsigned long long, + unsigned long long) + +// Define min and max functions for all mixed type comparisons +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, int, unsigned int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned int, unsigned int, int) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, long, unsigned long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long, unsigned long, long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, long long, unsigned long long) +DEFINE_MIN_MAX_FUNCTIONS(unsigned long long, unsigned long long, long long) Artem-B wrote: I assume these are needed in order to avoid errors about ambiguous overload resolution when we pass signed/unsigned arguments. 
Normally, if we were to use the `std::min()` function, the user would have to explicitly cast the arguments or use `std::min<T>()` with an explicit template argument to resolve the issue. Implicitly converting int->unsigned under the hood is probably not a good idea here, as we do not know what the user needs/wants and whether it's working as intended or an error. For min/max, converting a negative argument into an unsigned one would probably be an error. I think we do need to force users to use one of the all-signed or all-unsigned variants here, too, same as with std::min/max. https://github.com/llvm/llvm-project/pull/82956
[clang] [NVPTX] Enable the _Float16 type for NVPTX compilation (PR #82436)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/82436
[clang] [Clang][NVPTX] Allow passing arguments to the linker while standalone (PR #73030)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/73030
[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)
https://github.com/Artem-B approved this pull request. Overall LGTM. Please wait for @jhuber6 to double-check the details of the partial linking mechanics. https://github.com/llvm/llvm-project/pull/81700
[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)
@@ -36,6 +47,146 @@ static std::string normalizeForBundler(const llvm::Triple , : T.normalize(); } +// Collect undefined __hip_fatbin* and __hip_gpubin_handle* symbols from all +// input object or archive files. +class HIPUndefinedFatBinSymbols { +public: + HIPUndefinedFatBinSymbols(const Compilation ) + : C(C), DiagID(C.getDriver().getDiags().getCustomDiagID( + DiagnosticsEngine::Error, + "Error collecting HIP undefined fatbin symbols: %0")), +Quiet(C.getArgs().hasArg(options::OPT__HASH_HASH_HASH)), +Verbose(C.getArgs().hasArg(options::OPT_v)) { +populateSymbols(); +if (Verbose) { + for (auto Name : FatBinSymbols) +llvm::errs() << "Found undefined HIP fatbin symbol: " << Name << "\n"; + for (auto Name : GPUBinHandleSymbols) +llvm::errs() << "Found undefined HIP gpubin handle symbol: " << Name + << "\n"; +} + } + + const std::set () const { +return FatBinSymbols; + } + + const std::set () const { +return GPUBinHandleSymbols; + } + +private: + const Compilation + unsigned DiagID; + bool Quiet; + bool Verbose; + std::set FatBinSymbols; + std::set GPUBinHandleSymbols; + const std::string FatBinPrefix = "__hip_fatbin"; + const std::string GPUBinHandlePrefix = "__hip_gpubin_handle"; + + void populateSymbols() { +std::deque WorkList; +std::set Visited; + +for (const auto : C.getActions()) { + WorkList.push_back(Action); +} + +while (!WorkList.empty()) { + const Action *CurrentAction = WorkList.front(); + WorkList.pop_front(); + + if (!CurrentAction || !Visited.insert(CurrentAction).second) +continue; + + if (const auto *IA = dyn_cast(CurrentAction)) { +std::string ID = IA->getId().str(); +if (!ID.empty()) { + ID = llvm::utohexstr(llvm::MD5Hash(ID), /*LowerCase=*/true); + FatBinSymbols.insert(Twine(FatBinPrefix + "_" + ID).str()); + GPUBinHandleSymbols.insert( + Twine(GPUBinHandlePrefix + "_" + ID).str()); + continue; +} +const char *Filename = IA->getInputArg().getValue(); +auto BufferOrErr = llvm::MemoryBuffer::getFile(Filename); +// Input action could be options to 
linker, therefore ignore it +// if cannot read it. +if (!BufferOrErr) + continue; + +processInput(BufferOrErr.get()->getMemBufferRef()); + } else +WorkList.insert(WorkList.end(), CurrentAction->getInputs().begin(), +CurrentAction->getInputs().end()); +} + } + + void processInput(const llvm::MemoryBufferRef ) { +// Try processing as object file first. +auto ObjFileOrErr = llvm::object::ObjectFile::createObjectFile(Buffer); +if (ObjFileOrErr) { + processSymbols(**ObjFileOrErr); + return; +} + +// Then try processing as archive files. +llvm::consumeError(ObjFileOrErr.takeError()); +auto ArchiveOrErr = llvm::object::Archive::create(Buffer); +if (ArchiveOrErr) { + llvm::Error Err = llvm::Error::success(); + llvm::object::Archive = *ArchiveOrErr.get(); + for (auto : Archive.children(Err)) { +auto ChildBufOrErr = Child.getMemoryBufferRef(); +if (ChildBufOrErr) + processInput(*ChildBufOrErr); +else + errorHandler(ChildBufOrErr.takeError()); + } + + if (Err) +errorHandler(std::move(Err)); + return; +} + +// Ignore other files. +llvm::consumeError(ArchiveOrErr.takeError()); + } + void processSymbols(const llvm::object::ObjectFile ) { +for (const auto : Obj.symbols()) { + auto FlagOrErr = Symbol.getFlags(); + if (!FlagOrErr) { +errorHandler(FlagOrErr.takeError()); +continue; + } + + // Filter only undefined symbols + if (!(FlagOrErr.get() & llvm::object::SymbolRef::SF_Undefined)) { Artem-B wrote: style nit: remove `{}` around single-statement body. Applies here and in a handful of other places throughout the patch. https://github.com/llvm/llvm-project/pull/81700 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)
@@ -36,6 +47,146 @@ static std::string normalizeForBundler(const llvm::Triple , : T.normalize(); } +// Collect undefined __hip_fatbin* and __hip_gpubin_handle* symbols from all +// input object or archive files. +class HIPUndefinedFatBinSymbols { +public: + HIPUndefinedFatBinSymbols(const Compilation ) + : C(C), DiagID(C.getDriver().getDiags().getCustomDiagID( + DiagnosticsEngine::Error, + "Error collecting HIP undefined fatbin symbols: %0")), +Quiet(C.getArgs().hasArg(options::OPT__HASH_HASH_HASH)), +Verbose(C.getArgs().hasArg(options::OPT_v)) { +populateSymbols(); +if (Verbose) { + for (auto Name : FatBinSymbols) +llvm::errs() << "Found undefined HIP fatbin symbol: " << Name << "\n"; + for (auto Name : GPUBinHandleSymbols) +llvm::errs() << "Found undefined HIP gpubin handle symbol: " << Name + << "\n"; +} + } + + const std::set () const { +return FatBinSymbols; + } + + const std::set () const { +return GPUBinHandleSymbols; + } + +private: + const Compilation + unsigned DiagID; + bool Quiet; + bool Verbose; + std::set FatBinSymbols; + std::set GPUBinHandleSymbols; + const std::string FatBinPrefix = "__hip_fatbin"; + const std::string GPUBinHandlePrefix = "__hip_gpubin_handle"; + + void populateSymbols() { +std::deque WorkList; +std::set Visited; + +for (const auto : C.getActions()) { + WorkList.push_back(Action); +} + +while (!WorkList.empty()) { + const Action *CurrentAction = WorkList.front(); + WorkList.pop_front(); + + if (!CurrentAction || !Visited.insert(CurrentAction).second) +continue; + + if (const auto *IA = dyn_cast(CurrentAction)) { +std::string ID = IA->getId().str(); +if (!ID.empty()) { + ID = llvm::utohexstr(llvm::MD5Hash(ID), /*LowerCase=*/true); + FatBinSymbols.insert(Twine(FatBinPrefix + "_" + ID).str()); + GPUBinHandleSymbols.insert( + Twine(GPUBinHandlePrefix + "_" + ID).str()); + continue; +} +const char *Filename = IA->getInputArg().getValue(); +auto BufferOrErr = llvm::MemoryBuffer::getFile(Filename); +// Input action could be options to 
linker, therefore ignore it +// if cannot read it. Artem-B wrote: Comment could use some editing. `therefore, ignore an error if we fail to read the file`. This makes me ask -- what if the argument *is* an input file, and we do fail to read it. How do we tell apart the linker options from the input file? Relying on a failure to read it does not seem to be a good way to handle it. https://github.com/llvm/llvm-project/pull/81700 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)
@@ -36,6 +47,146 @@ static std::string normalizeForBundler(const llvm::Triple , : T.normalize(); } +// Collect undefined __hip_fatbin* and __hip_gpubin_handle* symbols from all +// input object or archive files. +class HIPUndefinedFatBinSymbols { +public: + HIPUndefinedFatBinSymbols(const Compilation ) + : C(C), DiagID(C.getDriver().getDiags().getCustomDiagID( + DiagnosticsEngine::Error, + "Error collecting HIP undefined fatbin symbols: %0")), +Quiet(C.getArgs().hasArg(options::OPT__HASH_HASH_HASH)), +Verbose(C.getArgs().hasArg(options::OPT_v)) { +populateSymbols(); +if (Verbose) { + for (auto Name : FatBinSymbols) +llvm::errs() << "Found undefined HIP fatbin symbol: " << Name << "\n"; + for (auto Name : GPUBinHandleSymbols) +llvm::errs() << "Found undefined HIP gpubin handle symbol: " << Name + << "\n"; +} + } + + const std::set () const { +return FatBinSymbols; + } + + const std::set () const { +return GPUBinHandleSymbols; + } + +private: + const Compilation + unsigned DiagID; + bool Quiet; + bool Verbose; + std::set FatBinSymbols; + std::set GPUBinHandleSymbols; + const std::string FatBinPrefix = "__hip_fatbin"; + const std::string GPUBinHandlePrefix = "__hip_gpubin_handle"; + + void populateSymbols() { +std::deque WorkList; +std::set Visited; + +for (const auto : C.getActions()) { + WorkList.push_back(Action); +} + +while (!WorkList.empty()) { + const Action *CurrentAction = WorkList.front(); + WorkList.pop_front(); + + if (!CurrentAction || !Visited.insert(CurrentAction).second) +continue; + + if (const auto *IA = dyn_cast(CurrentAction)) { +std::string ID = IA->getId().str(); +if (!ID.empty()) { + ID = llvm::utohexstr(llvm::MD5Hash(ID), /*LowerCase=*/true); + FatBinSymbols.insert(Twine(FatBinPrefix + "_" + ID).str()); + GPUBinHandleSymbols.insert( + Twine(GPUBinHandlePrefix + "_" + ID).str()); + continue; +} +const char *Filename = IA->getInputArg().getValue(); +auto BufferOrErr = llvm::MemoryBuffer::getFile(Filename); +// Input action could be options to 
linker, therefore ignore it +// if cannot read it. +if (!BufferOrErr) + continue; + +processInput(BufferOrErr.get()->getMemBufferRef()); + } else +WorkList.insert(WorkList.end(), CurrentAction->getInputs().begin(), +CurrentAction->getInputs().end()); +} + } + + void processInput(const llvm::MemoryBufferRef ) { +// Try processing as object file first. +auto ObjFileOrErr = llvm::object::ObjectFile::createObjectFile(Buffer); +if (ObjFileOrErr) { + processSymbols(**ObjFileOrErr); + return; +} + +// Then try processing as archive files. +llvm::consumeError(ObjFileOrErr.takeError()); +auto ArchiveOrErr = llvm::object::Archive::create(Buffer); +if (ArchiveOrErr) { + llvm::Error Err = llvm::Error::success(); + llvm::object::Archive = *ArchiveOrErr.get(); + for (auto : Archive.children(Err)) { +auto ChildBufOrErr = Child.getMemoryBufferRef(); +if (ChildBufOrErr) + processInput(*ChildBufOrErr); +else + errorHandler(ChildBufOrErr.takeError()); + } + + if (Err) +errorHandler(std::move(Err)); + return; +} + +// Ignore other files. +llvm::consumeError(ArchiveOrErr.takeError()); + } + void processSymbols(const llvm::object::ObjectFile ) { Artem-B wrote: Nit -- add an empty line to separate functions. https://github.com/llvm/llvm-project/pull/81700 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/81700
[clang] [llvm] [LLVM] Add `__builtin_readsteadycounter` intrinsic and builtin for realtime clocks (PR #81331)
@@ -104,6 +104,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const { case ISD::ATOMIC_STORE: return "AtomicStore"; case ISD::PCMARKER: return "PCMarker"; case ISD::READCYCLECOUNTER: return "ReadCycleCounter"; + case ISD::READSTEADYCOUNTER: return "ReadFixedTimer"; Artem-B wrote: Should it be "ReadSteadyCounter" ? Also, whitespace/alignment looks off. https://github.com/llvm/llvm-project/pull/81331 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LLVM] Add `__builtin_readsteadycounter` intrinsic and builtin for realtime clocks (PR #81331)
@@ -2764,6 +2764,37 @@ Query for this feature with ``__has_builtin(__builtin_readcyclecounter)``. Note that even if present, its use may depend on run-time privilege or other OS controlled state. +``__builtin_readsteadycounter`` +------------------------------- + +``__builtin_readsteadycounter`` is used to access the fixed frequency counter +register (or a similar steady-rate clock) on those targets that support it. +The function is similar to ``__builtin_readcyclecounter`` above except that the +frequency is fixed, making it suitable for measuring elapsed time. Artem-B wrote: Should we mention that we do not guarantee any particular frequency, just that it's stable, and that it's up to the user to figure out the actual frequency if they need to? https://github.com/llvm/llvm-project/pull/81331
[clang] [llvm] [LLVM] Add `__builtin_readsteadycounter` intrinsic and builtin for realtime clocks (PR #81331)
https://github.com/Artem-B commented: LGTM with few nits for general and NVPTX parts. https://github.com/llvm/llvm-project/pull/81331 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LLVM] Add `__builtin_readsteadycounter` intrinsic and builtin for realtime clocks (PR #81331)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/81331
[clang] [llvm] [NVPTX] Add builtin support for 'globaltimer' (PR #79765)
@@ -140,6 +140,17 @@ define void @test_exit() { ret void } +; CHECK-LABEL: test_globaltimer +define i64 @test_globaltimer() { +; CHECK: mov.u64 %r{{.*}}, %globaltimer; + %a = tail call i64 @llvm.nvvm.read.ptx.sreg.globaltimer() Artem-B wrote: These need sm_30+. Right now the test runs with sm_20. LLVM does compile these intrinsics, but ptxas fails because the register is not available on sm_20. The test needs to be updated to use a reasonably new GPU target. Probably sm_60 is the oldest one anybody still cares about. https://github.com/llvm/llvm-project/pull/79765
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
Artem-B wrote: LGTM https://github.com/llvm/llvm-project/pull/81277
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
@@ -159,6 +159,7 @@ BUILTIN(__nvvm_read_ptx_sreg_pm3, "i", "n") BUILTIN(__nvvm_prmt, "UiUiUiUi", "") BUILTIN(__nvvm_exit, "v", "r") +BUILTIN(__nvvm_reflect, "UicC*", "r") Artem-B wrote: Now that we're exposing it to the end users, we should probably document what it does. Probably somewhere in https://clang.llvm.org/docs/LanguageExtensions.html https://github.com/llvm/llvm-project/pull/81277
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
@@ -1624,8 +1624,9 @@ def int_nvvm_compiler_error : def int_nvvm_compiler_warn : Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">; -def int_nvvm_reflect : - Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">; +def int_nvvm_reflect : + Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">, + ClangBuiltin<"__nvvm_reflect">; Artem-B wrote: I vaguely recall that OpenCL folks had to use it with a slightly different signature. I think their pointer argument was in an unusual address space, where OCL keeps their string constants. It would be great to double check that the new builtin does not break them. https://github.com/llvm/llvm-project/pull/81277 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/81277
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
https://github.com/Artem-B approved this pull request. LGTM overall. https://github.com/llvm/llvm-project/pull/81277
[clang] [llvm] [NVPTX] Add clang builtin for `__nvvm_reflect` intrinsic (PR #81277)
Artem-B wrote: > We should expose it as an intrinsic I think you mean `builtin` here. https://github.com/llvm/llvm-project/pull/81277
[clang] [llvm] [LinkerWrapper] Allow 'all' as a generic bundled architecture (PR #81193)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/81193
[clang] [llvm] [LinkerWrapper] Allow 'all' as a generic bundled architecture (PR #81193)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/81193 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported (PR #81033)
Artem-B wrote: > Okay, `__nvvm_reflect` doesn't work fully here because the `nanosleep` > builtin I added requires `sm_70` at the clang level. Either means I'd need to > go back to inline assembly or remove that requirement at least from clang so > it's a backend failure. The question is -- who's going to provide a fallback implementation of the `nanosleep` builtin for the older GPUs. I do not think it's LLVM's job, so constraining the builtin is appropriate. However, nothing stops you from providing your own implementation in libc using inline asm. Something along these lines (`__nvvm_reflect` takes a string, and `"__CUDA_ARCH"` reflects to the SM version times 10):
```
__device__ void my_nanosleep(unsigned N) {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700) {
    asm volatile("nanosleep.u32 %0;" ::"r"(N));
  } else {
    while (N--)
      asm volatile(""); // something unoptimizable
  }
}
```
https://github.com/llvm/llvm-project/pull/81033 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported (PR #81033)
Artem-B wrote: > This patch, which simply makes it legal on all architectures but does nothing > if it's older than sm_70. I do not think this is the right thing to do. "do nothing" is not what one would expect from a `nanosleep`. Let's unpack your problem a bit. __nvvm_reflect() is probably closest to what you would need. However, IIUIC, if you use it to provide a nanosleep-based variant and an alternative for the older GPUs, the `nanosleep` variant code will still hang off the dead branch of `if (__nvvm_reflect())`, and if it's not eliminated by DCE (which it will not be if optimizations are off), the resulting PTX will be invalid for the older GPUs. In other words, pushing the nanosleep implementation into an intrinsic makes things compile everywhere at the expense of doing the wrong thing on the older GPUs. I do not think it's a good trade-off. Perhaps a better approach would be to incorporate dead branch elimination into the NVVMReflect pass itself. We do know that it is the explicit intent of `__nvvm_reflect()`. If NVVMReflect explicitly guarantees that the dead branch will be gone, it should allow you to use approach `#1` without concern for whether optimizations are enabled, and you should be able to provide whatever alternative implementation you need (even if it's a null one), without affecting the correctness of LLVM itself. https://github.com/llvm/llvm-project/pull/81033 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [flang] [InstCombine] Canonicalize constant GEPs to i8 source element type (PR #68882)
Artem-B wrote: Another corner case here. Untyped GEP resulted in SimplifyCFG producing a `load(gep(argptr, cond ? 24 : 0))` instead of the `load(cond ? gep(argptr, 24) : argptr)` it produced before the patch, and that eventually prevented SROA from processing that load. While it's not a bug in this patch, the consequence is a pretty serious performance regression in some GPU code. And we do not have a workaround. :-/ Minimized reproducer: ``` # opt -passes='inline,simplifycfg,instcombine,sroa' -S https://github.com/llvm/llvm-project/pull/68882 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [AMDGPU] Diagnose unaligned atomic (PR #80322)
https://github.com/Artem-B approved this pull request. You may want to check that we can still disable the error with `-Wno-error=atomic-alignment` passed via top-level options. Other than that LGTM. https://github.com/llvm/llvm-project/pull/80322 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: > the idea is that it would be the desired effect if someone went out of their > way to do this GPU subset linking thing. That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users. Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-) Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: > I'm assuming you're talking about GPU-side constructors? I don't think the > CUDA runtime supports those, but OpenMP runs them when the image is loaded, > so it would handle both independently. Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc. > The idea is that users already get C++-like behavior with the new driver and > -fgpu-rdc generally Yes. That will set the default expectation that things work just like in C++, which is a great thing. But the introduction of partial subset linking will break some of those "just works" assumptions, and it may be triggered by parts of the build outside of the user's control (e.g. by a third-party library). Side note: we do need a good term for this kind of subset linking. "partial linking" already has an established meaning, and it's not a good fit here as we actually produce a fully linked GPU executable. > we don't need to worry about people being confused so long as we document > what it does. We do need to document how it works. Documenting what does not work, or works differently, is also important, IMO. We *do* need to worry about users and their expectations. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: Supporting such a mixed mode opens an interesting set of issues we may need to consider going forward:
* Who/where/how runs the initializers in the fully linked parts?
* Are public functions in the fully linked parts visible to the functions in the partially linked parts? In the full-rdc mode they would be, as in a plain C++ compilation. In the partial mode they would not, as the main GPU executable and the partial parts will be in separate executables. This would be OK for something like CUDA, where cross-TU references are usually limited to the host, but would be surprising for someone who expects C++-like behavior, which is sort of the ultimate goal for the offloading use case.
This will eventually become a problem if/when we grow a large enough set of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs if/when they fully link GPU-side code. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
@@ -265,6 +329,11 @@ Error runLinker(ArrayRef Files, const ArgList ) { LinkerArgs.push_back(Arg); if (Error Err = executeCommands(LinkerPath, LinkerArgs)) return Err; + + if (Args.hasArg(OPT_relocatable)) +if (Error Err = relocateOffloadSection(Args, ExecutableName)) Artem-B wrote: We could just `return relocateOffloadSection(Args, ExecutableName)` https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair; /// \param EntryArray Optional pair pointing to the `__start` and `__stop` /// symbols holding the `__tgt_offload_entry` array. /// \param Suffix An optional suffix appended to the emitted symbols. +/// \param Relocatable Indicate if we need to change the offloading section. Artem-B wrote: Nit: "Indicate whether the binary is a relocatable object" may work a bit better for describing intent. Current description seems to describe an implementation detail. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/Artem-B approved this pull request. LGTM. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/Artem-B edited https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage. > We also use a sepcial section called something like omp_offloading_entries Typo in 'special' in the description. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [NVPTX] Allow compiling LLVM-IR without `-march` set (PR #79873)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/79873 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [NVPTX] Allow compiling LLVM-IR without `-march` set (PR #79873)
Artem-B wrote: Considering that it's for stand-alone compilation only, I'm not going to block this patch. That said, please add a `TODO` somewhere to address the issue with explicitly targeting the generic variant. https://github.com/llvm/llvm-project/pull/79873 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [NVPTX] Allow compiling LLVM-IR without `-march` set (PR #79873)
Artem-B wrote: > Right now if you specify target-cpu you get target-cpu attributes, which is > what we don't want. I'm fine with handling 'generic' in a special way under the hood and not specifying a target CPU. My concern is about the user-facing interface. Command-line options must be overridable. For the CPU, I would be able to specify the variant that matches the default. For the GPU, I'll have no way to explicitly pick 'generic' as the target. I think this is important. https://github.com/llvm/llvm-project/pull/79873 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [NVPTX] Allow compiling LLVM-IR without `-march` set (PR #79873)
Artem-B wrote: > I think there's some precedent from both vendors to treat missing attributes > as a more generic target. It sounds more like a bug than a feature to me. The major difference between "you get sm_xx by default" and this "you get generic by default" is that with a specific sm_XX, I can override it both ways -- I can enable/disable it as needed, regardless of how it was specified before my overriding options. With the magic unnameable 'generic' target, I can only disable it by specifying a concrete architecture, but there's no way to enable it once a preceding option names some specific architecture. It makes little difference when you control the complete build, but that is not the case for all builds. E.g. TensorFlow builds with Bazel, and the end user does not have access to whatever compiler flags the global build rules may set. So if you want to build for the generic GPU target, you will have to jump through way more hoops than is reasonable, as opposed to specifying a few overriding options you're interested in. I'm fine with defaulting to such a generic target, but I do believe we need to handle it the same way as specific targets. https://github.com/llvm/llvm-project/pull/79873 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [CUDA] Change '__activemask' to use '__nvvm_activemask()' (PR #79892)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/79892 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [NVPTX] Allow compiling LLVM-IR without `-march` set (PR #79873)
Artem-B wrote: Relying on something *not* being defined is probably not the best way to handle the 'generic' target. For starters, it makes it hard or impossible to recreate the same compilation state by undoing an already-specified option. It also breaks the established assumption that there *is* a default target CPU/GPU. If we do want to have a generic GPU target, then we should grow an explicit 'generic' GPU variant, IMO. It would be the functional opposite of 'native'. https://github.com/llvm/llvm-project/pull/79873 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add builtin support for 'globaltimer' (PR #79765)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/79765 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add builtin support for 'nanosleep' PTX instruction (PR #79888)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/79888 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
https://github.com/Artem-B approved this pull request. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
@@ -4599,6 +4599,14 @@ def int_nvvm_vote_ballot_sync : [IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback], "llvm.nvvm.vote.ballot.sync">, ClangBuiltin<"__nvvm_vote_ballot_sync">; +// +// ACTIVEMASK +// +def int_nvvm_activemask : + Intrinsic<[llvm_i32_ty], [], +[IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback, IntrHasSideEffects], "llvm.nvvm.activemask">, + ClangBuiltin<"__nvvm_activemask">; Artem-B wrote: Separate patch is fine, too. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
@@ -65,7 +65,7 @@ def : Proc<"sm_61", [SM61, PTX50]>; def : Proc<"sm_62", [SM62, PTX50]>; def : Proc<"sm_70", [SM70, PTX60]>; def : Proc<"sm_72", [SM72, PTX61]>; -def : Proc<"sm_75", [SM75, PTX63]>; +def : Proc<"sm_75", [SM75, PTX62, PTX63]>; Artem-B wrote: I'm a bit confused here. Constraints on the PTX version for a GPU and for instructions are independent. You need both satisfied in order to use a given instruction on a given GPU. So, to use activemask on sm_75, you do need PTX63. To use it on sm_52, you only need PTX62. You do not need to change anything here. You already have the correct predicates applied to the instruction itself and to the target builtin. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
@@ -65,7 +65,7 @@ def : Proc<"sm_61", [SM61, PTX50]>; def : Proc<"sm_62", [SM62, PTX50]>; def : Proc<"sm_70", [SM70, PTX60]>; def : Proc<"sm_72", [SM72, PTX61]>; -def : Proc<"sm_75", [SM75, PTX63]>; +def : Proc<"sm_75", [SM75, PTX62, PTX63]>; Artem-B wrote: What are you trying to do with the PTX62 feature in the first place? Why do you need to add it here? In general, the features will be supplied externally. This particular place just sets the minimum required to support this particular GPU variant. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
@@ -65,7 +65,7 @@ def : Proc<"sm_61", [SM61, PTX50]>; def : Proc<"sm_62", [SM62, PTX50]>; def : Proc<"sm_70", [SM70, PTX60]>; def : Proc<"sm_72", [SM72, PTX61]>; -def : Proc<"sm_75", [SM75, PTX63]>; +def : Proc<"sm_75", [SM75, PTX62, PTX63]>; Artem-B wrote: Why are we adding PTX62 here? According to [PTX docs](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes-ptx-release-history) sm_75 has been introduced in PTX ISA 6.3 in CUDA-10.0. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
@@ -4599,6 +4599,14 @@ def int_nvvm_vote_ballot_sync : [IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback], "llvm.nvvm.vote.ballot.sync">, ClangBuiltin<"__nvvm_vote_ballot_sync">; +// +// ACTIVEMASK +// +def int_nvvm_activemask : + Intrinsic<[llvm_i32_ty], [], +[IntrInaccessibleMemOnly, IntrConvergent, IntrNoCallback, IntrHasSideEffects], "llvm.nvvm.activemask">, + ClangBuiltin<"__nvvm_activemask">; Artem-B wrote: Should we change `__activemask` to use the new builtin instead of inline asm? https://github.com/llvm/llvm-project/blob/eac8d713a6682417d06f5ee7f90a8ce54a281df8/clang/lib/Headers/__clang_cuda_intrinsics.h#L214 https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
Artem-B wrote: https://bugs.llvm.org/show_bug.cgi?id=35249 https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[llvm] [clang] [NVPTX] Add 'activemask' builtin and intrinsic support (PR #79768)
Artem-B wrote: 'activemask' is a rather peculiar instruction which may not be a good candidate for exposing it to LLVM. The problem is that it can 'observe' the past branch decisions and reflects the state of not-yet-reconverged conditional branches. LLVM does not take it into account. Opaque inline assembly is the sledgehammer which stops LLVM from doing anything fancy with it. The intrinsic will need to have appropriately conservative attributes, at the very least. I think we've had a bug about that and, if I recall correctly, we could not come up with a good way to handle activemask. Let me try finding the details. https://github.com/llvm/llvm-project/pull/79768 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits