[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-12 Thread Nico Weber via Phabricator via cfe-commits
thakis added a comment.

Bot's happy again. Thanks for the quick fix!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3644022 , @thakis wrote:

> This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt
>
> Please take a look and revert for now if it takes a while to fix.

Let me know if rGfe6a391357fc 
 resolved 
the issue.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3644022 , @thakis wrote:

> This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt
>
> Please take a look and revert for now if it takes a while to fix.

I changed some of the argument formats in a previous patch, probably messed up 
the rebase. I'll try to land a patch real quick.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Nico Weber via Phabricator via cfe-commits
thakis added a comment.

This breaks check-clang on mac: http://45.33.8.238/macm1/39907/step_7.txt

Please take a look and revert for now if it takes a while to fix.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3643802 , @tra wrote:

> For what it's worth, NCCL  is the only 
> nontrivial library that needs RDC compilation that I'm aware of.
> It's also self-contained for RDC purposes we only need to use RDC on the 
> library TUs and do not need to propagate it to all CUDA TUs in the build.
>
> I believe such 'constrained' RDC compilation will likely be the reasonable 
> practical trade-off. It may not become the default compilation mode, but we 
> should be able to control where the "fully linked GPU executable" boundary is 
> and it's not necessarily going to match the fully-linked host executable.

Theoretically we could do this with a relocatable link using the 
linker-wrapper. The only problem with this approach are the `__start/__stop` 
linker defined variables that we use to iterate the globals to be registered as 
these are tied to the section specifically. Potentially, we could move these to 
a unique section so they don't interfere with anything. So it would be 
something like this

  clang-linker-wrapper -r a.o b.o c.o -o registered.o // Contains RTL calls to 
register all globals at section 'cuda_offloading_entries_'
  llvm-strip ---remove-section .llvm.offloading registered.o // Remove embedded 
IR so no other files will link against it
  llvm-objcopy --rename-section 
cuda_offloading_entries=cuda_offloading_entries_ registered.o // Change the 
registration section to something unique

Think this would work?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D128914#3643495 , @jhuber6 wrote:

> Yes, it's actually pretty difficult to find a CUDA application using 
> `fgpu-rdc`. It seems much more common to just stick everything that's needed 
> in the file.I've considered finding a CUDA / HIP benchmark suite and 
> comparing compile times using the new driver stuff. The benefit of having 
> `fgpu-rdc` be the default is that device code basically behaves exactly like 
> host code and LTO makes `fgpu-rdc` behave like `fno-gpu-rdc` performance 
> wise. The downside, as you mentioned, is compile time.

For what it's worth, NCCL  is the only 
nontrivial library that needs RDC compilation that I'm aware of.
It's also self-contained for RDC purposes we only need to use RDC on the 
library TUs and do not need to propagate it to all CUDA TUs in the build.

I believe such 'constrained' RDC compilation will likely be the reasonable 
practical trade-off. It may not become the default compilation mode, but we 
should be able to control where the "fully linked GPU executable" boundary is 
and it's not necessarily going to match the fully-linked host executable.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3643451 , @yaxunl wrote:

> If you only unregister fatbin once for the whole program, then it should be 
> safe -fgpu-rdc. I am not sure if that is the case.

it should be here, the generated handle is private to the registration module 
we created We only make one and it's impossible for anyone else to touch it 
even if mixing rdc with non-rdc codes.

> My experience with -fgpu-rdc is that it causes much longer linking time for 
> large applications like PyTorch or TensroFlow, and LTO does not help. This is 
> because the compiler has lots of inter-procedural optimization passes which 
> take more than linear time. Due to that those apps need to be compiled as 
> -fno-gpu-rdc. Actually most CUDA/HIP applications are using -fno-gpu-rdc.

Yes, it's actually pretty difficult to find a CUDA application using 
`fgpu-rdc`. It seems much more common to just stick everything that's needed in 
the file.I've considered finding a CUDA / HIP benchmark suite and comparing 
compile times using the new driver stuff. The benefit of having `fgpu-rdc` be 
the default is that device code basically behaves exactly like host code and 
LTO makes `fgpu-rdc` behave like `fno-gpu-rdc` performance wise. The downside, 
as you mentioned, is compile time.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D128914#3643270 , @jhuber6 wrote:

>> There is only one fatbin for -fgpu-rdc mode but the fatbin unregister 
>> function is called multiple times in each TU. HIP runtime expects each 
>> fatbin is unregistered only once. The old embedding scheme introduced a weak 
>> symbol to track whether the fabin has been unregistered and to make sure it 
>> is only unregistered once.
>
> I see, this wrapping will only happen in RDC-mode so it's probably safe to 
> ignore here? When I support non-RDC mode in the new driver it will most 
> likely rely on the old code generation. Although it's entirely feasible to 
> make RDC-mode the default. There's no runtime overhead when using LTO.

If you only unregister fatbin once for the whole program, then it should be 
safe -fgpu-rdc. I am not sure if that is the case.

My experience with -fgpu-rdc is that it causes much longer linking time for 
large applications like PyTorch or TensroFlow, and LTO does not help. This is 
because the compiler has lots of inter-procedural optimization passes which 
take more than linear time. Due to that those apps need to be compiled as 
-fno-gpu-rdc. Actually most CUDA/HIP applications are using -fno-gpu-rdc.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rGce091eb3b91f: [HIP] Add support for handling HIP in the 
linker wrapper (authored by jhuber6).

Changed prior to commit:
  https://reviews.llvm.org/D128914?vs=442356=443724#toc

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

Files:
  clang/test/Driver/linker-wrapper-image.c
  clang/test/Driver/linker-wrapper.c
  clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.h

Index: clang/tools/clang-linker-wrapper/OffloadWrapper.h
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.h
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.h
@@ -21,4 +21,8 @@
 /// registers the images with the CUDA runtime.
 llvm::Error wrapCudaBinary(llvm::Module , llvm::ArrayRef Images);
 
+/// Wraps the input bundled image into the module \p M as global symbols and
+/// registers the images with the HIP runtime.
+llvm::Error wrapHIPBinary(llvm::Module , llvm::ArrayRef Images);
+
 #endif
Index: clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
@@ -22,6 +22,7 @@
 namespace {
 /// Magic number that begins the section containing the CUDA fatbinary.
 constexpr unsigned CudaFatMagic = 0x466243b1;
+constexpr unsigned HIPFatMagic = 0x48495046;
 
 /// Copied from clang/CGCudaRuntime.h.
 enum OffloadEntryKindFlag : uint32_t {
@@ -288,14 +289,15 @@
 
 /// Embed the image \p Image into the module \p M so it can be found by the
 /// runtime.
-GlobalVariable *createFatbinDesc(Module , ArrayRef Image) {
+GlobalVariable *createFatbinDesc(Module , ArrayRef Image, bool IsHIP) {
   LLVMContext  = M.getContext();
   llvm::Type *Int8PtrTy = Type::getInt8PtrTy(C);
   llvm::Triple Triple = llvm::Triple(M.getTargetTriple());
 
   // Create the global string containing the fatbinary.
   StringRef FatbinConstantSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
+  IsHIP ? ".hip_fatbin"
+: (Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin");
   auto *Data = ConstantDataArray::get(C, Image);
   auto *Fatbin = new GlobalVariable(M, Data->getType(), /*isConstant*/ true,
 GlobalVariable::InternalLinkage, Data,
@@ -303,10 +305,11 @@
   Fatbin->setSection(FatbinConstantSection);
 
   // Create the fatbinary wrapper
-  StringRef FatbinWrapperSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__fatbin" : ".nvFatBinSegment";
+  StringRef FatbinWrapperSection = IsHIP   ? ".hipFatBinSegment"
+   : Triple.isMacOSX() ? "__NV_CUDA,__fatbin"
+   : ".nvFatBinSegment";
   Constant *FatbinWrapper[] = {
-  ConstantInt::get(Type::getInt32Ty(C), CudaFatMagic),
+  ConstantInt::get(Type::getInt32Ty(C), IsHIP ? HIPFatMagic : CudaFatMagic),
   ConstantInt::get(Type::getInt32Ty(C), 1),
   ConstantExpr::getPointerBitCastOrAddrSpaceCast(Fatbin, Int8PtrTy),
   ConstantPointerNull::get(Type::getInt8PtrTy(C))};
@@ -328,9 +331,10 @@
   ConstantAggregateZero::get(ArrayType::get(getEntryTy(M), 0u));
   auto *DummyEntry = new GlobalVariable(
   M, DummyInit->getType(), true, GlobalVariable::ExternalLinkage, DummyInit,
-  "__dummy.cuda_offloading.entry");
-  DummyEntry->setSection("cuda_offloading_entries");
+  IsHIP ? "__dummy.hip_offloading.entry" : "__dummy.cuda_offloading.entry");
   DummyEntry->setVisibility(GlobalValue::HiddenVisibility);
+  DummyEntry->setSection(IsHIP ? "hip_offloading_entries"
+   : "cuda_offloading_entries");
 
   return FatbinDesc;
 }
@@ -358,7 +362,7 @@
 /// 0, entry->size, 0, 0);
 ///   }
 /// }
-Function *createRegisterGlobalsFunction(Module ) {
+Function *createRegisterGlobalsFunction(Module , bool IsHIP) {
   LLVMContext  = M.getContext();
   // Get the __cudaRegisterFunction function declaration.
   auto *RegFuncTy = FunctionType::get(
@@ -368,8 +372,8 @@
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C), Type::getInt32PtrTy(C)},
   /*isVarArg*/ false);
-  FunctionCallee RegFunc =
-  M.getOrInsertFunction("__cudaRegisterFunction", RegFuncTy);
+  FunctionCallee RegFunc = M.getOrInsertFunction(
+  IsHIP ? "__hipRegisterFunction" : "__cudaRegisterFunction", RegFuncTy);
 
   // Get the __cudaRegisterVar function declaration.
   auto *RegVarTy = FunctionType::get(
@@ -378,25 +382,31 @@
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), 

[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3642869 , @yaxunl wrote:

> In D128914#3642567 , @jhuber6 wrote:
>
>> In D128914#3642558 , 
>> @JonChesterfield wrote:
>>
>>> Code looks good to me. It's hard to be sure whether it works without 
>>> running a bunch of hip test cases through it, have you already done so? If 
>>> it doesn't work out of the box it should be close enough to fix up post 
>>> commit, e.g. when trying to move hip over to this by default.
>>
>> Thanks for the review, I ran a couple mini-apps with HIP versions (XSBench, 
>> RSBench, SU3Bench) using this method and they passed without issue. The only 
>> thing I was unsure about what whether or not the handle needed to be checked 
>> for null, because my testing suggested it's unnecessary. I was hoping one of 
>> the HIP developers would let me know. We can think about making this the 
>> default approach when I make the new driver work for `non-rdc` mode 
>> compilations.
>
> There is only one fatbin for -fgpu-rdc mode but the fatbin unregister 
> function is called multiple times in each TU. HIP runtime expects each fatbin 
> is unregistered only once. The old embedding scheme introduced a weak symbol 
> to track whether the fabin has been unregistered and to make sure it is only 
> unregistered once.

I see, this wrapping will only happen in RDC-mode so it's probably safe to 
ignore here? When I support non-RDC mode in the new driver it will most likely 
rely on the old code generation. Although it's entirely feasible to make 
RDC-mode the default. There's no runtime overhead when using LTO.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D128914#3642567 , @jhuber6 wrote:

> In D128914#3642558 , 
> @JonChesterfield wrote:
>
>> Code looks good to me. It's hard to be sure whether it works without running 
>> a bunch of hip test cases through it, have you already done so? If it 
>> doesn't work out of the box it should be close enough to fix up post commit, 
>> e.g. when trying to move hip over to this by default.
>
> Thanks for the review, I ran a couple mini-apps with HIP versions (XSBench, 
> RSBench, SU3Bench) using this method and they passed without issue. The only 
> thing I was unsure about what whether or not the handle needed to be checked 
> for null, because my testing suggested it's unnecessary. I was hoping one of 
> the HIP developers would let me know. We can think about making this the 
> default approach when I make the new driver work for `non-rdc` mode 
> compilations.

There is only one fatbin for -fgpu-rdc mode but the fatbin unregister function 
is called multiple times in each TU. HIP runtime expects each fatbin is 
unregistered only once. The old embedding scheme introduced a weak symbol to 
track whether the fabin has been unregistered and to make sure it is only 
unregistered once.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D128914#3642558 , @JonChesterfield 
wrote:

> Code looks good to me. It's hard to be sure whether it works without running 
> a bunch of hip test cases through it, have you already done so? If it doesn't 
> work out of the box it should be close enough to fix up post commit, e.g. 
> when trying to move hip over to this by default.

Thanks for the review, I ran a couple mini-apps with HIP versions (XSBench, 
RSBench, SU3Bench) using this method and they passed without issue. The only 
thing I was unsure about what whether or not the handle needed to be checked 
for null, because my testing suggested it's unnecessary. I was hoping one of 
the HIP developers would let me know. We can think about making this the 
default approach when I make the new driver work for `non-rdc` mode 
compilations.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-11 Thread Jon Chesterfield via Phabricator via cfe-commits
JonChesterfield accepted this revision.
JonChesterfield added a comment.
This revision is now accepted and ready to land.

Code looks good to me. It's hard to be sure whether it works without running a 
bunch of hip test cases through it, have you already done so? If it doesn't 
work out of the box it should be close enough to fix up post commit, e.g. when 
trying to move hip over to this by default.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-07 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

ping


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-07-05 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 updated this revision to Diff 442356.
jhuber6 added a comment.

Addressing some comments.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

Files:
  clang/test/Driver/linker-wrapper-image.c
  clang/test/Driver/linker-wrapper.c
  clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.h

Index: clang/tools/clang-linker-wrapper/OffloadWrapper.h
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.h
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.h
@@ -21,4 +21,8 @@
 /// registers the images with the CUDA runtime.
 llvm::Error wrapCudaBinary(llvm::Module , llvm::ArrayRef Images);
 
+/// Wraps the input bundled image into the module \p M as global symbols and
+/// registers the images with the HIP runtime.
+llvm::Error wrapHIPBinary(llvm::Module , llvm::ArrayRef Images);
+
 #endif
Index: clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
@@ -22,6 +22,7 @@
 namespace {
 /// Magic number that begins the section containing the CUDA fatbinary.
 constexpr unsigned CudaFatMagic = 0x466243b1;
+constexpr unsigned HIPFatMagic = 0x48495046;
 
 /// Copied from clang/CGCudaRuntime.h.
 enum OffloadEntryKindFlag : uint32_t {
@@ -288,14 +289,15 @@
 
 /// Embed the image \p Image into the module \p M so it can be found by the
 /// runtime.
-GlobalVariable *createFatbinDesc(Module , ArrayRef Image) {
+GlobalVariable *createFatbinDesc(Module , ArrayRef Image, bool IsHIP) {
   LLVMContext  = M.getContext();
   llvm::Type *Int8PtrTy = Type::getInt8PtrTy(C);
   llvm::Triple Triple = llvm::Triple(M.getTargetTriple());
 
   // Create the global string containing the fatbinary.
   StringRef FatbinConstantSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
+  IsHIP ? ".hip_fatbin"
+: (Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin");
   auto *Data = ConstantDataArray::get(C, Image);
   auto *Fatbin = new GlobalVariable(M, Data->getType(), /*isConstant*/ true,
 GlobalVariable::InternalLinkage, Data,
@@ -303,10 +305,11 @@
   Fatbin->setSection(FatbinConstantSection);
 
   // Create the fatbinary wrapper
-  StringRef FatbinWrapperSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__fatbin" : ".nvFatBinSegment";
+  StringRef FatbinWrapperSection = IsHIP   ? ".hipFatBinSegment"
+   : Triple.isMacOSX() ? "__NV_CUDA,__fatbin"
+   : ".nvFatBinSegment";
   Constant *FatbinWrapper[] = {
-  ConstantInt::get(Type::getInt32Ty(C), CudaFatMagic),
+  ConstantInt::get(Type::getInt32Ty(C), IsHIP ? HIPFatMagic : CudaFatMagic),
   ConstantInt::get(Type::getInt32Ty(C), 1),
   ConstantExpr::getPointerBitCastOrAddrSpaceCast(Fatbin, Int8PtrTy),
   ConstantPointerNull::get(Type::getInt8PtrTy(C))};
@@ -328,9 +331,10 @@
   ConstantAggregateZero::get(ArrayType::get(getEntryTy(M), 0u));
   auto *DummyEntry = new GlobalVariable(
   M, DummyInit->getType(), true, GlobalVariable::ExternalLinkage, DummyInit,
-  "__dummy.cuda_offloading.entry");
-  DummyEntry->setSection("cuda_offloading_entries");
+  IsHIP ? "__dummy.hip_offloading.entry" : "__dummy.cuda_offloading.entry");
   DummyEntry->setVisibility(GlobalValue::HiddenVisibility);
+  DummyEntry->setSection(IsHIP ? "hip_offloading_entries"
+   : "cuda_offloading_entries");
 
   return FatbinDesc;
 }
@@ -358,7 +362,7 @@
 /// 0, entry->size, 0, 0);
 ///   }
 /// }
-Function *createRegisterGlobalsFunction(Module ) {
+Function *createRegisterGlobalsFunction(Module , bool IsHIP) {
   LLVMContext  = M.getContext();
   // Get the __cudaRegisterFunction function declaration.
   auto *RegFuncTy = FunctionType::get(
@@ -368,8 +372,8 @@
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt8PtrTy(C),
Type::getInt8PtrTy(C), Type::getInt32PtrTy(C)},
   /*isVarArg*/ false);
-  FunctionCallee RegFunc =
-  M.getOrInsertFunction("__cudaRegisterFunction", RegFuncTy);
+  FunctionCallee RegFunc = M.getOrInsertFunction(
+  IsHIP ? "__hipRegisterFunction" : "__cudaRegisterFunction", RegFuncTy);
 
   // Get the __cudaRegisterVar function declaration.
   auto *RegVarTy = FunctionType::get(
@@ -378,25 +382,31 @@
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt32Ty(C),
getSizeTTy(M), Type::getInt32Ty(C), Type::getInt32Ty(C)},
   /*isVarArg*/ false);
-  FunctionCallee RegVar = M.getOrInsertFunction("__cudaRegisterVar", RegVarTy);
+  FunctionCallee RegVar = 

[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-06-30 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

Thanks for the comments.




Comment at: clang/test/Driver/linker-wrapper.c:109
 
 // RUN: clang-offload-packager -o %t-lib.out \
 // RUN:   
--image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
 \

tra wrote:
> Nit: This test case does not have any CHECK lines and could use a comment 
> describing what it's supposed to test. AFAICT it's intended to make sure that 
> no temporary files are left around, but I'm not 100% sure.
Yes, it ensures that the files extracted from the static library are not 
leftover as temp files, this was a problem previously that I fixed. I'll add a 
comment explaining that.



Comment at: clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp:620-622
+  CmdArgs.push_back("-bundle-align=4096");
+
+  SmallVector Targets = {"-targets=host-x86_64-unknown-linux"};

tra wrote:
> We probably do not want to hardcode the assumption that the host is x86_64 
> linux. 
> 
> Bundle alignment should probably also be target-dependent, but 4K is common 
> enough and is probably fine in practice.
This is exactly the way it is in the Clang source for HIP. HIP uses the 
`clang-offload-bundler` which expects a host file and host triple, ergo the 
dummy triple and input from `/dev/null`. This obviously isn't great, maybe in 
the future I'll be able to convince the AMD folks to use my format instead.



Comment at: clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp:1205
+bundleHIP(ArrayRef Images) {
+  SmallVector> Buffers;
+

tra wrote:
> I'd move it to the end where the buffer is actually used. 
Sure, I'll do that for the others as well.



Comment at: clang/tools/clang-linker-wrapper/OffloadWrapper.cpp:393
+ /*Initializer*/ nullptr,
+ IsHIP ? "__start_hip_offloading_entries"
+   : "__start_cuda_offloading_entries");

tra wrote:
> We should probably have a helper function returning properly prefixed name, 
> similar to what we do in clang:
> https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGCUDANV.cpp#L184
I had that thought, but unless I wanted to use regular expressions it would be 
a little weird since there's many different types here, e.g. `__cuda`, `.cuda` 
and `_cuda`. I figured it was easier to just make two strings rather than carry 
around three  different functions to handle these cases, or introduce some 
weird regex.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-06-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

Syntax/style looks OK to me with a few nits.




Comment at: clang/test/Driver/linker-wrapper.c:109
 
 // RUN: clang-offload-packager -o %t-lib.out \
 // RUN:   
--image=file=%S/Inputs/dummy-elf.o,kind=openmp,triple=nvptx64-nvidia-cuda,arch=sm_70
 \

Nit: This test case does not have any CHECK lines and could use a comment 
describing what it's supposed to test. AFAICT it's intended to make sure that 
no temporary files are left around, but I'm not 100% sure.



Comment at: clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp:620-622
+  CmdArgs.push_back("-bundle-align=4096");
+
+  SmallVector Targets = {"-targets=host-x86_64-unknown-linux"};

We probably do not want to hardcode the assumption that the host is x86_64 
linux. 

Bundle alignment should probably also be target-dependent, but 4K is common 
enough and is probably fine in practice.



Comment at: clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp:1205
+bundleHIP(ArrayRef Images) {
+  SmallVector> Buffers;
+

I'd move it to the end where the buffer is actually used. 



Comment at: clang/tools/clang-linker-wrapper/OffloadWrapper.cpp:393
+ /*Initializer*/ nullptr,
+ IsHIP ? "__start_hip_offloading_entries"
+   : "__start_cuda_offloading_entries");

We should probably have a helper function returning properly prefixed name, 
similar to what we do in clang:
https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CGCUDANV.cpp#L184


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128914/new/

https://reviews.llvm.org/D128914

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D128914: [HIP] Add support for handling HIP in the linker wrapper

2022-06-30 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 created this revision.
jhuber6 added reviewers: jdoerfert, JonChesterfield, yaxunl, tra.
Herald added a project: All.
jhuber6 requested review of this revision.
Herald added subscribers: cfe-commits, sstefan1.
Herald added a project: clang.

This patch adds the necessary changes required to bundle and wrap HIP
files. The bundling is done using `clang-offload-bundler` currently to
mimic `fatbinary` and the wrapping is done using very similar runtime
calls to CUDA. This still does not support managed / surface / texture
variables, that would require some additional information in the entry.

One difference in the codegeneration with AMD is that I don't check if
the handle is null before destructing it, I'm not sure if that's
required.

With this we should be able to support HIP with the new driver.

Depends on D128850 


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D128914

Files:
  clang/test/Driver/linker-wrapper-image.c
  clang/test/Driver/linker-wrapper.c
  clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
  clang/tools/clang-linker-wrapper/OffloadWrapper.h

Index: clang/tools/clang-linker-wrapper/OffloadWrapper.h
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.h
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.h
@@ -21,4 +21,8 @@
 /// registers the images with the CUDA runtime.
 llvm::Error wrapCudaBinary(llvm::Module , llvm::ArrayRef Images);
 
+/// Wraps the input bundled image into the module \p M as global symbols and
+/// registers the images with the HIP runtime.
+llvm::Error wrapHIPBinary(llvm::Module , llvm::ArrayRef Images);
+
 #endif
Index: clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
===
--- clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
+++ clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
@@ -22,6 +22,7 @@
 namespace {
 /// Magic number that begins the section containing the CUDA fatbinary.
 constexpr unsigned CudaFatMagic = 0x466243b1;
+constexpr unsigned HIPFatMagic = 0x48495046;
 
 /// Copied from clang/CGCudaRuntime.h.
 enum OffloadEntryKindFlag : uint32_t {
@@ -288,14 +289,15 @@
 
 /// Embed the image \p Image into the module \p M so it can be found by the
 /// runtime.
-GlobalVariable *createFatbinDesc(Module , ArrayRef Image) {
+GlobalVariable *createFatbinDesc(Module , ArrayRef Image, bool IsHIP) {
   LLVMContext  = M.getContext();
   llvm::Type *Int8PtrTy = Type::getInt8PtrTy(C);
   llvm::Triple Triple = llvm::Triple(M.getTargetTriple());
 
   // Create the global string containing the fatbinary.
   StringRef FatbinConstantSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
+  IsHIP ? ".hip_fatbin"
+: (Triple.isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin");
   auto *Data = ConstantDataArray::get(C, Image);
   auto *Fatbin = new GlobalVariable(M, Data->getType(), /*isConstant*/ true,
 GlobalVariable::InternalLinkage, Data,
@@ -303,10 +305,11 @@
   Fatbin->setSection(FatbinConstantSection);
 
   // Create the fatbinary wrapper
-  StringRef FatbinWrapperSection =
-  Triple.isMacOSX() ? "__NV_CUDA,__fatbin" : ".nvFatBinSegment";
+  StringRef FatbinWrapperSection = IsHIP   ? ".hipFatBinSegment"
+   : Triple.isMacOSX() ? "__NV_CUDA,__fatbin"
+   : ".nvFatBinSegment";
   Constant *FatbinWrapper[] = {
-  ConstantInt::get(Type::getInt32Ty(C), CudaFatMagic),
+  ConstantInt::get(Type::getInt32Ty(C), IsHIP ? HIPFatMagic : CudaFatMagic),
   ConstantInt::get(Type::getInt32Ty(C), 1),
   ConstantExpr::getPointerBitCastOrAddrSpaceCast(Fatbin, Int8PtrTy),
   ConstantPointerNull::get(Type::getInt8PtrTy(C))};
@@ -328,9 +331,10 @@
   ConstantAggregateZero::get(ArrayType::get(getEntryTy(M), 0u));
   auto *DummyEntry = new GlobalVariable(
   M, DummyInit->getType(), true, GlobalVariable::ExternalLinkage, DummyInit,
-  "__dummy.cuda_offloading.entry");
-  DummyEntry->setSection("cuda_offloading_entries");
+  IsHIP ? "__dummy.hip_offloading.entry" : "__dummy.cuda_offloading.entry");
   DummyEntry->setVisibility(GlobalValue::HiddenVisibility);
+  DummyEntry->setSection(IsHIP ? "hip_offloading_entries"
+   : "cuda_offloading_entries");
 
   return FatbinDesc;
 }
@@ -358,7 +362,7 @@
 /// 0, entry->size, 0, 0);
 ///   }
 /// }
-Function *createRegisterGlobalsFunction(Module ) {
+Function *createRegisterGlobalsFunction(Module , bool IsHIP) {
   LLVMContext  = M.getContext();
   // Get the __cudaRegisterFunction function declaration.
   auto *RegFuncTy = FunctionType::get(
@@ -368,8 +372,8 @@
Type::getInt8PtrTy(C), Type::getInt8PtrTy(C), Type::getInt8PtrTy(C),