[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/jhuber6 closed https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/yxsamliu approved this pull request. LGTM. Thanks https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
jhuber6 wrote: ping https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
jhuber6 wrote: > > the idea is that it would be the desired effect if someone went out of > > their way to do this GPU subset linking thing. > > That would only be true when someone owns the whole build. That will not be > the case in practice. A large enough project is usually a bunch of libraries > created by different teams and vendors. They may or may not be built together > and how a particular library is built is often controlled by its owner and > may not be visible to the end user. The owners may consider switching to > device linking to be benign or irrelevant to the end users, but it will be > observable by those upstream users. To me, if the person owning the library chooses to do this in their build system it would speak to an intentional decision not to leak the GPU parts of their code. with shipped via a static library. More or less, this option would give people a way to ship static libraries with the same GPU / Offloading semantics as a shared library. Another advantage of this option is that it allows someone to build an application with GPU offloading using `clang` and then ship it as a static library that can be linked with GCC or some other compiler so long as the user has the appropriate runtime. > Being aware of the quirks introduced by device linking will be required for > the owners of those libraries. You do know how it all works under the hood. > Pretty much nobody else on the planet does. :-) > Very true, I need to remind myself that I'm probably the only one who knows how this thing works in detail. > Anyways. I think we're in agreement that we do need to document possible > implications. clang-linker-wrapper docs would do. Will do. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: > the idea is that it would be the desired effect if someone went out of their > way to do this GPU subset linking thing. That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users. Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-) Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
jhuber6 wrote: > > I'm assuming you're talking about GPU-side constructors? I don't think the > > CUDA runtime supports those, but OpenMP runs them when the image is loaded, > > so it would handle both independantly. > > Yes. I'm thinking of the expectations from a C++ user standpoint, and this is > one of the areas where there will be observable differences. First, because > there will be subsets of the code that are no longer part of the main > GPU-side executable. Second, the side effects of the initializers will be > different depending on whether we do link such subsets separately or not. > E.g. the initializer call order will change. The global state changes in one > subset will not be visible in the other. Weak symbol resolution will produce > different results. Etc. It'll definitely have an effect different from full linking, but the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing. > > > The idea is that users already get C++-like behavior with the new driver > > and -fgpu-rdc generally > > Yes. That will set the default expectations that things work just like in > C++, which is a great thing. But introduction of partial subset linking will > break some of those "just works" assumptions and it may be triggered by the > parts of the build outside of user's control (e.g. by a third-party library). This was one of the things I was wondering about, since we could alternatively make a new flag for this outside of `-r` so it's explicit. Right now I just kind of assumed that passing `-r` through the offloading toolchain (via CUDA or whatever) was somewhat explicit enough, as if regular `-r` behaviour is desired they could just use `clang` or `ld` normally. > > Side note: we do need a good term for this kind of subset linking. "partial > linking" already has established meaning and it's not a good fit here as we > actually produce a fully linked GPU executable. > Yeah, coming up with a name is difficult. You could just call it device linking, since it's more or less just doing the device link step ahead of time instead of passing it to when we make the final executable. > We do need to document how it works. Documenting what does not work, or works > differently is also important, IMO. We _do_ need to worry about users and > their expectations. Yes, I should probably update this with some documentation. I'm not sure where it would go however, maybe just in the `clang-linker-wrapper`'s page. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
Artem-B wrote: > I'm assuming you're talking about GPU-side constructors? I don't think the > CUDA runtime supports those, but OpenMP runs them when the image is loaded, > so it would handle both independantly. Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc. > The idea is that users already get C++-like behavior with the new driver and > -fgpu-rdc generally Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library). Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable. > we don't need to worry about people being confused so long as we document > what it does. We do need to document how it works. Documenting what does not work, or works differently is also important, IMO. We *do* need to worry about users and their expectations. https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/jhuber6 updated https://github.com/llvm/llvm-project/pull/80066 >From af382e03e41ef679c35a6126a1b131a7a8a28360 Mon Sep 17 00:00:00 2001 From: Joseph Huber Date: Tue, 30 Jan 2024 15:34:22 -0600 Subject: [PATCH 1/3] [LinkerWrapper] Support relocatable linking for offloading Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct. --- clang/test/Driver/linker-wrapper-image.c | 8 ++ clang/test/Driver/linker-wrapper.c| 32 +++- .../ClangLinkerWrapper.cpp| 78 +-- .../llvm/Frontend/Offloading/OffloadWrapper.h | 4 +- .../Frontend/Offloading/OffloadWrapper.cpp| 11 ++- llvm/lib/Object/OffloadBinary.cpp | 2 +- 6 files changed, 119 insertions(+), 16 deletions(-) diff --git a/clang/test/Driver/linker-wrapper-image.c b/clang/test/Driver/linker-wrapper-image.c index b5d8ae217a972..08f860f6cab0d 100644 --- a/clang/test/Driver/linker-wrapper-image.c +++ b/clang/test/Driver/linker-wrapper-image.c @@ -9,6 +9,8 @@ // RUN: -fembed-offload-object=%t.out // RUN: clang-linker-wrapper --print-wrapped-module --dry-run --host-triple=x86_64-unknown-linux-gnu \ // RUN: --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefixes=OPENMP,OPENMP-ELF +// RUN: clang-linker-wrapper --print-wrapped-module --dry-run -r --host-triple=x86_64-unknown-linux-gnu \ +// RUN: --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefixes=OPENMP-ELF,OPENMP-REL // RUN: clang-linker-wrapper --print-wrapped-module --dry-run --host-triple=x86_64-unknown-windows-gnu \ // RUN: --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s --check-prefixes=OPENMP,OPENMP-COFF @@ -19,6 +21,8 @@ // OPENMP-COFF: @__start_omp_offloading_entries = weak_odr hidden constant [0 x %struct.__tgt_offload_entry] zeroinitializer, section "omp_offloading_entries$OA" // OPENMP-COFF-NEXT: @__stop_omp_offloading_entries = weak_odr hidden constant [0 x %struct.__tgt_offload_entry] zeroinitializer, section "omp_offloading_entries$OZ" +// OPENMP-REL: @.omp_offloading.device_image = internal unnamed_addr constant [[[SIZE:[0-9]+]] x i8] c"\10\FF\10\AD{{.*}}", section ".llvm.offloading.relocatable", align 8 + // OPENMP: @.omp_offloading.device_image = internal unnamed_addr constant [[[SIZE:[0-9]+]] x i8] c"\10\FF\10\AD{{.*}}", section ".llvm.offloading", align 8 // OPENMP-NEXT: @.omp_offloading.device_images = internal unnamed_addr constant [1 x %__tgt_device_image] [%__tgt_device_image { ptr getelementptr inbounds ([[[BEGIN:[0-9]+]] x i8], ptr @.omp_offloading.device_image, i64 1, i64 0), ptr getelementptr inbounds ([[[END:[0-9]+]] x i8], ptr @.omp_offloading.device_image, i64 1, i64 0), ptr
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
github-actions[bot] wrote: :warning: C/C++ code formatter, clang-format found issues in your code. :warning: You can test this locally with the following command: ``bash git-clang-format --diff 7155c1ef658b66132f15bf1406e84e68eed3358f 1a91a8a1a5bd0a0b6d47dd0e50e801a820400203 -- clang/test/Driver/linker-wrapper-image.c clang/test/Driver/linker-wrapper.c clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h llvm/lib/Frontend/Offloading/OffloadWrapper.cpp llvm/lib/Object/OffloadBinary.cpp `` View the diff from clang-format here. ``diff diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp index 53a1bd9960..03b5471862 100644 --- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp +++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp @@ -333,7 +333,7 @@ Error runLinker(ArrayRef Files, const ArgList ) { if (Args.hasArg(OPT_relocatable)) return relocateOffloadSection(Args, ExecutableName) - return Error::success(); +return Error::success(); } void printVersion(raw_ostream ) { `` https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
@@ -265,6 +329,11 @@ Error runLinker(ArrayRef Files, const ArgList ) { LinkerArgs.push_back(Arg); if (Error Err = executeCommands(LinkerPath, LinkerArgs)) return Err; + + if (Args.hasArg(OPT_relocatable)) +if (Error Err = relocateOffloadSection(Args, ExecutableName)) Artem-B wrote: We could just `return relocateOffloadSection(Args, ExecutableName)` https://github.com/llvm/llvm-project/pull/80066 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
llvmbot wrote: @llvm/pr-subscribers-clang Author: Joseph Huber (jhuber6) Changes Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct. --- Full diff: https://github.com/llvm/llvm-project/pull/80066.diff 5 Files Affected: - (modified) clang/test/Driver/linker-wrapper.c (+2-1) - (modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7) - (modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1) - (modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4) - (modified) llvm/lib/Object/OffloadBinary.cpp (+1-1) ``diff diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c index a8667c99977c5..21898faf295d4 100644 --- a/clang/test/Driver/linker-wrapper.c +++ b/clang/test/Driver/linker-wrapper.c @@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x; // RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \ // RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK -// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu +// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r +// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp index b682cc293d54b..53412eb2346de 100644 --- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp +++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp @@ -241,6 +241,63 @@ Expected findProgram(StringRef Name, ArrayRef Paths) { return *Path; } +/// Returns the hashed value for a constant string. +std::string getHash(StringRef Str) { + llvm::MD5 Hasher; + llvm::MD5::MD5Result Hash; + Hasher.update(Str); + Hasher.final(Hash); + return llvm::utohexstr(Hash.low(), /*LowerCase=*/true); +} + +/// Renames offloading entry sections in a relocatable link so they do not +/// conflict with a later link job. +Error relocateOffloadSection(const ArgList , StringRef Output) { + Expected ObjcopyPath = + findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")}); + if (!ObjcopyPath) +return ObjcopyPath.takeError(); + + // Use the linker output file to get a unique hash. This creates a unique + // identifier to rename the sections to that is deterministic to the contents. + auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("") +: MemoryBuffer::getFileOrSTDIN(Output); + if (!BufferOrErr) +return createStringError(inconvertibleErrorCode(), "Failed to open %s", +
[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)
https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/80066 Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct. >From 2172b4eda521025b779d7457ed762e3748f66db8 Mon Sep 17 00:00:00 2001 From: Joseph Huber Date: Tue, 30 Jan 2024 15:34:22 -0600 Subject: [PATCH] [LinkerWrapper] Support relocatable linking for offloading Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct. --- clang/test/Driver/linker-wrapper.c| 3 +- .../ClangLinkerWrapper.cpp| 71