[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-02-07 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 closed 
https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-02-07 Thread Yaxun Liu via cfe-commits

https://github.com/yxsamliu approved this pull request.

LGTM. Thanks

https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-02-07 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

ping

https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

> > the idea is that it would be the desired effect if someone went out of 
> > their way to do this GPU subset linking thing.
> 
> That would only be true when someone owns the whole build. That will not be 
> the case in practice. A large enough project is usually a bunch of libraries 
> created by different teams and vendors. They may or may not be built together 
> and how a particular library is built is often controlled by its owner and 
> may not be visible to the end user. The owners may consider switching to 
> device linking to be benign or irrelevant to the end users, but it will be 
> observable by those upstream users.

To me, if the person owning the library chooses to do this in their build 
system it would speak to an intentional decision not to leak the GPU parts of 
their code. with shipped via a static library. More or less, this option would 
give people a way to ship static libraries with the same GPU / Offloading 
semantics as a shared library.

Another advantage of this option is that it allows someone to build an 
application with GPU offloading using `clang` and then ship it as a static 
library that can be linked with GCC or some other compiler so long as the user 
has the appropriate runtime.

> Being aware of the quirks introduced by device linking will be required for 
> the owners of those libraries. You do know how it all works under the hood. 
> Pretty much nobody else on the planet does. :-)
>

Very true, I need to remind myself that I'm probably the only one who knows how 
this thing works in detail.

> Anyways. I think we're in agreement that we do need to document possible 
> implications. clang-linker-wrapper docs would do.

Will do.

https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Artem Belevich via cfe-commits

Artem-B wrote:

> the idea is that it would be the desired effect if someone went out of their 
> way to do this GPU subset linking thing.

That would only be true when someone owns the whole build. That will not be the 
case in practice. A large enough project is usually a bunch of libraries 
created by different teams and vendors. They may or may not be built together 
and how a particular library is built is often controlled by its owner and may 
not be visible to the end user. The owners may consider switching to device 
linking to be benign or irrelevant to the end users, but it will be observable 
by those upstream users.

Being aware of the quirks introduced by device linking will be required for the 
owners of those libraries. You do know how it all works under the hood. Pretty 
much nobody else on the planet does. :-)

Anyways. I think we're in agreement that we do need to document possible 
implications.  clang-linker-wrapper docs would do. 




https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

> > I'm assuming you're talking about GPU-side constructors? I don't think the 
> > CUDA runtime supports those, but OpenMP runs them when the image is loaded, 
> > so it would handle both independantly.
> 
> Yes. I'm thinking of the expectations from a C++ user standpoint, and this is 
> one of the areas where there will be observable differences. First, because 
> there will be subsets of the code that are no longer part of the main 
> GPU-side executable. Second, the side effects of the initializers will be 
> different depending on whether we do link such subsets separately or not. 
> E.g. the initializer call order will change. The global state changes in one 
> subset will not be visible in the other. Weak symbol resolution will produce 
> different results. Etc.

It'll definitely have an effect different from full linking, but the idea is 
that it would be the desired effect if someone went out of their way to do this 
GPU subset linking thing.
> 
> > The idea is that users already get C++-like behavior with the new driver 
> > and -fgpu-rdc generally
> 
> Yes. That will set the default expectations that things work just like in 
> C++, which is a great thing. But introduction of partial subset linking will 
> break some of those "just works" assumptions and it may be triggered by the 
> parts of the build outside of user's control (e.g. by a third-party library).

This was one of the things I was wondering about, since we could alternatively 
make a new flag for this outside of `-r` so it's explicit. Right now I just 
kind of assumed that passing `-r` through the offloading toolchain (via CUDA or 
whatever) was somewhat explicit enough, as if regular `-r` behaviour is desired 
they could just use `clang` or `ld` normally.

> 
> Side note: we do need a good term for this kind of subset linking. "partial 
> linking" already has established meaning and it's not a good fit here as we 
> actually produce a fully linked GPU executable.
> 

Yeah, coming up with a name is difficult. You could just call it device 
linking, since it's more or less just doing the device link step ahead of time 
instead of passing it to when we make the final executable.

> We do need to document how it works. Documenting what does not work, or works 
> differently is also important, IMO. We _do_ need to worry about users and 
> their expectations.

Yes, I should probably update this with some documentation. I'm not sure where 
it would go however, maybe just in the `clang-linker-wrapper`'s page.

https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Artem Belevich via cfe-commits

Artem-B wrote:

> I'm assuming you're talking about GPU-side constructors? I don't think the 
> CUDA runtime supports those, but OpenMP runs them when the image is loaded, 
> so it would handle both independantly.

Yes. I'm thinking of the expectations from a C++ user standpoint, and this is 
one of the areas where there will be observable differences. First, because 
there will be subsets of the code that are no longer part of the main GPU-side 
executable. Second, the side effects of the initializers will be different 
depending on whether we do link such subsets separately or not. E.g. the 
initializer call order will change. The global state changes in one subset will 
not be visible in the other. Weak symbol resolution will produce different 
results. Etc.

> The idea is that users already get C++-like behavior with the new driver and 
> -fgpu-rdc generally

Yes. That will set the default expectations that things work just like in C++, 
which is a great thing. But introduction of partial subset linking will break 
some of those "just works" assumptions and it may be triggered by the parts of 
the build  outside of user's control (e.g. by a third-party library). 

Side note: we do need a good term for this kind of subset linking. "partial 
linking" already has established meaning and it's not a good fit here as we 
actually produce a fully linked GPU executable.

> we don't need to worry about people being confused so long as we document 
> what it does.

We do need to document how it works. Documenting what does not work, or works 
differently is also important, IMO. 
We *do* need to worry about users and their expectations. 




https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 updated 
https://github.com/llvm/llvm-project/pull/80066

>From af382e03e41ef679c35a6126a1b131a7a8a28360 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Tue, 30 Jan 2024 15:34:22 -0600
Subject: [PATCH 1/3] [LinkerWrapper] Support relocatable linking for
 offloading

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a sepcial section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver 
-fgpu-rdc
$ ./a.out
```

One question is whether or not this should be the default behaviour of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.
---
 clang/test/Driver/linker-wrapper-image.c  |  8 ++
 clang/test/Driver/linker-wrapper.c| 32 +++-
 .../ClangLinkerWrapper.cpp| 78 +--
 .../llvm/Frontend/Offloading/OffloadWrapper.h |  4 +-
 .../Frontend/Offloading/OffloadWrapper.cpp| 11 ++-
 llvm/lib/Object/OffloadBinary.cpp |  2 +-
 6 files changed, 119 insertions(+), 16 deletions(-)

diff --git a/clang/test/Driver/linker-wrapper-image.c 
b/clang/test/Driver/linker-wrapper-image.c
index b5d8ae217a972..08f860f6cab0d 100644
--- a/clang/test/Driver/linker-wrapper-image.c
+++ b/clang/test/Driver/linker-wrapper-image.c
@@ -9,6 +9,8 @@
 // RUN:   -fembed-offload-object=%t.out
 // RUN: clang-linker-wrapper --print-wrapped-module --dry-run 
--host-triple=x86_64-unknown-linux-gnu \
 // RUN:   --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s 
--check-prefixes=OPENMP,OPENMP-ELF
+// RUN: clang-linker-wrapper --print-wrapped-module --dry-run -r 
--host-triple=x86_64-unknown-linux-gnu \
+// RUN:   --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s 
--check-prefixes=OPENMP-ELF,OPENMP-REL
 // RUN: clang-linker-wrapper --print-wrapped-module --dry-run 
--host-triple=x86_64-unknown-windows-gnu \
 // RUN:   --linker-path=/usr/bin/ld -- %t.o -o a.out 2>&1 | FileCheck %s 
--check-prefixes=OPENMP,OPENMP-COFF
 
@@ -19,6 +21,8 @@
 //  OPENMP-COFF: @__start_omp_offloading_entries = weak_odr hidden 
constant [0 x %struct.__tgt_offload_entry] zeroinitializer, section 
"omp_offloading_entries$OA"
 // OPENMP-COFF-NEXT: @__stop_omp_offloading_entries = weak_odr hidden constant 
[0 x %struct.__tgt_offload_entry] zeroinitializer, section 
"omp_offloading_entries$OZ"
 
+// OPENMP-REL: @.omp_offloading.device_image = internal unnamed_addr constant 
[[[SIZE:[0-9]+]] x i8] c"\10\FF\10\AD{{.*}}", section 
".llvm.offloading.relocatable", align 8
+
 //  OPENMP: @.omp_offloading.device_image = internal unnamed_addr constant 
[[[SIZE:[0-9]+]] x i8] c"\10\FF\10\AD{{.*}}", section ".llvm.offloading", align 
8
 // OPENMP-NEXT: @.omp_offloading.device_images = internal unnamed_addr 
constant [1 x %__tgt_device_image] [%__tgt_device_image { ptr getelementptr 
inbounds ([[[BEGIN:[0-9]+]] x i8], ptr @.omp_offloading.device_image, i64 1, 
i64 0), ptr getelementptr inbounds ([[[END:[0-9]+]] x i8], ptr 
@.omp_offloading.device_image, i64 1, i64 0), ptr 

[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread via cfe-commits

github-actions[bot] wrote:




:warning: C/C++ code formatter, clang-format found issues in your code. 
:warning:



You can test this locally with the following command:


``bash
git-clang-format --diff 7155c1ef658b66132f15bf1406e84e68eed3358f 
1a91a8a1a5bd0a0b6d47dd0e50e801a820400203 -- 
clang/test/Driver/linker-wrapper-image.c clang/test/Driver/linker-wrapper.c 
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp 
llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h 
llvm/lib/Frontend/Offloading/OffloadWrapper.cpp 
llvm/lib/Object/OffloadBinary.cpp
``





View the diff from clang-format here.


``diff
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp 
b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index 53a1bd9960..03b5471862 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -333,7 +333,7 @@ Error runLinker(ArrayRef Files, const ArgList 
) {
   if (Args.hasArg(OPT_relocatable))
 return relocateOffloadSection(Args, ExecutableName)
 
-  return Error::success();
+return Error::success();
 }
 
 void printVersion(raw_ostream ) {

``




https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-31 Thread Artem Belevich via cfe-commits


@@ -265,6 +329,11 @@ Error runLinker(ArrayRef Files, const ArgList 
) {
 LinkerArgs.push_back(Arg);
   if (Error Err = executeCommands(LinkerPath, LinkerArgs))
 return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+if (Error Err = relocateOffloadSection(Args, ExecutableName))

Artem-B wrote:

We could just `return relocateOffloadSection(Args, ExecutableName)`

https://github.com/llvm/llvm-project/pull/80066
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-30 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)


Changes

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a sepcial section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver 
-fgpu-rdc
$ ./a.out
```

One question is whether or not this should be the default behaviour of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.


---
Full diff: https://github.com/llvm/llvm-project/pull/80066.diff


5 Files Affected:

- (modified) clang/test/Driver/linker-wrapper.c (+2-1) 
- (modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7) 
- (modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1) 
- (modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4) 
- (modified) llvm/lib/Object/OffloadBinary.cpp (+1-1) 


``diff
diff --git a/clang/test/Driver/linker-wrapper.c 
b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a 
--no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
 
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img 
--target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img 
--target=x86_64-unknown-linux-gnu
 // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp 
b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected findProgram(StringRef Name, 
ArrayRef Paths) {
   return *Path;
 }
 
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+  llvm::MD5 Hasher;
+  llvm::MD5::MD5Result Hash;
+  Hasher.update(Str);
+  Hasher.final(Hash);
+  return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList , StringRef Output) {
+  Expected ObjcopyPath =
+  findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+  if (!ObjcopyPath)
+return ObjcopyPath.takeError();
+
+  // Use the linker output file to get a unique hash. This creates a unique
+  // identifier to rename the sections to that is deterministic to the 
contents.
+  auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+: MemoryBuffer::getFileOrSTDIN(Output);
+  if (!BufferOrErr)
+return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+ 

[clang] [llvm] [LinkerWrapper] Support relocatable linking for offloading (PR #80066)

2024-01-30 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 created 
https://github.com/llvm/llvm-project/pull/80066

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a sepcial section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver 
-fgpu-rdc
$ ./a.out
```

One question is whether or not this should be the default behaviour of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.


>From 2172b4eda521025b779d7457ed762e3748f66db8 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Tue, 30 Jan 2024 15:34:22 -0600
Subject: [PATCH] [LinkerWrapper] Support relocatable linking for offloading

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a sepcial section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver 
-fgpu-rdc
$ ./a.out
```

One question is whether or not this should be the default behaviour of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.
---
 clang/test/Driver/linker-wrapper.c|  3 +-
 .../ClangLinkerWrapper.cpp| 71