tra added a comment.

I'm still itching to figure out a way to avoid CUID altogether and with the new 
driver it may be possible.
CUID serves two purposes:
a) avoid name conflicts during device-side linking ("must be globally unique" 
part)
b) allow host to refer to something in the GPU executable ("stable within TU" 
part)

My understanding that we already collect the data about all offloading entities 
and that include those we have to externalize. We also postpone generation of 
the registration glue to the final linking step.

Let's suppose that we do not externalize those normally-internal symbols. The 
offloading table would still have entries for them, but there will be no issue 
with name conflicts during linking, as they do remain internal.
During the final linking, if an an offloading entity uses a pointer w/o a 
public symbol, we would be in position to generate a unique one, using the 
pointer value in the offload table entry. Linker can just use a free-running 
counter for the suffix, or could just generate a completely new symbol. It does 
not matter.
When we generate the host-side registration glue, we'll use the name of that 
generated symbol.

In the end linking will work exactly as it would for C++ (modulo having 
offloading tables) and host/device registration will be ensured by telling host 
side which symbols to use, instead of assuming that we've happened to generate 
exactly the same unique suffix on both sides.

@yaxunl -- do you see any holes in this approach?



================
Comment at: clang/lib/CodeGen/CodeGenModule.cpp:6836
+
+  // If the CUID is not specified we try to generate a unique postfix.
+  if (getLangOpts().CUID.empty()) {
----------------
jhuber6 wrote:
> jhuber6 wrote:
> > tra wrote:
> > > jhuber6 wrote:
> > > > tra wrote:
> > > > > > However, [CUID] is not always availible. 
> > > > > 
> > > > > The question is -- when and why is it not available? I'm getting the 
> > > > > feeling that we're fixing the consequence here, not the root cause.
> > > > > 
> > > > > Is there a reason we can't make sure that the driver always generates 
> > > > > a cuid for offload subcompilations and error out if it's needed but 
> > > > > is not provided?
> > > > > That would make this fallback unnecessary and would be a more robust 
> > > > > approach in general.
> > > > > 
> > > > So, I'm more in favor of this approach because it doesn't require extra 
> > > > intervention from the compiler driver, this makes it less convoluted to 
> > > > do split compilation since we don't have an extra arguments. The way I 
> > > > would prefer it, is that we do this implicitly by default without 
> > > > requiring extra thought from the driver, but if it's not good enough we 
> > > > can support the manual `CUID` approach to let the user override it. I 
> > > > think this is a cleaner implementation, and is mostly coming from my 
> > > > support for CUDA in the new driver which currently doesn't implement 
> > > > the CUID as we do with the old driver. Generally I'd prefer things to 
> > > > behave independent of the driver, so we can consider host and device 
> > > > compilation more separately.
> > > > So, I'm more in favor of this approach because it doesn't require extra 
> > > > intervention from the compiler driver
> > > 
> > > We need the driver intervention for any cc1 compilations anyways, so this 
> > > does not buy us anything.  While you can run a sub-compilation manually 
> > > with handcrafted cc1 flags, that's not a practical use case. The driver 
> > > is the ultimate source of cc1 flags.
> > > 
> > > > this makes it less convoluted to do split compilation since we don't 
> > > > have an extra arguments.
> > > 
> > > For CUDA/HIP sub-compilation should be done with clang 
> > > --cuda-host-only/--cuda-device-only.  Whether the driver supplies yet 
> > > another cc1 option, --cuid=... makes no difference to the user launching 
> > > such sub-compilation. 
> > > 
> > > > The way I would prefer it, is that we do this implicitly by default 
> > > > without requiring extra thought from the driver, but if it's not good 
> > > > enough we can support the manual CUID approach to let the user override 
> > > > it.
> > > 
> > > I agree that we can come up with something that will almost always work. 
> > > Possibly even good enough for all practical purposes. However, if a 
> > > better solution would take comparable effort, it would make sense to do 
> > > things right and avoid adding technical debt. 
> > > 
> > > On the other hand, requiring the driver to supply identical cuid to all 
> > > sub-compilations appears to be a better approach to me:
> > > * Driver is the best place to do it, functionally. Driver has access to 
> > > all user-provided inputs and is in position to guarantee that all 
> > > subcompilations get the same cuid.
> > > * Calculating CUID in the driver keeps relevant logic in one place. Doing 
> > > it in the driver *and* in the codegen 
> > > * Figuring out what inputs are relevant for calculation of CUID in cc1 
> > > invocation is error prone. E.g. we have to guess which cc1 options are 
> > > relevant or not and is the driver would pass a macro to one 
> > > subcompilation but not to another, we would end up generating mismatching 
> > > CUID and would not have any way to notice that. Even when that's not the 
> > > case, we would need to guess which flags, supplied by the driver, are 
> > > relevant. At CC1 level that may be somewhat complicated as top-level 
> > > options may expand to quite a few more cc1 options. E.g. we'll need to 
> > > take into account `-std=...`, `--cuda-path=`, `-include ...`, `-I` (and 
> > > other include paths)... All of that does not belong to the codegen.
> > > 
> > > The driver is already doing CUID computation, so I do not see any 
> > > downsides to just letting it do its job, and I do believe it will be a 
> > > better, and likely less complicated, solution.
> > > 
> > > > ... mostly coming from my support for CUDA in the new driver which 
> > > > currently doesn't implement the CUID as we do with the old driver
> > > 
> > > Right. That appears to be the key missing piece.
> > > 
> > > What are the obstacles for having CUID calculation done in the new 
> > > driver. It should have all the info it needs. What am I missing?
> > > 
> > > For CUDA/HIP sub-compilation should be done with clang 
> > > --cuda-host-only/--cuda-device-only. Whether the driver supplies yet 
> > > another cc1 option, --cuid=... makes no difference to the user launching 
> > > such sub-compilation.
> > The problem I have with this is that we use the command line to generate 
> > the value, so they aren't going to be the same without the user manually 
> > specifying it. I guess we could filter out only "relevant" command line 
> > flags, maybe that's an option. I just think it's not intuitive for a name 
> > mangling scheme to depend on something external, but there's definitely 
> > advantages to doing it that way.
> > 
> > I can see your point for the Driver handling this stuff. Now that I'm 
> > thinking about it I don't think looking at the macros or the other 
> > arguments is a sound solution in the first place. Even without that it 
> > would work for almost all the same cases just using the file's unique ID. 
> > Without that, this solution is guaranteed not to conflict with any other 
> > file on the same file system at the time of compilation. This, as we 
> > discussed, potentially fails for non-static source trees and compiling the 
> > same file twice and linking it. The current CUID implementation fails on 
> > the former, this method fails on both.
> > 
> > If the CUID didn't exist, the way I would have implemented it would simply 
> > be with the File-ID, and have the CUID be a simple marshalling option that 
> > lets the user override it to something unique if needed. I personally think 
> > that's simpler for 99.99% of cases and has an easy-out in the last 0.01%. 
> > Given that it already exists there's some desire to keep it since the work 
> > has already been done I understand.
> > 
> > > What are the obstacles for having CUID calculation done in the new 
> > > driver. It should have all the info it needs. What am I missing?
> > It's less of a difficulty in implementing and more hoping we could make the 
> > name mangling more simple and work by default without the driver. 
> > Also, we may need this support for a single case in OpenMP, and I'd prefer 
> > not need to generate the CUID for OpenMP offloading when it's unused the 
> > vast majority of the time. Generally I'd prefer if compiling for the host / 
> > device was conceptually the same to the user without requiring external 
> > values. If we're sold on the CUID method I can go forward with that, but 
> > from my perspective what it's buying us is the ability to compile the 
> > following
> > ```
> > static __device__ int a;
> > 
> > #ifdef MACRO
> >   do_something_with(a);
> > #else
> >   do_something_else();
> > #endif
> > ```
> > ```
> > clang foo.cu -DMACRO -c -o 1.o
> > clang foo.cu 1.o
> > ```
> > 
> > This is just a tough problem overall, I don't think there's a single 
> > perfect solution. Whatever we choose we'll be trading reproducibility for 
> > correctness or whatever. You have more seniority in this space so it's your 
> > call what you think I should go forward with. 
> Also, it's incredibly convoluted, but I can think of a way to break even the 
> current CUID for this.
> ```
> static __device__ int a;
> 
> __device__ int __attribute__((weak)) *a_ref = &a;
> ```
> ```
> $ clang a.cu -c -fgpu-rdc
> $ mv a.o b.o
> $ clang a.cu -c -fgpu-rdc
> $ nvlink a.o b.o -arch=sm_35 -o out.cubin
> nvlink error   : Multiple definition of '_ZL1a__static__d041026c8e4167e6' in 
> '1.o', first defined in 'a.o'
> nvlink fatal   : merge_elf failed
> ```
> The problem I have with this is that we use the command line to generate the 
> value, so they aren't going to be the same without the user manually 
> specifying it. I guess we could filter out only "relevant" command line 
> flags, maybe that's an option. I just think it's not intuitive for a name 
> mangling scheme to depend on something external, but there's definitely 
> advantages to doing it that way.

I'm not sure I follow the "they aren't going to be the same without the user 
manually specifying it." part. Do you mean that CUIDs passed to 
sub-compilations would not be same?
If so, why would that be the case? If would be up to the driver to pick the 
same set of inputs to hash into the cuid. We only case about single 
compilation. Separately compiling host/device with --cuda-host/device-only 
makes it two different compilations, which we may or may not provide any 
guarantees about. In case we don't we can document that it would be up to user 
to ensure consistency between host/device objects by using explicit --cuid 
argument. Within single top-level compilation the driver should have no problem 
picking single cuid value and passing it on to all subcompilations. 

> I don't think looking at the macros or the other arguments is a sound 
> solution in the first place.

They are part of the compilation input set, along with include-related options 
and, likely, options like `-std` that also affect the sources seen by compiler.

If we have to generate globally-stable cuid within a cc1 compilation, we have 
to take as much of the relevant input set for the compilation as practical. I 
believe preprocessor-related options are relevant to existing use patterns. 
E.g. compiling the same source with different preprocessor definitions does 
happen. 

We're dealing with more than one issue here.
* who/where is responsible for CUID generation:
  - driver only
  - CC1 only
  - driver as the primary source of CUID and CC1 as the fallback.

* how do we guarantee CUID stability within single TU compilation, while 
ensuring global uniqueness.
  - We can guarantee build-wise uniqueness if we delegate CUID generation to 
the build system which does know about all compilations and can simply 
enumerate all of them.
  - We can not generate globally unique CUID strictly within clang, whether by 
driver or by CC1. In both cases we'll have some chance of collisions and will 
need a way to deal with them.
  - Driver can guarantee within-compilation stability by generating CUID once 
and passing it to CC1 instances.
  - Generating CUID within CC1 relies on all CC1 instances producing the same 
CUID value. It's feasible if we can guarantee that all CC1 instances always 
operate on identical set of inputs taken into account during CUID generation. 
That is a dependency on implementation details as those inputs would likely 
depend on what the driver does. Can we make it work? Probably. But why? 

> It's less of a difficulty in implementing and more hoping we could make the 
> name mangling more simple and work by default without the driver.

I do not think "without the driver part" (e.g. directly running -cc1)  is a 
good metric for driving compiler development. It's the driver's explicit 
purpose to hide the complexity of the actual compiler command line. 

If you think there's a practical use case of fallback cuid generation in cc1, I 
would consider it if it were done in parallel with driver-generated CUID during 
regular top-level compilation. I.e. `clang a.cu` would run `clang -cc1 
--cuid.=<driver-provided-cuid>`, but if one runs `clang -cc1` w/o --cuid, one 
would be generated for them internally. I would still prefer to see a warning 
for that, because existence of CUID will be something not obvious to the users 
and it would likely be very easy to end up with mismatched CUIDs used between 
the host and device compilations.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125904/new/

https://reviews.llvm.org/D125904

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to