[PATCH] D127901: [LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode

Joseph Huber via Phabricator via cfe-commits Thu, 16 Jun 2022 14:54:35 -0700

jhuber6 added a comment.

In D127901#3590402 <https://reviews.llvm.org/D127901#3590402>, @tra wrote:


> Playing devil's advocate, I've got to ask -- do we even want to support JIT?
>
> JIT brings more trouble than benefits.
>
> - substantial start-up time on nontrivial apps. Last time I tried launching a 
> tensorflow app and needed to JIT its kernels, it took about half an hour 
> until JIT was done.
> - substantial increase in the size of the executable. Statically linked 
> tensorflow apps are already pushing the limits of the executables that use 
> small memory model (-mcmodel=small is the default for clang and gcc, AFAICT).
> - very easy to make a mistake, compile for the wrong GPU and not notice it, 
> because JIT will try to keep it running using PTX.
> - makes executables and tests non-hermetic -- the code that will run on GPU 
> (and thus the behavior) will depend on the particular driver version the app 
> uses at runtime.
>
> Benefits: It *may* allow us to run a miscompiled/outdated CUDA app. Whether 
> it's actually a benefit is questionable. To me it looks like a way to paper 
> over a problem.
>
> We (google) have experienced all of the above and ended up disabling PTX 
> JIT'ting altogether.
>
> That said, we do embed PTX by default at the moment, so this patch does not 
> really change the status quo, so I'm not opposed to it, as long as we can 
> disable PTX embedding if we need/want to.

I guess it's one of those situations where, since we already have the PTX when
we do LTO, I figured I may as well add it. I don't know much about how it is
used w.r.t. performance, but I saw this as a shortcoming of Clang's RDC-mode
support, considering that NVIDIA can JIT RDC-mode compilations. We could
definitely have an argument that disables this. I'm assuming Clang already has
an argument for that which we could overload to pass something to the linker
wrapper. Or we could decide which behaviour we want to be the default.

The problem with LTO, however, is that many "compile-only" flags are suddenly
relevant during linking. Say someone builds with
`clang foo.cu -c -no-embed-ptx -foffload-lto` and then links with
`clang foo.o`: the link step won't have the argument. I think regular LTO can
embed the command line in the bitcode or something. We also have the option to
embed the arguments in the binary format I made.

Also, one problem with the RDC-mode support here is that we don't error
gracefully if something is wrong with the image, so the following is really
unhelpful:

  clang app.cu --offload-arch=sm_<not correct> -fgpu-rdc --offload-new-driver
  ./a.out // Gives no output, kernel simply never executes.
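
On the application side, a sketch like the one below would at least surface a
failed launch instead of exiting silently. It assumes the runtime reports an
error from `cudaGetLastError()` or `cudaDeviceSynchronize()` when no usable
image exists for the device, which is not confirmed here for the new driver.
The kernel name `hello` is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; any kernel would do.
__global__ void hello() { printf("hello from the device\n"); }

int main() {
  hello<<<1, 1>>>();

  // Launch-time failures (for example, no image for this GPU) are expected
  // to show up here. That is an assumption for the new driver.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Errors raised while the kernel runs are reported on synchronization.
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  return 0;
}
```

This only helps applications that check errors themselves; it does not replace
a proper diagnostic from the toolchain or runtime.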


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127901/new/

https://reviews.llvm.org/D127901

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
