tra added a comment.

In D127901#3603118 <https://reviews.llvm.org/D127901#3603118>, @jhuber6 wrote:

> In D127901#3603006 <https://reviews.llvm.org/D127901#3603006>, @tra wrote:
>
>> Then we do need a knob controlling whether we do want to embed PTX or not. 
>> The default should be "off" IMO.
>> We currently have `--[no-]cuda-include-ptx=` we may reuse for that purpose.
>
> We could definitely re-use that. It's another option that probably needs to go 
> inside the binary itself, since normally those options aren't passed to the 
> linker.

I'm not sure I follow. What do you mean by "go inside the binary itself"? I 
assume you mean the per-GPU offload binaries inside the per-TU .o, so that the 
option could be consulted when that GPU object gets linked into the GPU 
executable?

What if different TUs that we're linking were compiled using 
different/contradictory options?
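For concreteness, here's a hypothetical pair of compilations (the flag spellings exist today; the file names and arch choices are made up) where two TUs disagree about PTX embedding:

```
# a.o asks for sm_70 PTX to be embedded; b.o explicitly suppresses it.
clang++ -fgpu-rdc --offload-arch=sm_70 --cuda-include-ptx=sm_70    -c a.cu -o a.o
clang++ -fgpu-rdc --offload-arch=sm_70 --no-cuda-include-ptx=sm_70 -c b.cu -o b.o

# At link time, which of the two settings should the PTX emission honor?
clang++ -fgpu-rdc a.o b.o -o app
```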

The problem is that conceptually `--cuda-include-ptx` ultimately affects the 
final GPU executable. If we're in RDC mode, then per-TU PTX is probably useless 
for JIT-ing purposes, as you can't link PTX files into a final executable. 
Well, I guess it might sort of be possible by concatenating the .s files, 
adding a bunch of forward declarations for the functions, merging debug info, 
removing duplicate weak functions... Basically, by writing a linker for a new 
"PTX" architecture. Doable, but so not worth it, IMO.

In RDC mode, TUs are compiled to IR and PTX generation shifts to the final link 
phase. I think we may need to rely on the user to supply the PTX controls there 
explicitly. Or, at the very least, check that the `cuda-include-ptx` setting 
propagated from the TUs is used consistently across all of them.
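One way to at least observe what each TU ended up carrying is NVIDIA's `cuobjdump`, assuming the objects contain a standard CUDA fatbinary (illustrative only, file names made up):

```
# List the PTX entries embedded in each TU's object file; a mismatch
# (e.g. a.o carries sm_70 PTX while b.o carries none) is exactly the
# contradictory-options situation described above.
cuobjdump --list-ptx a.o
cuobjdump --list-ptx b.o
```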

> We'll probably just use the same default as that flag (which is on I think).
>
>> This brings another question -- which GPU variant will we generate PTX for? 
>> One? All (if more than one is specified)? The ones specified by 
>> `--[no-]cuda-include-ptx=` ?
>
> Right now, it'll be the one that's attached to the LTO job. So if the user 
> specified `sm_70` they'll get PTX for `sm_70`.

I mean the case where the user specifies more than one GPU variant to target, 
e.g. both `sm_70` and `sm_50`. 
PTX for the former would probably give better performance if we run on a newer 
GPU (e.g. `sm_80`). 
On the other hand, it will likely fail if we attempt to run from that PTX on 
`sm_60`. 
Both would probably fail if we were to run on `sm_35`. Including all PTX 
variants is wasteful (TensorFlow-using applications are already pushing the 
limits of the small memory model and sometimes fail to link because the 
executable is too large).
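If the link-time PTX emission honored the existing flags, the user could make that trade-off explicit per architecture. A hypothetical invocation (current flag spellings, made-up file name) that builds code for both targets but keeps PTX only for `sm_70`:

```
# Target two GPU variants; embed PTX only for the newer one so it can be
# JIT-compiled on future GPUs, and drop the sm_50 PTX to save binary size.
clang++ --offload-arch=sm_50 --offload-arch=sm_70 \
        --no-cuda-include-ptx=sm_50 --cuda-include-ptx=sm_70 \
        app.cu -o app
```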

The point is that there's no "one true choice" of PTX architecture, just as 
there's no universally safe or sensible choice of offload target. Only the end 
user knows their intent. We do need explicit controls and a documented policy 
on what we produce by default.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127901/new/

https://reviews.llvm.org/D127901
