[PATCH] D99683: [HIP] Support ThinLTO

Teresa Johnson via Phabricator via cfe-commits Tue, 06 Apr 2021 15:38:19 -0700

tejohnson added a comment.

In D99683#2672578 <https://reviews.llvm.org/D99683#2672578>, @yaxunl wrote:


> In D99683#2672554 <https://reviews.llvm.org/D99683#2672554>, @tejohnson wrote:
>
>> This raises some higher level questions for me:
>>
>> First, how will you deal with other corner cases that won't or cannot be 
>> imported right now? While enabling importing of noinline functions and 
>> cranking up the threshold will get the majority of functions imported, there 
>> are cases that we still won't import (functions/vars that are interposable, 
>> certain funcs/vars that cannot be renamed, most non-const variables with 
>> non-trivial initializers).
>
> We will document the limitations of ThinLTO support in the HIP toolchain and
> recommend that users not use ThinLTO in those corner cases.
>
>> Second, force importing of everything transitively referenced defeats the 
>> purpose of ThinLTO and would probably make it worse than regular LTO. The 
>> main entry module will need to import everything transitively referenced 
>> from there, so everything not dead in the binary, which should make that 
>> module post importing equivalent to a regular LTO module. In addition, every 
>> other module needs to transitively import everything referenced from those 
>> modules, making them very large depending on how many leaf vs non-leaf 
>> functions and variables they contain. What is the goal of doing ThinLTO in 
>> this case?
>
> The objective is to reduce optimization/codegen time by using ThinLTO's
> multi-threaded backends. For example, say I have 10 modules, each containing
> one kernel. With full LTO linking, I get one big module containing 10 kernels
> with all functions inlined, and I have one thread for optimization/codegen.
> With ThinLTO, I get one kernel in each module, with all functions inlined.
> AMDGPU internalization and global DCE will remove the functions not used by
> that kernel in each module. I will get 10 threads, each doing
> optimization/codegen for one kernel. Theoretically, there could be a 10x
> speedup.

That will work as long as there are no dependence edges anywhere between the 
kernels. Is this a library that has a bunch of totally independent kernels only 
called externally?
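
To make the concern concrete, here is a rough toy model (Python, not anything
from the patch). The assumptions are all mine: every module force-imports the
full transitive closure of its kernel, each backend thread's time is the sum of
hypothetical per-function costs in that closure, and full LTO processes each
live function once on a single thread. The function names, graphs, and costs
are made up for illustration.

```python
from typing import Dict, List, Set


def transitive_refs(root: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return root plus everything reachable from it in the reference graph."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def estimated_speedup(kernels: List[str],
                      graph: Dict[str, List[str]],
                      cost: Dict[str, float]) -> float:
    """Toy ratio of single-threaded full LTO time to parallel ThinLTO time."""
    closures = [transitive_refs(k, graph) for k in kernels]
    # Full LTO: one thread handles every live function exactly once.
    full_lto = sum(cost[f] for f in set().union(*closures))
    # ThinLTO with forced importing: one thread per kernel module, so the
    # wall time is set by the largest post-import module.
    thin_lto = max(sum(cost[f] for f in c) for c in closures)
    return full_lto / thin_lto


kernels = [f"k{i}" for i in range(10)]

# Case 1: totally independent kernels, each with its own small helper.
independent = {k: [f"helper{i}"] for i, k in enumerate(kernels)}
unit_cost = {name: 1.0 for k, callees in independent.items()
             for name in [k, *callees]}

# Case 2: every kernel references the same expensive (hypothetical) helper.
shared = {k: ["big_helper"] for k in kernels}
shared_cost = {**{k: 1.0 for k in kernels}, "big_helper": 18.0}

print(estimated_speedup(kernels, independent, unit_cost))
print(estimated_speedup(kernels, shared, shared_cost))
```

In this model the independent case reaches the full 10x, while the shared case
falls well short of it, because the shared helper is imported into and
recompiled in every module.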


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99683/new/

https://reviews.llvm.org/D99683

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
