[PATCH] D86376: [HIP] Improve kernel launching latency

Yaxun Liu via Phabricator via cfe-commits Wed, 26 Aug 2020 09:29:31 -0700

yaxunl added a comment.

In D86376#2236704 <https://reviews.llvm.org/D86376#2236704>, @tra wrote:


> 



> It's still suspiciously high. AFAICT, config/push/pull is just an std::vector 
> push/pop. It should not take *that* long.  Few function calls should not lead 
> to microseconds of overhead, once linker has resolved the symbol, if they 
> come from a shared library.
> https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590
>
> I wonder if it's the logging facilities that add all this overhead.

You are right. The 19 us are mostly due to overhead from rocprofiler. If I do 
not use rocprofiler and use a simple loop to measure execution time of 
`__hipPushCallConfigure/__hipPopCallConfigure`, I got 180 ns.

>> The kernel launching latency are measured by a simple loop in which a simple 
>> kernel is launched then hipStreamSynchronize is called. trace is collected 
>> by rocprofiler and the latency is measured from the end of 
>> hipStreamSynchronize to the real start of kernel execution. Without this 
>> patch, the latency is about 77 us. With this patch, the latency is about 46 
>> us. The improvement is about 40%. The decrement of 31 us is more than 19 us 
>> since it also eliminates the overhead of kernel stub.
>
> This is rather surprising. A function call by itself does *not* have such 
> high overhead. There must be something else. I strongly suspect logging. If 
> you remove logging statements from push/pop without changing anything else, 
> how does that affect performance?

The 19 us overhead was due to rocprofiler. Without rocprofiler, I can only 
measure the average duration of a kernel launching together with 
hipStreamSynchronize. When the kernel is empty, it serves as an estimation of 
kernel launching latency. With such measurement, the latency is about 14.0 us. 
The improvement due to this patch is not significant.

>> In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the 
>> kernel symbol.
>
> Do you mean the host-side symbol, registered with the runtime that you've 
> described above? Or do you mean that the device-side symbol is somehow 
> visible from the host side. I think that's where HIP is different from CUDA.

I mean the host-side symbol. A host program can only use host-side symbol to 
launch a kernel.

> I do not follow your reasoning why the stub name is a problem. It's awkward, 
> yes, but losing the stub as a specific kernel entry point seems to be a real 
> loss in debugability, which is worse, IMO.
> Could you give me an example where the stub name causes problems?

For example, in HIP program, there is a kernel `void foo(int*)`. If a C++ 
program wants to launch it, the desirable way is

  void foo(int*);
  hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, currently the users have to use

  void __device_stub_foo(int*);
  hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D86376: [HIP] Improve kernel launching latency

Reply via email to