On 03/23/2013 08:33 PM, Peter Colberg wrote:
> On Sat, Mar 23, 2013 at 02:11:37PM -0400, Peter Colberg wrote:
>> If this transformation needs to be done at launch time anyway, we
>> could convert __local and __constant kernel arguments to automatic
>> arrays.
Did you manage to make the automatic locals work through
the CUDA driver API?
> To be precise, the LLVM IR of an OpenCL kernel such as
>
> __kernel void func(__global uint *a,
> __local uint *b,
> __local uint *c,
> __constant uint *d,
> __constant uint *e)
> {
>
> would be transformed to the IR equivalent of
>
> __constant uint d[d_size];
> __constant uint e[e_size];
>
> __kernel void func(__global uint *a)
> {
> __local uint b[b_size];
> __local uint c[c_size];
>
> Pointers to the __constant arrays would be obtained using
> cuModuleGetGlobal() and initialised using cuMemcpyDtoD().
Yes. If you have verified that function-scope locals and
constants work, then this is the way forward.
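For the __constant part through the driver API, I would expect something
roughly like the following untested sketch (error handling is minimal;
the symbol name "d" follows the generated array in your example, and
`module`/`src` are assumed to be a loaded CUmodule and a device buffer
already holding the argument value):

```c
CUdeviceptr d_ptr;
size_t d_bytes;
/* Look up the module-level __constant array by name... */
CUresult err = cuModuleGetGlobal(&d_ptr, &d_bytes, module, "d");
if (err == CUDA_SUCCESS)
  /* ...and fill it from the device-side staging buffer. */
  err = cuMemcpyDtoD(d_ptr, src, d_bytes);
```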
All the kernel compiler passes are under lib/llvmopencl.
Check llvmopencl/GenerateHeader.cc.
This badly named LLVM pass:
a) Produces the metadata header for the kernels, which is compiled
to a .so and linked + dlopened to finally get the metadata into the
host API. (Yes, this should be replaced with direct LLVM API calls
that load the module and produce the information directly. :) )
b) Converts automatic locals to __local kernel arguments, to
allow treating these two cases uniformly in the rest of pocl.
It's the ProcessAutomaticLocals() method.
Because there is no way (at least none known when that was written) to
modify a function's signature in place, it creates a new function with
the modified signature (with the extra local pointer args).
The part that handles the automatic locals should really be separated
into another pass. I suppose it was included in this one because it
changes the kernel signature and thus affects the metadata that
describes the (original) kernel arguments.
One thing to understand is that in the common case we prefer the
locals as arguments to implement thread safety when executing
multiple work-groups in parallel: each work-group gets its own local
region that way. On NVIDIA GPUs I suppose this is handled
automatically thanks to the disjoint local address spaces of the
separate cores, or the runtime handles it.
Anyway, you are attempting the opposite: gather all the local
pointer args whose sizes are known at that point and convert them to
automatic locals.
LLVM bitcode represents the automatic locals as magic LLVM
module-level global values. See line 295 in GenerateHeader.cc:
if (i->getName().startswith(funcName + ".")) {
// Additional checks might be needed here. For now
// we assume any global starting with kernel
// name is declaring a local variable.
...
That is, they are LLVM module level globals in the local address space of
the device with a naming convention kernel_func_name.local_var_name.
You could create a new LLVM pass that copies this functionality
but reverses it: it removes the local arguments, or perhaps
converts them to dummy arguments to retain the original 1:1 mapping
(clSetKernelArg). This pass is then called in pocl-workgroup
when compiling for NVIDIA (or other targets that prefer the locals in
this format). How to pass the local argument sizes etc. to the pass it?
Perhaps as a command line switch.
Or you could call libLLVM (the C bindings) directly from the driver
instead of launching your pass via scripts; that way we start moving
towards cleaner interfacing with Clang/LLVM from the host runtime.
The script way of launching the LLVM passes has been very useful during the
kernel compiler development, but it's getting more and more of a nuisance.
General kernel compiler pass development info:
Check doc/envs.txt:
You can export POCL_LEAVE_TEMP_DIRS=1 and
export POCL_TEMP_DIR=tmp. Then the kernel compilation
intermediate results (bitcodes etc.) are left in the 'tmp' dir
(which has to exist before executing your test program).
You can inspect those bitcodes with 'llvm-dis', which dumps
the LLVM IR in textual form.
Feel free to ask for further info. You might want to join the
IRC channel too.
HTH,
--
--Pekka
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel