Re: [pocl-devel] "Some" questions on new device development

Pekka Jääskeläinen Fri, 16 Aug 2013 02:20:54 -0700

Hi Alun,

I'll answer some of your questions and hope that Kalle can
fill in with the ARM related ones.

On 08/16/2013 12:18 AM, Alun Evans wrote:
> It seems that cl_device_id.uninit() is never called?

It seems so. Now that I look at the code (clReleaseDevice, which would be
the most logical place), I recall there was an unsolved problem that multiple
contexts use the same device structs so the uninit() should be called only
after all contexts that use them are released. The device driver table is
initialized in context creation (only the first time) where the init() is
also called for each instance ('data') of the device driver.

So the question boils down to what to put in uninit(). One does
not want to free everything, as new contexts may want to use
the device again and initializing everything from scratch would be
too costly. Maybe the callback should be defined as a partial uninit
that is called whenever a cl_device_id reference counter goes to 0.
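
To make the partial uninit idea concrete, here is a rough sketch. The
struct and function names are made up for illustration and are not the
actual pocl device API; the assumption is that each context retains the
device on creation and releases it on destruction.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pocl device struct. */
struct device {
  int refcount;                    /* number of contexts using the device */
  int uninit_calls;                /* counts partial uninits, for illustration */
  void (*uninit)(struct device *); /* partial uninit callback, may be NULL */
};

/* Partial uninit: would release per-use resources only and keep the
   expensive-to-rebuild state. Here it just counts the calls. */
void partial_uninit(struct device *d) { d->uninit_calls++; }

void device_retain(struct device *d) { d->refcount++; }

/* Drops one reference. Returns 1 if this was the last reference (and the
   partial uninit was triggered), 0 otherwise. */
int device_release(struct device *d)
{
  if (d->refcount <= 0)
    return 0;                      /* nothing left to release */
  if (--d->refcount == 0) {
    if (d->uninit != NULL)
      d->uninit(d);
    return 1;
  }
  return 0;
}
```

With two contexts sharing one device, only the second release would
trigger the callback.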

> I notice in devices.c:pocl_init_devices(), that there is either an
> environment variable:
>
>      device_list = getenv(POCL_DEVICES_ENV);
>
> or a fallback to a hard coded list:
>
>      device_list = <list>;
>
> I was wondering what the thoughts were on adding a device API like
> .discover() or .probe() so that the driver can report whether it found
> any of its devices ?

Yes, we have thought about that. Especially when/if there's GPU support
e.g. via the Gallium compute or the libcuda, the device discovery should
make life easier for the usual CPU+GPU case.

For now it does not seem so useful, because you do not always want to
use all supported devices. E.g., when we want to simulate TTA devices,
we explicitly add a TTA device to the list. Would the device discovery API
you propose always add the TTA device when pocl is compiled with
TCE support? The current default of a single pthread CPU device is a rather
safe choice for the default list.

> i.e. something like:
>
>      pocl_device_types[i].discover();
>
> Also, I've been toying with the idea of having multiple devices of the
> same type, e.g. like if you plugged two GPUs into the same board. Any
> thoughts on that?

This is already implemented. One can "instantiate" the device
drivers multiple times. It's then up to the device driver to handle
resource sharing and to identify which physical hardware
to use (e.g. with device parameters or just some default ordering).
It's just not documented, it seems.

E.g., this should work and create 4 devices where the 2nd is
a single threaded simple CPU device, and the rest use the pthread
threading for parallel WG execution:

export POCL_DEVICES="pthread basic pthread pthread"
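
As an illustration of how such a list maps to device instances, here is
a minimal sketch. It is not the actual code in devices.c; the function
names and MAX_DEVICES are assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DEVICES 16   /* arbitrary limit for this sketch */

/* Splits a space-separated device list into driver names. Each names[i]
   points into buf, which is modified in place. Returns the number of
   device instances found (the same driver may appear several times). */
int parse_device_list(char *buf, const char *names[], int max)
{
  int n = 0;
  for (char *tok = strtok(buf, " "); tok != NULL && n < max;
       tok = strtok(NULL, " "))
    names[n++] = tok;
  return n;
}

/* Picks the list from POCL_DEVICES, falling back to a single pthread
   device when the variable is unset or empty. */
const char *device_list_or_default(void)
{
  const char *env = getenv("POCL_DEVICES");
  return (env != NULL && env[0] != '\0') ? env : "pthread";
}
```

For the list above this yields four entries, with "basic" in the second
slot, and each entry would get its own driver instance.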

> --------------------------------------------------------------------------------
>
> I'm building the host to be staged installed on x86_64, and when I
> installed pocl onto my staged system, I've got this ICD file:
>
>      bash$ cat /etc/OpenCL/vendors/pocl.icd
>      libpocl.so.1.2.0
>
> yet annoyingly the ICD can't find the pocl lib when it does dlopen()
> on it since dlopen only checks:
>
>         o   The cache file  /etc/ld.so.cache  (maintained  by  ldconfig(8))
>             is checked to see whether it contains an entry for filename.
>
>         o   The directories /lib and /usr/lib are searched (in that order).
>
> And I have:
>
>      bash$ ldconfig -p | grep pocl
>          libpoclu.so.1 (libc6,x86-64) => /usr/local/lib/libpoclu.so.1
>          libpoclu.so (libc6,x86-64) => /usr/local/lib/libpoclu.so
>          libpocl.so.1 (libc6,x86-64) => /usr/local/lib/libpocl.so.1
>          libpocl.so (libc6,x86-64) => /usr/local/lib/libpocl.so
>
> Changing pocl.icd.in to:
>
>      @libdir@/libpocl.so.VER
>
> Seemed to help me out there...

It used to be an absolute path like that, but someone (Kalle?) changed it
due to some problem it caused. I think it was because you can change the
PREFIX at build time (or even at install time?) so it was hard to
(re)generate the pocl.icd correctly, or something related to that.

The ICD specs state it's OK to rely on the dynlib search paths:
http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt
"Note that the library specified may be an absolute path or just a file
name."

So perhaps you can just add /usr/local/lib to ld.so.conf or LD_LIBRARY_PATH?

> --------------------------------------------------------------------------------
>
> Looking at this hunk in lib/CL/clCreateBuffer.c:
>
>        device_ptr = device->malloc(device->data, flags, size, host_ptr);
> ...
>        if (flags & (CL_MEM_ALLOC_HOST_PTR | CL_MEM_USE_HOST_PTR))
>          mem->mem_host_ptr = host_ptr;
>
> I'm confused by the _ALLOC_HOST_PTR. From the OpenCL doc:
>
>      This flag specifies that the application wants the OpenCL
>      implementation to allocate memory from host accessible memory.
>
> i.e. since I don't see host_ptr set anywhere, and I'm pretty sure
> host_ptr will be NULL in this case?

It seems a bit confusing, but it should work. If the device
driver can allocate a buffer from host-shared memory (as it always does
when using a CPU device driver), the returned pointer does not need special
handling. If the device driver cannot do that, it returns an error and
clCreateBuffer exits early there.

The confusion comes from the fact that mem_host_ptr should not be
assigned for device-allocated host memory, as it should be used only
when the app code has allocated the buffer. Thus I think the if should be
changed to not assign host_ptr (which can be anything) when
CL_MEM_ALLOC_HOST_PTR is set.
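
A sketch of what the changed condition could look like. The struct is a
cut-down, hypothetical stand-in for pocl's cl_mem and set_host_ptr() is
a made-up helper; only the flag values are taken from CL/cl.h.

```c
#include <stddef.h>

/* Flag values as defined in the OpenCL headers (CL/cl.h). */
#define CL_MEM_USE_HOST_PTR   (1 << 3)
#define CL_MEM_ALLOC_HOST_PTR (1 << 4)

/* Hypothetical, cut-down stand-in for pocl's cl_mem struct. */
struct mem_object {
  void *mem_host_ptr;   /* application-owned buffer, if any */
};

/* Records host_ptr only when the application owns the buffer, i.e. only
   for CL_MEM_USE_HOST_PTR. With CL_MEM_ALLOC_HOST_PTR the memory belongs
   to the implementation, so mem_host_ptr is left untouched. */
void set_host_ptr(struct mem_object *mem, unsigned long flags, void *host_ptr)
{
  if (flags & CL_MEM_USE_HOST_PTR)
    mem->mem_host_ptr = host_ptr;
}
```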

> --------------------------------------------------------------------------------
>
> What are your thoughts on debian and/or rpm packaging of libpocl ?

There has been activity for Debian packaging (though I haven't heard
of Vincent Danjean's deb efforts for a while) and this week
fabiand at the IRC channel has been talking about RPM packaging problems.
So, if you want to help making the packages happen, you might want to
discuss with these people.

> --------------------------------------------------------------------------------
>
> I noticed cellspu.c seems to be using:
>
>        al = &(kernel->dyn_arguments[i]);
>
> Whereas all the other devices are using run.arguments ?

OK, the clSetKernelArg bug is not fixed in that driver. Unfortunately I
cannot fix this as I do not have a working Cell setup to test with. Also
it's unfortunate that LLVM removed the support for SPU in 3.2 :I

> --------------------------------------------------------------------------------
>
> There is this hunk in the OpenCL spec:
>
>      If the argument is declared with the __local qualifier, the
>      arg_value entry must be NULL
>
> yet I can generate:
>
>      arg[0]: Local arg size 8 (0x92d838)
>
> Should this be detectable as an error somewhere?

Yes, there are several parts of the spec left unimplemented here and there.
Patches to fill the holes are very much welcome.
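
For instance, the missing check could look roughly like this.
check_kernel_arg() and the is_local parameter are hypothetical (in pocl
that information would have to come from the kernel's argument metadata),
and the choice of error codes is my reading of the clSetKernelArg
description, so please double-check it against the spec.

```c
#include <stddef.h>

/* Error codes as defined in CL/cl.h. */
#define CL_SUCCESS             0
#define CL_INVALID_ARG_VALUE (-50)
#define CL_INVALID_ARG_SIZE  (-51)

/* Validates a clSetKernelArg-style call for a single argument.
   is_local tells whether the kernel declares the argument __local. */
int check_kernel_arg(int is_local, size_t arg_size, const void *arg_value)
{
  if (is_local) {
    if (arg_value != NULL)
      return CL_INVALID_ARG_VALUE;  /* __local args carry a size only */
    if (arg_size == 0)
      return CL_INVALID_ARG_SIZE;   /* a zero-sized local buffer is invalid */
  }
  return CL_SUCCESS;
}
```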

> --------------------------------------------------------------------------------
>
> When I compile, I see a few gcc warnings fly past, any plans for
> "-Wall -Werror" ?

Sure, we can add the strict warning flags if someone makes the
code base build first with them. Did I mention patches are welcome? :-)

> --------------------------------------------------------------------------------
>
> I notice in pocl_device.h there is :
>
>      typedef void (*pocl_workgroup) (void **, struct pocl_context *);
>
> Leading to:
>
>      void *arguments[kernel->num_args + kernel->num_locals];
>
>      w (arguments, pc);
>
> Does this mean all arguments are effectively promoted to sizeof(void *)?
>
> i.e. I guess there is no problem with 16-bit and 8-bit scalar
> arguments here.
>
> What happens if the host has 64-bit pointers and the device has 32-bit
> pointers ? I guess *I* should carefully setup the argument list before
> sending to the device and construct it in 32-bit quantity.

Yes, in the driver you need to know how your device is commanded to execute
the kernels. See, for example, the TCE driver, where we have a simple
protocol of pushing kernel commands to the global memory of the device to
control the execution.
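
To sketch the 64-bit host / 32-bit device case: the driver could repack
the argument list into 32-bit words before sending it over. This is not
how the TCE driver does it; struct host_arg and pack_args32() are
invented for the example, and byte order conversion is left out.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical host-side description of one kernel argument. */
struct host_arg {
  int is_pointer;        /* 1: buffer argument, 0: scalar passed by value */
  uint64_t device_addr;  /* device address of the buffer (pointer args) */
  uint32_t scalar;       /* value (scalar args of up to 32 bits) */
};

/* Packs the arguments into 32-bit words for a device with 32-bit
   pointers. Returns the number of words written, or -1 if the output
   array is too small or a device address does not fit in 32 bits. */
int pack_args32(const struct host_arg *args, size_t n,
                uint32_t *out, size_t out_len)
{
  if (n > out_len)
    return -1;
  for (size_t i = 0; i < n; i++) {
    if (args[i].is_pointer) {
      if (args[i].device_addr > UINT32_MAX)
        return -1;
      out[i] = (uint32_t)args[i].device_addr;
    } else {
      out[i] = args[i].scalar;
    }
  }
  return (int)n;
}
```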

Beware that there's a known related issue here that also affects the
CPU devices:
https://bugs.launchpad.net/pocl/+bug/987905

You can see two XFAILing test cases (passing vectors and structs as scalars)
that expose issues arising from the different ways of passing aggregate values
to functions. I've been thinking that we could try to
fix these issues by using the SPIR target of LLVM for kernels to make sure
their calling convention maps 1:1 to the clSetKernelArg parameter order
(everything passed by pointers). Not sure when I'll have time to do
anything about this, though.

> --------------------------------------------------------------------------------
>
> As I'm building the host to be staged installed on x86_64, as well as
> compiling an arm target, so I'm passing the following to configure:
>
>      LLVM_CONFIG=/home/alun/local-llvm/usr/local/bin/llvm-config
>
> (Since I'm doing a DESTDIR:=~/local-llvm/ install, then tarball)
>
> This of course means that I have a problem with @CLANG@
>
>      config.h:
>
>      /* LLVM compiler executable. */
>      #define LLC "/home/alun/local-llvm/usr/local/bin/llc"
>
>      /* clang executable. */
>      #define CLANG "/home/alun/local-llvm/usr/local/bin/clang"
>
>
> So far I've just been hacking the scripts/pocl-* but some nicer
> solution would be good...?

You could try Kalle's work which avoids the use of the
scripts altogether and uses Clang and LLVM via their
library APIs. ./configure --enable-llvmapi

> - I guess I don't care about these paths:
>
> *** lib/CL/devices/common.c:llvm_codegen()
>      CLANG " -target %s %s -c -o %s.o %s",
>      LLC " " HOST_LLC_FLAGS " -o %s %s",
>      LINK_CMD " -target "OCL_KERNEL_TARGET
>
> Unless I begin to try to call llvm_codegen() from my new
> device.run()... but more about that in a minute...
>
> --------------------------------------------------------------------------------
>
> So that was the trivial stuff :) Now onto the harder issue.
>
> I'm reasonably familiar with the autoconf host/build/target
> selections, but I know it's a pain when you have multiple targets,
> i.e. like when I've been building LLVM with:
>
>      --enable-targets=arm,x86,x86_64
>
> For building pocl, it seems like the goal of configure.ac is to get
> this triple of variables:
>
>      OCL_TARGETS, KERNEL_DIR, OCL_KERNEL_TARGET
>
> And the default AC $TARGET stuff is unused.
>
>
> KERNEL_DIR/ OCL_KERNEL_TARGET then get used as the llvm_target_triplet
> and such in :
>
>      lib/CL/devices/basic/basic.h
>      lib/CL/devices/pthread/pocl-pthread.h
>
>
> But they also get used in
>
> *** lib/CL/devices/common.c:llvm_codegen()
>      CLANG " -target %s %s -c -o %s.o %s",
>      LLC " " HOST_LLC_FLAGS " -o %s %s",
>      LINK_CMD " -target "OCL_KERNEL_TARGET
>
> - certainly meaning I can't use llvm_codegen() unless I refactor it to
>    require the target and options passed in.

Right. The basic and pthread device drivers assume the case where
host=device. So to use the same helper functions for generating code
for another target, llvm_codegen() needs to take in at least the target,
so some refactoring is needed.
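
Roughly what I mean, as a sketch only: the command line gets the target
and flags as parameters instead of compile-time macros. build_clang_cmd()
and CLANG_PATH are made-up names (in pocl the compiler path would come
from the CLANG define in config.h); the format string mirrors the one
quoted from llvm_codegen() above.

```c
#include <stdio.h>

/* Placeholder; in pocl this would be the CLANG define from config.h. */
#define CLANG_PATH "clang"

/* Formats a clang command line for an explicitly given target triple
   and extra flags. Returns 0 on success, -1 if the buffer is too small
   or formatting fails. */
int build_clang_cmd(char *buf, size_t len, const char *target,
                    const char *flags, const char *out, const char *in)
{
  int n = snprintf(buf, len, CLANG_PATH " -target %s %s -c -o %s.o %s",
                   target, flags, out, in);
  return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

A device driver could then pass its own llvm_target_triplet and flags
(say, "arm-linux-gnueabihf" and "-mcpu=cortex-a9") while basic and
pthread keep passing the host values.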

> Then I end up with a problem in lib/kernel/arm/Makefile.am
>
> Since I have:
>
>      KERNEL_TARGET=@OCL_KERNEL_TARGET@
>      TARGET_DIR=arm
>
> and that'll get me issues like:
>
> ,----
> | /home/alun/local-llvm/usr/local/bin/clang -Xclang -ffake-address-space-map
> |   -emit-llvm   -fsigned-char -c -target x86_64-unknown-linux-gnu -o
> |   add_sat.cl.bc -x cl ./../add_sat.cl -include
> |   ../../../include/arm/types.h -include /home/alun/work/pocl-2/include/_kernel.h
> | In file included from <built-in>:158:
> | In file included from <command line>:2:
> | /home/alun/work/pocl-2/include/_kernel.h:223:1: error: static_assert failed
> |       "size_t"
> | _CL_STATIC_ASSERT(size_t, sizeof(size_t) == sizeof(void*));
> | ^                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> | /home/alun/work/pocl-2/include/_kernel.h:76:37: note: expanded from macro
> |       '_CL_STATIC_ASSERT'
> | #  define _CL_STATIC_ASSERT(_t, _x) _Static_assert(_x, #_t)
> |                                     ^
> `----
>
> I notice cellspu is avoiding this world with :
>
>      CLANGFLAGS += -target cellspu-v0
>      CLANG_DEFAULT_INCLUDES = -include $(top_builddir)/include/cellspu/types.h
>
> I guess this is the issue of ARM being a host and device target, and I
> was wondering what thoughts you had there?

I think your scenario hasn't been accounted for. I.e., the assumption
has been that ARM is (also) the OpenCL host (and the pocl build host),
so there has been no need to compile it in the "heterogeneous device" mode,
where OCL_KERNEL_TARGET is not detected from the current host but
fixed by hand.

> Currently I've tweaked the arm/Makefile.am to :
>      KERNEL_TARGET:=arm-linux-gnueabihf
>      TARGET_DIR=arm
>      EXTRA_CLANGFLAGS:= \
>    -mcpu=cortex-a9 \
>    -mfloat-abi=softfp \
>    -mfpu=neon \
>    -mfpu=vfpv3
>
>
> But this then means that while the kernel-arm-linux-gnueabihf.bc gets
> made correctly, the kernel doesn't, so I get:
>
> ,----
> | WARNING: Linking two modules of different target triples:
> | /usr/local/lib/pocl/arm/kernel-arm-linux-gnueabihf.bc:
> | 'armv7-linux-gnueabihf' and 'armv4t-linux-gnueabihf'
> `----
>
> i.e. I think in addition to the llvm target triple, the device driver
> needs some way to set the compilation flags? I wonder if it shouldn't
> be something like:
>
>      Usage: $0 [-t <llvm_target_triplet> -f <flags>] -o output input
>

Check the llvm_target_triplet field in the device driver struct.

> btw I had to remove this section in pocl-kernel.in:
>
> ,----
> | #pure clang doesn't allow "-target tce-tut-llvm"
> | case $target in
> |   tce-*)
> |     target_flags="" ;;
> |   *)
> |     target_flags="-target $target";;
> | esac
> | @CLANG@ @HOST_CLANG_FLAGS@ $target_flags -c -o ${output_file}.o -x c - <<EOF
> `----
>
> Since I was getting:
>
> ,----
> | /usr/local/bin/clang -target arm-linux-gnueabihf -c -o
> |   /tmp/poclN2oGr9/newdev/dot_product/descriptor.so.o -x c -
> | /usr/bin/as: unrecognized option '-mfloat-abi=softfp'
> `----
>
> i.e. I'm not sure we want -target here for a host compilation ?

Finding the correct and compatible ARM target triplets has been a huge mess,
which Kalle knows better than I do.

> One last point, I also noticed that configure.ac declares
> TARGET_LLC_FLAGS, but it seems to be unused.

OK, some leftovers from before the target cleanup. A problem here is that
one can have multiple targets (devices), so which targets should these
flags be passed to and which not?

-- 
Pekka
