Re: [Cython] Adding GPU support to cython

2020-01-24 Thread Stefan Behnel
Hi Frank,

sorry for leaving this unanswered for a while. I'm far from an expert in
this, but it looks like no one else jumped in, so here's my response.


Schlimbach, Frank schrieb am 06.01.20 um 12:09:
> I would like to work on extending Cython with a way to offload Cython code to 
> a GPU. I found two related CEPs 
> (https://github.com/cython/cython/wiki/enhancements-opencl and 
> https://github.com/cython/cython/wiki/enchancements-metadefintions).

So, just for a bit of context, these CEPs were written a decade ago, and
Cython's code base, feature set, and the general Python ecosystem have
evolved a lot since then. For example, "cython.parallel" and "prange()"
weren't even implemented back then.


> My current thinking is that a solution along the lines of the OpenCL CEP is the 
> most effective one: it does require many code changes, but it seems to provide a 
> good tradeoff between usability and efficiency.
> 
> I would like to suggest a few modifications to this approach, like
> 
>   *   using SYCL instead of OpenCL to closely follow existing parallel/prange 
> semantics more easily
>   *   selecting the device (CPU, GPU) per region rather than per file
>   *   maybe allowing calling appropriately annotated and written external 
> functions
> 
> I would be very grateful for any thoughts about this topic in general and for 
> any advice on how to approach this so that a solution is found that is most 
> broadly useful and most cythonic.

It would definitely be cool to generate GPU support from the existing
Cython patterns, in addition to the OpenMP code that we already generate.
If that can be done, then users could enable GPU support by adding a C
compiler define to their CFLAGS (rather than rerunning Cython), or even
select between the two versions at runtime.
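
Just to sketch the idea (illustration only, nothing is generated yet, and the
define name is made up), the generated C could contain something like:

    #include <stddef.h>

    /* pick host OpenMP or GPU offload at C compile time via a user-provided
       define, e.g. CFLAGS="-DCYTHON_OFFLOAD" (hypothetical name) */
    void scale(double *x, size_t n)
    {
    #ifdef CYTHON_OFFLOAD
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    #else
        #pragma omp parallel for
    #endif
        for (size_t i = 0; i < n; i++)
            x[i] = 2.0 * x[i];
    }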

If the GPU support is per region, then how is the code section shipped to
the GPU? Is the infrastructure for this provided by the OpenCL framework or
does the user or the module need to set something up in addition?

Finally, generally speaking:
- PR welcome
- simple approach preferred (at least to get this started and prototyped)
- discussion welcome on this mailing list
- a GitHub feature ticket seems to be missing; it should link to this ML thread:
https://mail.python.org/pipermail/cython-devel/2020-January/005262.html

Stefan
___
cython-devel mailing list
cython-devel@python.org
https://mail.python.org/mailman/listinfo/cython-devel


Re: [Cython] Adding GPU support to cython

2020-01-24 Thread Schlimbach, Frank
Hi Stefan,
thanks for your response. Good to hear this is still of interest.

Yes, I realized these are rather old CEPs. I spent some time looking into the 
Cython code and concluded that the most consistent (and simplest) approach is 
to stick with OpenMP and use its offload pragmas (e.g. 'target', introduced in 
4.5). Given a properly set-up compiler, this would in theory only require one 
or two compiler flags to enable offloading. I even have a first prototype 
which generates code that existing compilers seem to swallow. It's not ready 
for a PR since I have not been able to get it linked and run on a GPU, and I 
wanted to get some general feedback first. You can find the code on my 
offload branch, https://github.com/fschlimb/cython/tree/offload (it's WIP, so 
please excuse that not all comments have been updated yet to reflect my 
changes).
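
Regarding the "one or two compiler flags" above: I have not verified this with 
my branch yet, so take the exact flags as an assumption, but with clang 
enabling offloading would mean adding something like 
'-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda' to the C flags, and with a gcc 
build that has the nvptx offload backend installed something like 
'-fopenmp -foffload=nvptx-none'.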

Here's what it does:
- accept a new 'with' directive 'device' which marks a region/block to be 
offloaded to a device (OpenMP target)
  - I also considered extending 'gil' or 'parallel' to accept an optional 
'device' argument, but an extra directive seemed more general/flexible since it 
also allows non-parallel code
  - I don't believe we should try to automate offloading right now. Once we 
have something that works on explicit demand we can still think about a 
performance model and auto-enable offloading.
- the DeviceWithBlockNode is added to the 'parallel stack' and can occur only 
as the outermost parallel directive
- a 'with device()' requires 'nogil'
- a 'with device()' will create a new scope annotated with a '#pragma omp 
target'
  - all variables that are assigned within the 'with device()' block are 
currently mapped as 'tofrom'
  - all other variables used are mapped as 'to'
  - identifying 'from' candidates is harder and not yet done (we need to know 
that the variable must be allocated but is not assigned before the 'with 
device()' block)
  - identifying 'alloc' candidates would also need additional analysis (e.g. 
not used outside the 'device()' block); the small OpenMP sketch after this 
list illustrates the four map kinds
- all object mode stuff (like exceptions for error handling) is currently 
disabled in a 'with device()' block
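
To make the map kinds concrete, here is a small hand-written OpenMP sketch in 
plain C (illustration only, not output of my branch; the names are made up):

    #include <stdio.h>

    void map_kinds_demo(void)
    {
        double in_only = 3.0;   /* only read inside the region     -> 'to'     */
        double in_out  = 1.0;   /* read and written inside         -> 'tofrom' */
        double out_only;        /* written inside, read afterwards -> 'from'   */
        double scratch;         /* device-only temporary           -> 'alloc'  */

        #pragma omp target map(to: in_only) map(tofrom: in_out) \
                           map(from: out_only) map(alloc: scratch)
        {
            scratch  = in_only * 2.0;
            in_out  += scratch;
            out_only = in_out + in_only;
        }
        printf("%f %f\n", in_out, out_only);
    }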

Example:

def f(int[:,::1] X):
    cdef int v = 1
    cdef int i
    with nogil, device(), parallel():
        for i in prange(4):
            X[i] = v

the 'with device' block becomes something like (simplified)

{
    size_t __pyx_v_X__count = __pyx_v_X.shape[0] * __pyx_v_X.shape[1];
    #pragma omp target map(to: __pyx_v_v) \
                       map(tofrom: __pyx_v_i, \
                           __pyx_v_X.data[0:__pyx_v_X__count], \
                           __pyx_v_X.memview, __pyx_v_X.shape, \
                           __pyx_v_X.strides, __pyx_v_X.suboffsets)
    {
        #pragma omp parallel
        #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i)
        for (__pyx_v_i = 0; __pyx_v_i < 4; ++__pyx_v_i) {
            __pyx_v_X[__pyx_v_i] = __pyx_v_v;
        }
    }
}

There are lots of things to be added and improved; in particular, I am currently 
adding an optional argument 'map' to 'device()' which allows manually setting 
the map-clauses for each variable. This is necessary not only for optimizations 
but also for sending only partial array data to/from the device (e.g. when the 
device memory cannot hold an entire array and the developer wants to block the 
computation). We can probably add some magic for simple cases, but there is 
probably no solution for the general problem of determining the accessed 
index-space.
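
At the OpenMP level, the kind of partial transfer such a 'map' argument would 
have to produce looks roughly like this (hand-written C sketch, names made up, 
not output of my branch):

    #include <stddef.h>

    /* process a large array tile by tile so that only one tile's data is
       mapped to/from the device at a time */
    void scale_tiled(double *data, size_t n)
    {
        const size_t tile = 1024;
        for (size_t start = 0; start < n; start += tile) {
            size_t len = (start + tile < n) ? tile : (n - start);
            #pragma omp target map(tofrom: data[start:len])
            for (size_t i = 0; i < len; i++)
                data[start + i] *= 2.0;
        }
    }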

Among others, things to also look at include
- non-contiguous arrays/memviews
- overlapping arrays/memviews
- keeping data on the device between 'with device()' blocks (USM (unified 
shared memory) or 'omp target data'?); see the sketch after this list
- error handling
- tests
- documentation/comments
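
For the data-residency point, a plain OpenMP 'target data' region would look 
roughly like this (again a hand-written C sketch, not Cython output); 'x' is 
mapped once and stays on the device across both kernels:

    #include <stddef.h>

    void two_kernels(double *x, size_t n)
    {
        #pragma omp target data map(tofrom: x[0:n])
        {
            #pragma omp target teams distribute parallel for
            for (size_t i = 0; i < n; i++)
                x[i] = 2.0 * x[i];

            #pragma omp target teams distribute parallel for
            for (size_t i = 0; i < n; i++)
                x[i] = x[i] + 1.0;
        }
    }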

I found that the functionality I needed to touch is somewhat scattered across 
the compiler pipeline. It might be worth thinking about restructuring a few 
things to make the whole OpenMP/parallel/offload stuff more maintainable. Of 
course you might see solutions other than mine that make this simpler.

Any thoughts/feedback/use cases appreciated

frank
