Hi Stefan,
thanks for your response. Good to hear this is still of interest.
Yes, I realized these are rather old CEPs. I spent some time looking into
the Cython code and concluded that the most consistent (and simplest)
approach is to stick with OpenMP and use its offload pragmas (e.g. 'target',
introduced in OpenMP 4.5). Given a properly set up compiler, this would in
theory only require one or two compiler flags to enable offloading. I even
have a first prototype which generates code that existing compilers seem to
swallow. It's not ready for a PR since I have not been able to get it linked
and run on a GPU yet, and I wanted to get some general feedback first. You
can find the code on my offload branch
https://github.com/fschlimb/cython/tree/offload (it's work in progress, so
please excuse that not all comments have been updated yet to reflect my
changes).
Here's what it does:
- accept a new 'with' directive 'device' which marks a region/block to be
offloaded to a device (OpenMP target)
- I also considered extending 'gil' or 'parallel' to accept an optional
'device' argument, but an extra directive seemed more general/flexible since
it also allows non-parallel code
- I don't believe we should try to automate offloading right now. Once we
have something that works on explicit demand we can still think about a
performance model and auto-enable offloading.
- the DeviceWithBlockNode is added to the 'parallel stack' and can occur only
as the outermost parallel directive
- a 'with device()' requires 'nogil'
- a 'with device()' will create a new scope annotated with a '#pragma omp
target'
- all variables which get assigned within the 'with device()' block are
currently mapped as 'tofrom'
- all other variables used are mapped as 'to'
- identifying 'from' candidates is harder and not yet done (need to know that
there is required allocation but no assignment before the 'with device()' block)
- identifying 'alloc' candidates would also need additional analysis (e.g.
not used outside the 'device()' block)
- all object mode stuff (like exceptions for error handling) is currently
disabled in a 'with device()' block
Example:
def f(int[:,::1] X):
    cdef int v = 1
    cdef int i
    with nogil, device(), parallel():
        for i in prange(4):
            X[i] = v
the 'with device()' block becomes something like this (simplified):
{
    size_t __pyx_v_X__count = __pyx_v_X.shape[0] * __pyx_v_X.shape[1];
    #pragma omp target map(to: __pyx_v_v) \
                       map(tofrom: __pyx_v_i, \
                                   __pyx_v_X.data[0:__pyx_v_X__count], \
                                   __pyx_v_X.memview, __pyx_v_X.shape, \
                                   __pyx_v_X.strides, __pyx_v_X.suboffsets)
    {
        #pragma omp parallel
        #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i)
        for (__pyx_v_i = 0; __pyx_v_i < 4; ++__pyx_v_i) {
            __pyx_v_X[__pyx_v_i] = __pyx_v_v;
        }
    }
}
There are lots of things to be added and improved. In particular, I am
currently adding an optional argument 'map' to 'device()' which allows
manually setting the map clauses for each variable. This is necessary not
only for optimizations but also for sending only partial array data to/from
the device (e.g. when the device memory cannot hold an entire array, the
developer would block the computation). We can probably add some magic for
simple cases, but there is probably no general solution to the problem of
determining the accessed index space.
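To illustrate what such a manual map specification would enable (independent
of the exact Cython-level syntax, which is still in flux), here is a
hand-written plain-C sketch, not output of my branch, which maps only a block
of rows of a C-contiguous buffer instead of the whole array:

#include <stddef.h>

/* Hand-written illustration: transfer only rows
   [row_start, row_start + row_count) of a row_len-wide buffer to the device
   and back, instead of the full array. */
void scale_rows(int *data, size_t row_len, size_t row_start, size_t row_count)
{
    size_t off = row_start * row_len;  /* first element to transfer */
    size_t cnt = row_count * row_len;  /* number of elements to transfer */
    #pragma omp target map(tofrom: data[off:cnt])
    #pragma omp parallel for
    for (size_t i = off; i < off + cnt; ++i)
        data[i] *= 2;
}

The 'map' argument would let the developer express exactly such array
sections, while the blocking loop around the offload region stays under
their control.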
Other things to look at include:
- non-contiguous arrays/memviews
- overlapping arrays/memviews
- keeping data on the device between 'with device()' blocks (USM (unified
shared memory) or 'omp target data'? see the sketch after this list)
- error handling
- tests
- documentation/comments
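Regarding the data-residency item above, here is a hand-written plain-C
sketch (again not generated by my branch) of what the 'omp target data'
route would mean: the buffer is transferred to the device once, stays
resident across two target regions, and is copied back once at the end.

#include <stddef.h>

/* Hand-written illustration of 'omp target data': the buffer stays on the
   device across both target regions, so it is transferred only once in
   each direction instead of once per region. */
void two_kernels(int *data, size_t n)
{
    #pragma omp target data map(tofrom: data[0:n])
    {
        #pragma omp target
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            data[i] += 1;

        #pragma omp target
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            data[i] *= 2;
    }
}

Whether something like this should surface in Cython as an explicit data
directive or be left to USM is exactly the open question.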
I found that the functionality I needed to touch is somewhat scattered around
the compiler pipeline. It might be worth thinking about restructuring a few
things to make the whole OpenMP/parallel/offload stuff more maintainable. Of
course you might see other solutions than mine which make this simpler.
Any thoughts/feedback/use cases are appreciated.
frank
-----Original Message-----
From: cython-devel
On Behalf Of Stefan Behnel
Sent: Friday, January 24, 2020 11:22 AM
To: cython-devel@python.org
Subject: Re: [Cython] Adding GPU support to cython
Hi Frank,
sorry for leaving this unanswered for a while. I'm far from an expert in this,
but it looks like no one else jumped in, so here's my response.
Schlimbach, Frank wrote on 06.01.20 at 12:09:
> I would like to work on extending cython with a way to offload cython code to
> a GPU. I found to related CEPs
> (https://github.com/cython/cython/wiki/enhancements-opencl and
> https://github.com/cython/cython/wiki/enchancements-metadefintions).
So, just for a bit of context, these CEPs were written a decade ago, and
Cython's code base, feature set, and the general Python ecosystem have evolved
a lot since then. For example, "cython.parallel" and "prange()"
weren't even implemented back then.
> My current thinking is that a solution along the OpenCL CEP is most