Hi,
I opened a feature ticket: https://github.com/cython/cython/issues/3342
It describes my current prototype based on OpenMP.

Any feedback?

Also, I would like to do some more advanced analysis to improve the
map-clauses. I do not want to go as far as a complex index analysis or the
like, but a simple access analysis should cover many cases. All I would like
to figure out is whether a given variable (memview) was used (other than
instantiated) before and/or after the device()/parallel() block, and ideally
whether a use was definitely read-only. Any suggestion/hint on how to do that?
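
To make it concrete, the decision step that such an analysis would feed looks
roughly like this in my head (a hand-written sketch, not compiler code; the
helper name and flags are made up):

/* Sketch: turn the result of a simple per-variable access analysis into an
 * OpenMP map kind. The three flags would come from scanning the code
 * before, inside and after the 'with device()' block. */
typedef enum { MAP_TO, MAP_FROM, MAP_TOFROM, MAP_ALLOC } map_kind;

static map_kind choose_map_kind(int written_in_block,
                                int assigned_before_block,
                                int used_after_block)
{
    if (!written_in_block)
        return MAP_TO;      /* only read on the device */
    if (!assigned_before_block && !used_after_block)
        return MAP_ALLOC;   /* device-only scratch data */
    if (!assigned_before_block)
        return MAP_FROM;    /* produced on the device, consumed afterwards */
    return MAP_TOFROM;      /* conservative default */
}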

Thanks

frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel....@python.org> 
On Behalf Of Schlimbach, Frank
Sent: Friday, January 24, 2020 12:55 PM
To: Core developer mailing list of the Cython compiler <cython-devel@python.org>
Subject: Re: [Cython] Adding GPU support to cython

Hi Stefan,
thanks for your response. Good to hear this is still of interest.

Yes, I realized these are rather old CEPs. I spent some time looking into the
Cython code and concluded that it'd be the most consistent (and simplest)
approach to stick with OpenMP and use its offload pragmas (e.g. 'target',
introduced in 4.5). Given a properly set up compiler, this would in theory only
require one or two compiler flags to enable offloading. I even have a first
prototype which generates code that existing compilers seem to swallow. It's
not ready for a PR since I have not been able to get it linked and run on a GPU,
and I wanted to get some general feedback first. You can find the code on my
offload branch https://github.com/fschlimb/cython/tree/offload (it's WIP, so
please excuse that not all comments have been updated yet to reflect my
changes).

Here's what it does:
- accept a new 'with' directive 'device' which marks a region/block to be 
offloaded to a device (OpenMP target)
  - I also considered extending 'gil' or 'parallel' to accept an optional
'device' argument, but an extra directive seemed more general/flexible since it
also allows non-parallel code
  - I don't believe we should try to automate offloading right now. Once we
have something that works on explicit demand, we can still think about a
performance model and auto-enabling offloading.
- the DeviceWithBlockNode is added to the 'parallel stack' and can occur only
as the outermost parallel directive
- a 'with device()' requires 'nogil'
- a 'with device()' will create a new scope annotated with a '#pragma omp 
target'
  - all variables which get assigned within the 'with device()' block are 
currently mapped as 'tofrom'
  - all other variables used are mapped as 'to'
  - identifying 'from' candidates is harder and not yet done (we need to know
that the variable is allocated but not assigned before the 'with device()' block)
  - identifying 'alloc' candidates would also need additional analysis (e.g. 
not used outside the 'device()' block)
- all object mode stuff (like exceptions for error handling) is currently
disabled in a 'with device()' block

Example:

def f(int[:,::1] X):
    cdef int v = 1
    cdef int i
    with nogil, device(), parallel():
        for i in prange(4):
            X[i] = v

the 'with device' block becomes something like (simplified)

{
    size_t __pyx_v_X__count = __pyx_v_X.shape[0]*__pyx_v_X.shape[1];
    #pragma omp target map(to: __pyx_v_v) map(tofrom: __pyx_v_i , 
__pyx_v_X.data[0:__pyx_v_X__count], __pyx_v_X.memview, __pyx_v_X.shape, 
__pyx_v_X.strides, __pyx_v_X.suboffsets)
    {
        #pragma omp parallel
        #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i)
        for (__pyx_v_i = 0; __pyx_v_i < 4; ++__pyx_v_i) {
             __pyx_v_X[__pyx_v_i] = __pyx_v_v;
        }
    }
}

There are lots of things to be added and improved. In particular, I am currently
adding an optional argument 'map' to 'device()' which allows manually setting
the map-clauses for each variable. This is necessary not only for optimizations
but also for sending only partial array data to/from the device (e.g. when the
device memory cannot hold an entire array, the developer would block the
computation). We can probably add some magic for simple cases, but there is
probably no solution to the general problem of determining the accessed
index-space.
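
To illustrate what such a manual map-clause is meant to enable, here is a
hand-written OpenMP sketch (not Cython output; function and parameter names are
made up) of blocking a computation so that only one chunk at a time is mapped
to the device:

#include <stddef.h>

/* Process a large array in device-sized chunks; the array section in the
 * map clause transfers only the current chunk to/from the device. */
void scale_in_chunks(double *data, size_t n, size_t chunk)
{
    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (start + chunk < n) ? chunk : n - start;
        #pragma omp target map(tofrom: data[start:len])
        #pragma omp parallel for
        for (size_t i = start; i < start + len; ++i)
            data[i] *= 2.0;
    }
}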

Among others, things to also look at include
- non-contiguous arrays/memviews
- overlapping arrays/memviews
- keeping data on the device between 'with device()' blocks (USM (unified
shared memory) or 'omp target data'? see the sketch after this list)
- error handling
- tests
- documentation/comments
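
Regarding 'omp target data': a rough hand-written illustration (not Cython
output; names are made up) of keeping an array resident on the device across
what would otherwise be two separate 'with device()' blocks:

#include <stddef.h>

/* The enclosing data region maps 'data' once, so the two target regions
 * inside it reuse the device copy instead of transferring it twice. */
void two_kernels(double *data, size_t n)
{
    #pragma omp target data map(tofrom: data[0:n])
    {
        #pragma omp target
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            data[i] += 1.0;

        #pragma omp target
        #pragma omp parallel for
        for (size_t i = 0; i < n; ++i)
            data[i] *= 2.0;
    }
}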

I found that the functionality I needed to touch is somewhat scattered around
the compiler pipeline. It might be worth thinking about restructuring a few
things to make the whole OpenMP/parallel/offload machinery more maintainable.
Of course, you might see solutions other than mine which make this simpler.

Any thoughts/feedback/use cases are appreciated.

frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel....@python.org> 
On Behalf Of Stefan Behnel
Sent: Friday, January 24, 2020 11:22 AM
To: cython-devel@python.org
Subject: Re: [Cython] Adding GPU support to cython

Hi Frank,

sorry for leaving this unanswered for a while. I'm far from an expert in this,
but it looks like no one else jumped in, so here's my response.


Schlimbach, Frank wrote on 06.01.20 at 12:09:
> I would like to work on extending cython with a way to offload cython code to 
> a GPU. I found two related CEPs 
> (https://github.com/cython/cython/wiki/enhancements-opencl and 
> https://github.com/cython/cython/wiki/enchancements-metadefintions).

So, just for a bit of context, these CEPs were written a decade ago, and 
Cython's code base, feature set, and the general Python ecosystem have evolved 
a lot since then. For example, "cython.parallel" and "prange()"
weren't even implemented back then.


> My current thinking is that a solution along the OpenCL CEP is most 
> effective; while it does require many code changes, it seems to provide a good 
> tradeoff between usability and efficiency.
> 
> I would like to suggest a few modifications to this approach, like
> 
>   *   using SYCL instead of OpenCL to closely follow existing parallel/prange 
> semantics more easily
>   *   selecting the device (CPU, GPU) per region rather than per file
>   *   maybe allowing calling appropriately annotated and written external 
> functions
> 
> I would be very grateful for any thoughts about this topic in general and for 
> any advice on how to approach this so that a solution is found that is most 
> broadly useful and most cythonic.

It would definitely be cool to generate GPU support from the existing Cython 
patterns, in addition to the OpenMP code that we already generate.
If that can be done, then users could enable GPU support by adding a C compiler 
define to their CFLAGS (rather than rerunning Cython), or even select between 
the two versions at runtime.
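
For example (just a sketch, with a made-up macro name), the generated C could
contain both variants and let a define from CFLAGS pick the offloaded one:

#include <stddef.h>

/* Hypothetical: the same loop either as host OpenMP or as an offloaded
 * target region, selected at C compile time via -DCYTHON_OFFLOAD. */
void add_one(double *x, size_t n)
{
    size_t i;
#ifdef CYTHON_OFFLOAD
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
#else
    #pragma omp parallel for
#endif
    for (i = 0; i < n; ++i)
        x[i] += 1.0;
}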

If the GPU support is per region, then how is the code section shipped to the 
GPU? Is the infrastructure for this provided by the OpenCL framework or does 
the user or the module need to set something up in addition?

Finally, generally speaking:
- PR welcome
- simple approach preferred (at least to get this started and prototyped)
- discussion welcome on this mailing list
- a GitHub feature ticket seems to be missing; it should link to the ML thread 
https://mail.python.org/pipermail/cython-devel/2020-January/005262.html

Stefan
_______________________________________________
cython-devel mailing list
cython-devel@python.org
https://mail.python.org/mailman/listinfo/cython-devel
