Hi,

I opened a feature ticket: https://github.com/cython/cython/issues/3342
It describes my current prototype, which is based on OpenMP. Any feedback?

I would also like to do some more advanced analysis to improve the map
clauses. I do not want to go as far as a complex index analysis or the
like, but a simple access analysis should already cover many cases. All I
would like to figure out is whether a given variable (memoryview) is used
(other than instantiated) before and/or after the 'with device()' block,
and ideally whether every such use is definitely read-only. Any
suggestion/hint on how to do that?
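To make this concrete, here is roughly the kind of pass I have in mind, as
a toy sketch in plain Python (illustrative names only, not the actual
Cython compiler API; in the compiler this would presumably be a tree
visitor collecting NameNode references):

    class Ref:
        """A variable reference; is_write marks stores (assignment targets)."""
        def __init__(self, name, is_write=False):
            self.name = name
            self.is_write = is_write

    DEVICE_BLOCK = object()  # stand-in marker for the 'with device()' block

    def classify_accesses(flat_body):
        """flat_body: the function body flattened to a pre-order sequence
        of Ref nodes, with the device block replaced by DEVICE_BLOCK.
        Returns, per variable name, how it is used around the block."""
        info = {}
        after = False
        for node in flat_body:
            if node is DEVICE_BLOCK:
                after = True
                continue
            d = info.setdefault(node.name, {
                'used_before': False, 'readonly_before': True,
                'used_after': False, 'readonly_after': True})
            where = 'after' if after else 'before'
            d['used_' + where] = True
            if node.is_write:
                d['readonly_' + where] = False
        return info

    # X is only read before the block and not touched after it, so it is
    # a 'to' candidate; no 'from' mapping would be needed unless it were
    # assigned inside the block and used afterwards.
    acc = classify_accesses([Ref('X'), DEVICE_BLOCK])
    assert acc['X'] == {'used_before': True, 'readonly_before': True,
                        'used_after': False, 'readonly_after': True}

A real pass would of course have to respect control flow (a loop or branch
can turn a textually later use into a dynamically earlier one), so I would
keep the classification conservative.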
Thanks,
frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel....@python.org> On Behalf Of Schlimbach, Frank
Sent: Friday, January 24, 2020 12:55 PM
To: Core developer mailing list of the Cython compiler <cython-devel@python.org>
Subject: Re: [Cython] Adding GPU support to cython

Hi Stefan,

thanks for your response. Good to hear this is still of interest.

Yes, I realized these are rather old CEPs. I spent some time looking into
the Cython code and concluded that the most consistent (and simplest)
approach is to stick with OpenMP and use its offload pragmas (e.g.
'target'). Given a properly set-up compiler, this would in theory only
require one or two compiler flags to enable offloading (see the build
sketch below).

I even have a first prototype which generates code that existing compilers
seem to swallow. It is not ready for a PR, since I have not been able to
get it linked and run on a GPU, and I wanted to get some general feedback
first. You can find the code on my offload branch:
https://github.com/fschlimb/cython/tree/offload (it's WIP, so please
excuse that not all comments have been updated yet to reflect my changes).
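For illustration, with a clang-based toolchain the build could then look
something like this (hypothetical module/file names; the exact flags
differ per compiler and target, e.g. gcc and icx spell them differently):

    # setup.py -- hypothetical example; the flags shown are clang's
    # spelling for OpenMP offload to an NVPTX (NVIDIA GPU) target.
    from setuptools import setup, Extension
    from Cython.Build import cythonize

    offload_flags = ["-fopenmp", "-fopenmp-targets=nvptx64-nvidia-cuda"]

    ext = Extension(
        "offload_demo",
        ["offload_demo.pyx"],
        extra_compile_args=offload_flags,
        extra_link_args=offload_flags,
    )

    setup(ext_modules=cythonize([ext]))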
Here's what it does:

- accept a new 'with' directive, 'device', which marks a region/block to
  be offloaded to a device (an OpenMP target)
- I also considered extending 'gil' or 'parallel' to accept an optional
  'device' argument, but an extra directive seemed more general/flexible,
  since it also allows non-parallel code
- I don't believe we should try to automate offloading right now; once we
  have something that works on explicit demand, we can still think about
  a performance model and auto-enabled offloading
- the DeviceWithBlockNode is added to the 'parallel stack' and can occur
  only as the outermost parallel directive
- a 'with device()' requires 'nogil'
- a 'with device()' creates a new scope annotated with '#pragma omp target'
- all variables which get assigned within the 'with device()' block are
  currently mapped as 'tofrom'; all other variables used are mapped as 'to'
- identifying 'from' candidates is harder and not yet done (we need to
  know that there is a required allocation but no assignment before the
  'with device()' block)
- identifying 'alloc' candidates would also need additional analysis
  (e.g. that the variable is not used outside the 'device()' block)
- all object-mode functionality (like exceptions for error handling) is
  currently disabled in a 'with device()' block

Example:

    def f(int[:,::1] X):
        cdef int v = 1
        cdef int i
        with nogil, device(), parallel():
            for i in prange(4):
                X[i] = v

The 'with device' block becomes something like (simplified):

    {
        size_t __pyx_v_X__count = __pyx_v_X.shape[0] * __pyx_v_X.shape[1];
        #pragma omp target map(to: __pyx_v_v) \
                           map(tofrom: __pyx_v_i, \
                               __pyx_v_X.data[0:__pyx_v_X__count], \
                               __pyx_v_X.memview, __pyx_v_X.shape, \
                               __pyx_v_X.strides, __pyx_v_X.suboffsets)
        {
            #pragma omp parallel
            #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i)
            for (__pyx_v_i = 0; __pyx_v_i < 4; ++__pyx_v_i) {
                __pyx_v_X[__pyx_v_i] = __pyx_v_v;
            }
        }
    }

There are lots of things to be added and improved. In particular, I am
currently adding an optional argument 'map' to 'device()' which allows
manually setting the map clause for each variable. This is necessary not
only for optimizations, but also for sending only partial array data
to/from the device (e.g. when the device memory cannot hold an entire
array, the developer would block the computation). We can probably add
some magic for simple cases, but there is probably no solution to the
general problem of determining the accessed index space. A sketch of what
this could look like follows below.
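For instance (hypothetical spelling -- the exact syntax of the 'map'
argument is not settled yet, and 'device' exists only on the prototype
branch, not in released Cython):

    from cython.parallel import parallel, prange

    def scale_head(double[:, ::1] X, double f, Py_ssize_t n):
        # Only the first n rows of X travel to the device and back;
        # f is sent one-way.  Everything else would keep the default
        # mapping rules.
        cdef Py_ssize_t i
        with nogil, device(map="to: f; tofrom: X[0:n, :]"), parallel():
            for i in prange(n):
                X[i, 0] = X[i, 0] * f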
Among others, things to also look at include:

- non-contiguous arrays/memviews
- overlapping arrays/memviews
- keeping data on the device between 'with device()' blocks (USM (unified
  shared memory) or 'omp target data'?)
- error handling
- tests
- docs/comments

I found that the functionality I needed to touch is somewhat scattered
around the compiler pipeline. It might be worth thinking about
restructuring a few things to make the whole OpenMP/parallel/offload
functionality more maintainable. Of course, you might see other solutions
than mine which make this simpler.

Any thoughts/feedback/use cases appreciated.

frank

-----Original Message-----
From: cython-devel <cython-devel-bounces+frank.schlimbach=intel....@python.org> On Behalf Of Stefan Behnel
Sent: Friday, January 24, 2020 11:22 AM
To: cython-devel@python.org
Subject: Re: [Cython] Adding GPU support to cython

Hi Frank,

sorry for leaving this unanswered for a while. I'm far from an expert in
this, but it looks like no-one else jumped in, so here's my response.

Schlimbach, Frank wrote on 06.01.20 at 12:09:
> I would like to work on extending Cython with a way to offload Cython
> code to a GPU. I found two related CEPs
> (https://github.com/cython/cython/wiki/enhancements-opencl and
> https://github.com/cython/cython/wiki/enchancements-metadefintions).

So, just for a bit of context: these CEPs were written a decade ago, and
Cython's code base, feature set, and the general Python ecosystem have
evolved a lot since then. For example, "cython.parallel" and "prange()"
weren't even implemented back then.

> My current thinking is that a solution along the lines of the OpenCL
> CEP is most effective; it does require many code changes, but seems to
> provide a good trade-off between usability and efficiency.
>
> I would like to suggest a few modifications to this approach, like:
>
> * using SYCL instead of OpenCL, to follow the existing parallel/prange
>   semantics more closely and easily
> * selecting the device (CPU, GPU) per region rather than per file
> * maybe allowing calls to appropriately annotated and written external
>   functions
>
> I would be very grateful for any thoughts about this topic in general,
> and for any advice on how to approach this so that we find a solution
> that is most broadly useful and most cythonic.

It would definitely be cool to generate GPU support from the existing
Cython patterns, in addition to the OpenMP code that we already generate.
If that can be done, then users could enable GPU support by adding a C
compiler define to their CFLAGS (rather than rerunning Cython), or even
select between the two versions at runtime.

If the GPU support is per region, how is the code section shipped to the
GPU? Is the infrastructure for this provided by the OpenCL framework, or
do the user or the module need to set something up in addition?

Finally, generally speaking:

- PR welcome
- simple approach preferred (at least to get this started and prototyped)
- discussion welcome on this mailing list
- a GitHub feature ticket seems to be missing, with a link to the ML
  thread: https://mail.python.org/pipermail/cython-devel/2020-January/005262.html

Stefan

_______________________________________________
cython-devel mailing list
cython-devel@python.org
https://mail.python.org/mailman/listinfo/cython-devel