Re: [gomp4] Preserve NVPTX reconvergence points
On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> If the loop remains in the IL (isn't optimized away as unreachable or isn't
> removed, e.g. as a non-loop - say if it contains a noreturn call), the
> flags on struct loop should be still there. For the loop clauses (reduction
> always, and private/lastprivate if addressable etc.) for OpenMP simd /
> Cilk+ simd we use special arrays indexed by internal functions, which then
> during vectorization are shrunk (but in theory could be expanded too) to
> the right vectorization factor if vectorized, of course accesses within the
> loop vectorized using SIMD, and if not vectorized, shrunk to 1 element.

I'd appreciate if you could describe that mechanism in more detail. As far
as I can tell it is very poorly commented and documented in the code. I
mean, it doesn't even follow the minimal coding standards of describing
function inputs:

  /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
     privatization.  */

  static bool
  lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int max_vf,
                                tree idx, tree lane, tree ivar, tree lvar)

Bernd
Re: [gomp4] Preserve NVPTX reconvergence points
On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> > If the loop remains in the IL (isn't optimized away as unreachable or
> > isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
> > the flags on struct loop should be still there. For the loop clauses
> > (reduction always, and private/lastprivate if addressable etc.) for
> > OpenMP simd / Cilk+ simd we use special arrays indexed by internal
> > functions, which then during vectorization are shrunk (but in theory
> > could be expanded too) to the right vectorization factor if vectorized,
> > of course accesses within the loop vectorized using SIMD, and if not
> > vectorized, shrunk to 1 element.
>
> I'd appreciate if you could describe that mechanism in more detail. As far
> as I can tell it is very poorly commented and documented in the code. I
> mean, it doesn't even follow the minimal coding standards of describing
> function inputs:
>
>   /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
>      privatization.  */
>
>   static bool
>   lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int max_vf,
>                                 tree idx, tree lane, tree ivar, tree lvar)

Here is the theory behind it:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html
In the end it is using internal functions instead of uglified builtins.
I'd suggest you look at some of the libgomp.c/simd*.c tests, say with
-O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized} to see how it is
lowered and expanded. I assume #pragma omp simd roughly corresponds to
#pragma acc loop vector; maxvf for PTX vectorization is supposedly 32 (warp
size). For SIMD vectorization, if the vectorization fails, the arrays are
shrunk to 1 element, otherwise they are shrunk to the vectorization factor,
and in later optimizations, if they aren't really addressable, optimized
using FRE and other memory optimizations so that they don't touch memory
unless really needed.

For the PTX style vectorization (parallelization between threads in a
warp), I'd say you would always shrink to 1 element again, but such
variables would be local to each of the threads in the warp (or another
possibility is shared arrays of size 32 indexed by %tid.x & 31), while
addressable variables without such magic type would be shared among all
threads; non-addressable variables (SSA_NAMEs) depending on where they are
used. You'd need to transform reductions (which are right now represented
as another loop, from 0 to an internal function, so easily recognizable)
into the PTX reductions. Also, lastprivate is now an access to the array
using the last lane internal function, dunno what that corresponds to in
PTX (perhaps also a reduction where all but the thread executing the last
iteration say or in 0 and the remaining thread ors in the lastprivate
value).

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On 06/22/15 11:18, Bernd Schmidt wrote:
> You can have a hint that it is desirable, but not a hint that it is
> correct (because passes in between may invalidate that). The OpenACC
> directives guarantee to the compiler that the program can be transformed
> into a parallel form. If we lose them early we must then rely on our
> analysis which may not be strong enough to prove that the loop can be
> parallelized. If we make these transformations early enough, while we
> still have the OpenACC directives, we can guarantee that we do exactly
> what the programmer specified.

How does this differ from openmp's needs to preserve parallelism on a
parallel loop? Is it more than the reconvergence issue?

nathan

--
Nathan Sidwell
Re: [gomp4] Preserve NVPTX reconvergence points
On 06/22/15 12:20, Jakub Jelinek wrote:
> OpenMP worksharing loop is just coordination between the threads in the
> team, which thread takes which subset of the loop's iterations, and
> optionally followed by a barrier. OpenMP simd loop is a loop that has
> certain properties guaranteed by the user and can be vectorized. In
> contrast to this, OpenACC spawns all the threads/CTAs upfront, and then
> idles on some of them until there is work for them.

correct. I expressed my question poorly. What I mean is that in openmp, a
loop that is parallelizable (by user decree, I guess[*]), should not be
transformed such that it is not parallelizable. This seems to me to be a
common requirement of both languages. How one gets parallel threads of
execution to the body of the loop is a different question.

nathan

[*] For ones where the compiler needs to detect parallelizability, it's
preferable that it doesn't do something earlier to force serializability.

--
Nathan Sidwell
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered so
> > far) we're somewhat constrained in how much control we have over how the
> > underlying hardware executes code: it's possible to draw up a scheme
> > where OpenACC source-level control-flow semantics are reflected directly
> > in the PTX assembly output (e.g. to say all threads in a CTA/warp will
> > be coherent after such-and-such a loop), and lowering OpenACC directives
> > quite early seems to make that relatively tractable. (Even if the
> > resulting code is relatively un-optimisable due to the abnormal edges
> > inserted to make sure that the CFG doesn't become ill-formed.)
> >
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half the
> > threads in a warp becomes physically separated from the code executed
> > by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps --
> > e.g. CTA synchronisation) become impossible. A similar issue exists for
> > warps within a CTA.
> >
> > So, essentially -- I don't know how late loop lowering would interact
> > with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes (actually
> > all currently-proposed schemes have problems with proper representation
> > of data-dependencies for variables and compiler-generated temporaries
> > between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function in e.g.
> worker-single or vector-single mode, which you need to be able to handle
> properly in any case, because users can write such routines themselves.
> And then you can have a loop in such a function that has some special
> attribute, a hint that it is desirable to vectorize it (for PTX the PTX
> way) or use vector-single mode for it in a worker-single function. So,
> the special pass then of course needs to handle all the needed
> broadcasting and reduction required to change the mode from e.g.
> worker-single to vector-single, but the convergence points still would be
> either on the boundary of such loops to be vectorized or parallelized, or
> wherever else they appear in normal vector-single or worker-single
> functions (around the calls to certainly calls?).

I think most of my concerns are centred around loops (with the markings you
suggest) that might be split into parts: if that cannot happen for loops
that are annotated as you describe, maybe things will work out OK.
(Apologies for my ignorance here, this isn't a part of the compiler that I
know anything about.)

Julian
Re: [gomp4] Preserve NVPTX reconvergence points
On 06/22/2015 04:24 PM, Jakub Jelinek wrote:
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function in e.g.
> worker-single or vector-single mode, which you need to be able to handle
> properly in any case, because users can write such routines themselves.
> And then you can have a loop in such a function that has some special
> attribute, a hint that it is desirable to vectorize it (for PTX the PTX
> way) or use vector-single mode for it in a worker-single function.

You can have a hint that it is desirable, but not a hint that it is correct
(because passes in between may invalidate that). The OpenACC directives
guarantee to the compiler that the program can be transformed into a
parallel form. If we lose them early we must then rely on our analysis
which may not be strong enough to prove that the loop can be parallelized.
If we make these transformations early enough, while we still have the
OpenACC directives, we can guarantee that we do exactly what the programmer
specified.

Bernd
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, Jun 22, 2015 at 12:08:36PM -0400, Nathan Sidwell wrote:
> On 06/22/15 11:18, Bernd Schmidt wrote:
> > You can have a hint that it is desirable, but not a hint that it is
> > correct (because passes in between may invalidate that). The OpenACC
> > directives guarantee to the compiler that the program can be transformed
> > into a parallel form. If we lose them early we must then rely on our
> > analysis which may not be strong enough to prove that the loop can be
> > parallelized. If we make these transformations early enough, while we
> > still have the OpenACC directives, we can guarantee that we do exactly
> > what the programmer specified.
>
> How does this differ from openmp's needs to preserve parallelism on a
> parallel loop? Is it more than the reconvergence issue?

OpenMP has a significantly different execution model: a parallel block in
OpenMP is run by a certain number of threads (the initial thread (the one
encountering that region) and then depending on clauses and library
decisions perhaps others), with a barrier at the end of the region, and
afterwards only the initial thread continues again. So, an OpenMP parallel
is implemented as a library call, taking an outlined function from the
parallel's body as one of its arguments, and the body is executed by the
initial thread and perhaps others. OpenMP worksharing loop is just
coordination between the threads in the team, which thread takes which
subset of the loop's iterations, and optionally followed by a barrier.
OpenMP simd loop is a loop that has certain properties guaranteed by the
user and can be vectorized. In contrast to this, OpenACC spawns all the
threads/CTAs upfront, and then idles on some of them until there is work
for them.

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered so
> > far) we're somewhat constrained in how much control we have over how the
> > underlying hardware executes code: it's possible to draw up a scheme
> > where OpenACC source-level control-flow semantics are reflected directly
> > in the PTX assembly output (e.g. to say all threads in a CTA/warp will
> > be coherent after such-and-such a loop), and lowering OpenACC directives
> > quite early seems to make that relatively tractable. (Even if the
> > resulting code is relatively un-optimisable due to the abnormal edges
> > inserted to make sure that the CFG doesn't become ill-formed.)
> >
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained. Say if the code executed by half the
> > threads in a warp becomes physically separated from the code executed
> > by the other half of the threads in a warp due to some loop
> > optimisation, we can no longer easily determine where that warp will
> > reconverge, and certain other operations (relying on coherent warps --
> > e.g. CTA synchronisation) become impossible. A similar issue exists for
> > warps within a CTA.
> >
> > So, essentially -- I don't know how late loop lowering would interact
> > with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes (actually
> > all currently-proposed schemes have problems with proper representation
> > of data-dependencies for variables and compiler-generated temporaries
> > between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at all.
> In the proposed scheme, you essentially have whole function in e.g.
> worker-single or vector-single mode, which you need to be able to handle
> properly in any case, because users can write such routines themselves.

In vector-single or worker-single mode, divergence of threads within a warp
or a CTA is controlled by broadcasting the controlling expression of
conditional branches to the set of inactive threads, so each of those
follows along with the active thread. So you only get
potentially-problematic thread divergence when workers or vectors are
operating in partitioned mode. So, for instance, a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
    {
      #pragma acc loop worker
      for (j = 0; j < M; j++)
        {
          if (j < M / 2)
            /* stmt 1 */
          else
            /* stmt 2 */
        }
      /* reconvergence point: thread barrier */
      [...]
    }
}

Here stmt 1 and stmt 2 execute in worker-partitioned, vector-single mode.
With early lowering, the reconvergence point can be inserted at the end of
the loop, and abnormal edges (etc.) can be used to ensure that the CFG does
not get changed in such a way that there is no longer a unique point at
which the loop threads reconverge. With late lowering, it's no longer
obvious to me if that can still be done.

Julian
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, Jun 22, 2015 at 06:48:10PM +0100, Julian Brown wrote:
> In vector-single or worker-single mode, divergence of threads within a
> warp or a CTA is controlled by broadcasting the controlling expression of
> conditional branches to the set of inactive threads, so each of those
> follows along with the active thread. So you only get
> potentially-problematic thread divergence when workers or vectors are
> operating in partitioned mode. So, for instance, a made-up example:
>
> #pragma acc parallel
> {
>   #pragma acc loop gang
>   for (i = 0; i < N; i++)
>     {
>       #pragma acc loop worker
>       for (j = 0; j < M; j++)
>         {
>           if (j < M / 2)
>             /* stmt 1 */
>           else
>             /* stmt 2 */
>         }
>       /* reconvergence point: thread barrier */
>       [...]
>     }
> }
>
> Here stmt 1 and stmt 2 execute in worker-partitioned, vector-single mode.
> With early lowering, the reconvergence point can be inserted at the end of
> the loop, and abnormal edges (etc.) can be used to ensure that the CFG
> does not get changed in such a way that there is no longer a unique point
> at which the loop threads reconverge. With late lowering, it's no longer
> obvious to me if that can still be done.

Why? The loop still has an exit edge (if there is no break/return/throw out
of the loop, which I bet is not allowed), so you just insert the
reconvergence point at the exit edge from the loop. For the late lowering,
I said it is up for benchmarking/investigation where it would be best
placed, it doesn't have to be after the loop passes, there are plenty of
optimization passes even before those. But once you turn many of the
SSA_NAMEs in a function into (ab) ssa vars, many optimizations just give
up. And, if you really want to avoid certain loop optimizations, you have
always the possibility to e.g. wrap certain statement in the loop in an
internal function (e.g. the loop condition) or something similar to make
the passes more careful about those loops and make it easier to lower it
later.

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> I actually believe having some optimization passes in between the ompexp
> and the lowering of the IR into the form PTX wants is highly desirable,
> the form with the worker-single or vector-single mode lowered will contain
> too complex CFG for many optimizations to be really effective, especially
> if it uses abnormal edges. E.g. inlining supposedly would have harder job
> etc. What exact unpredictable effects do you fear?

Mostly the ones I can't predict. But let's take one example, LICM: let's
say you pull some assignment out of a loop, then you find yourself in one
of two possible situations: either it's become not actually available
inside the loop (because the data and control flow is not described
correctly and the compiler doesn't know what's going on), or, to avoid
that, you introduce additional broadcasting operations when entering the
loop, which might be quite expensive.

Bernd
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, Jun 22, 2015 at 03:59:57PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> > I actually believe having some optimization passes in between the ompexp
> > and the lowering of the IR into the form PTX wants is highly desirable,
> > the form with the worker-single or vector-single mode lowered will
> > contain too complex CFG for many optimizations to be really effective,
> > especially if it uses abnormal edges. E.g. inlining supposedly would
> > have harder job etc. What exact unpredictable effects do you fear?
>
> Mostly the ones I can't predict. But let's take one example, LICM: let's
> say you pull some assignment out of a loop, then you find yourself in one
> of two possible situations: either it's become not actually available
> inside the loop (because the data and control flow is not described
> correctly and the compiler doesn't know what's going on), or, to avoid
> that, you introduce

Why do you think that would happen? E.g. for non-addressable gimple types
you'd most likely just have a PHI for it on the loop.

> additional broadcasting operations when entering the loop, which might be
> quite expensive.

If the PHI has cheap initialization, there is not a problem to emit it as
initialization in the loop instead of a broadcast (kind of like RA
rematerialization). And by actually adding such an optimization, you help
even code that has computation in vector-single code and uses it in a
vector acc loop.

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> One problem is that (at least on the GPU hardware we've considered so far)
> we're somewhat constrained in how much control we have over how the
> underlying hardware executes code: it's possible to draw up a scheme where
> OpenACC source-level control-flow semantics are reflected directly in the
> PTX assembly output (e.g. to say all threads in a CTA/warp will be
> coherent after such-and-such a loop), and lowering OpenACC directives
> quite early seems to make that relatively tractable. (Even if the
> resulting code is relatively un-optimisable due to the abnormal edges
> inserted to make sure that the CFG doesn't become ill-formed.)
>
> If arbitrary optimisations are done between OMP-lowering time and
> somewhere around vectorisation (say), it's less clear if that
> correspondence can be maintained. Say if the code executed by half the
> threads in a warp becomes physically separated from the code executed by
> the other half of the threads in a warp due to some loop optimisation, we
> can no longer easily determine where that warp will reconverge, and
> certain other operations (relying on coherent warps -- e.g. CTA
> synchronisation) become impossible. A similar issue exists for warps
> within a CTA.
>
> So, essentially -- I don't know how late loop lowering would interact
> with:
>
> (a) Maintaining a CFG that will work with PTX.
>
> (b) Predication for worker-single and/or vector-single modes (actually
> all currently-proposed schemes have problems with proper representation
> of data-dependencies for variables and compiler-generated temporaries
> between predicated regions.)

I don't understand why lowering the way you suggest helps here at all.
In the proposed scheme, you essentially have whole function in e.g.
worker-single or vector-single mode, which you need to be able to handle
properly in any case, because users can write such routines themselves.
And then you can have a loop in such a function that has some special
attribute, a hint that it is desirable to vectorize it (for PTX the PTX
way) or use vector-single mode for it in a worker-single function. So, the
special pass then of course needs to handle all the needed broadcasting and
reduction required to change the mode from e.g. worker-single to
vector-single, but the convergence points still would be either on the
boundary of such loops to be vectorized or parallelized, or wherever else
they appear in normal vector-single or worker-single functions (around the
calls to certainly calls?).

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On Fri, 19 Jun 2015 14:25:57 +0200 Jakub Jelinek <ja...@redhat.com> wrote:
> On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> > On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> > > I understand it is more work, I'd just like to ask that when designing
> > > stuff for the OpenACC offloading you (plural) try to take the other
> > > offloading devices and host fallback into account.
> >
> > The problem is that many of the transformations we need to do are really
> > GPU specific, and with the current structure of omplow/ompexp they are
> > being done in the host compiler. The offloading scheme we decided on
> > does not give us the means to write out multiple versions of an
> > offloaded function where each target gets a different one. For that
> > reason I think we should postpone these lowering decisions until we're
> > in the accel compiler, where they could be controlled by target hooks,
> > and over the last two weeks I've been doing some experiments to see how
> > that could be achieved.
>
> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for the
> OpenACC needs. Have you or Thomas looked what we're doing for OpenMP simd
> / Cilk+ simd? Why can't the execution model (normal, vector-single and
> worker-single) be simply attributes on functions or cgraph node flags and
> the kind of #acc loop simply be flags on struct loop, like already OpenMP
> simd / Cilk+ simd is?

One problem is that (at least on the GPU hardware we've considered so far)
we're somewhat constrained in how much control we have over how the
underlying hardware executes code: it's possible to draw up a scheme where
OpenACC source-level control-flow semantics are reflected directly in the
PTX assembly output (e.g. to say all threads in a CTA/warp will be coherent
after such-and-such a loop), and lowering OpenACC directives quite early
seems to make that relatively tractable. (Even if the resulting code is
relatively un-optimisable due to the abnormal edges inserted to make sure
that the CFG doesn't become ill-formed.)

If arbitrary optimisations are done between OMP-lowering time and somewhere
around vectorisation (say), it's less clear if that correspondence can be
maintained. Say if the code executed by half the threads in a warp becomes
physically separated from the code executed by the other half of the
threads in a warp due to some loop optimisation, we can no longer easily
determine where that warp will reconverge, and certain other operations
(relying on coherent warps -- e.g. CTA synchronisation) become impossible.
A similar issue exists for warps within a CTA.

So, essentially -- I don't know how late loop lowering would interact with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes (actually all
currently-proposed schemes have problems with proper representation of
data-dependencies for variables and compiler-generated temporaries between
predicated regions.)

Julian
Re: [gomp4] Preserve NVPTX reconvergence points
On 06/19/2015 02:25 PM, Jakub Jelinek wrote:
> Emitting PTX specific code from current ompexp is highly undesirable of
> course, but I must say I'm not a big fan of keeping the GOMP_* gimple
> trees around for too long either, they've never meant to be used in low
> gimple, and even all the early optimization passes could screw them up
> badly,

The idea is not to keep them around for very long, but I think there's no
reason why they couldn't survive a while longer. Between ompexpand and the
end of build_ssa_passes, we have (ignoring things like chkp and ubsan which
can just be turned off for offloaded functions if necessary):

  NEXT_PASS (pass_ipa_free_lang_data);
  NEXT_PASS (pass_ipa_function_and_variable_visibility);
  NEXT_PASS (pass_fixup_cfg);
  NEXT_PASS (pass_init_datastructures);
  NEXT_PASS (pass_build_ssa);
  NEXT_PASS (pass_early_warn_uninitialized);
  NEXT_PASS (pass_nothrow);

Nothing in there strikes me as particularly problematic if we can make
things like GIMPLE_OMP_FOR survive into-ssa - which I think I did in my
patch. Besides, the OpenACC kernels path generates them in SSA form anyway
during parloops, so one could make the argument that this is a step towards
better consistency.

> they are also very much OpenMP or OpenACC specific, rather than
> representing language neutral behavior, so there is a problem that you'd
> need M x N different expansions of those constructs, which is not really
> maintainable (M being number of supported offloading standards, right now
> 2, and N number of different offloading devices (host, XeonPhi, PTX, HSA,
> ...)).

Well, that's a problem we have anyway, independent of how we implement all
these devices and standards. I don't see how that's relevant to the
discussion.

> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for the
> OpenACC needs. Have you or Thomas looked what we're doing for OpenMP simd
> / Cilk+ simd? Why can't the execution model (normal, vector-single and
> worker-single) be simply attributes on functions or cgraph node flags and
> the kind of #acc loop simply be flags on struct loop, like already OpenMP
> simd / Cilk+ simd is?

We haven't looked at Cilk+ or anything like that. You suggest using
attributes and flags, but at what point do you intend to actually lower the
IR to actually represent what's going on?

> The vector level parallelism is something where on the
> host/host_noshm/XeonPhi (dunno about HSA) you want vectorization to
> happen, and that is already implemented in the vectorizer pass,
> implementing it again elsewhere is highly undesirable. For PTX the
> implementation is of course different, and the vectorizer is likely not
> the right pass to handle them, but why can't the same struct loop flags be
> used by the pass that handles the conditionalization of execution for the
> 2 of the 3 above modes?

Agreed on wanting the vectorizer to handle things for normal machines, that
is one of the motivations for pushing the lowering past the offload LTO
writeout stage. The problem with OpenACC on GPUs is that the predication
really changes the CFG and the data flow - I fear unpredictable effects if
we let any optimizers run before lowering OpenACC to the point where we
actually represent what's going on in the function.

Bernd
Re: [gomp4] Preserve NVPTX reconvergence points
On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> > they are also very much OpenMP or OpenACC specific, rather than
> > representing language neutral behavior, so there is a problem that you'd
> > need M x N different expansions of those constructs, which is not really
> > maintainable (M being number of supported offloading standards, right
> > now 2, and N number of different offloading devices (host, XeonPhi, PTX,
> > HSA, ...)).
>
> Well, that's a problem we have anyway, independent on how we implement all
> these devices and standards. I don't see how that's relevant to the
> discussion.

It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> > I wonder why struct loop flags and other info together with function
> > attributes and/or cgraph flags and other info aren't sufficient for the
> > OpenACC needs. Have you or Thomas looked what we're doing for OpenMP
> > simd / Cilk+ simd? Why can't the execution model (normal, vector-single
> > and worker-single) be simply attributes on functions or cgraph node
> > flags and the kind of #acc loop simply be flags on struct loop, like
> > already OpenMP simd / Cilk+ simd is?
>
> We haven't looked at Cilk+ or anything like that. You suggest using
> attributes and flags, but at what point do you intend to actually lower
> the IR to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it, some investigation is needed).

> > The vector level parallelism is something where on the
> > host/host_noshm/XeonPhi (dunno about HSA) you want vectorization to
> > happen, and that is already implemented in the vectorizer pass,
> > implementing it again elsewhere is highly undesirable. For PTX the
> > implementation is of course different, and the vectorizer is likely not
> > the right pass to handle them, but why can't the same struct loop flags
> > be used by the pass that handles the conditionalization of execution for
> > the 2 of the 3 above modes?
>
> Agreed on wanting the vectorizer to handle things for normal machines,
> that is one of the motivations for pushing the lowering past the offload
> LTO writeout stage. The problem with OpenACC on GPUs is that the
> predication really changes the CFG and the data flow - I fear
> unpredictable effects if we let any optimizers run before lowering OpenACC
> to the point where we actually represent what's going on in the function.

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable, the
form with the worker-single or vector-single mode lowered will contain too
complex CFG for many optimizations to be really effective, especially if it
uses abnormal edges. E.g. inlining supposedly would have a harder job etc.
What exact unpredictable effects do you fear?

If the loop remains in the IL (isn't optimized away as unreachable or isn't
removed, e.g. as a non-loop - say if it contains a noreturn call), the
flags on struct loop should be still there. For the loop clauses (reduction
always, and private/lastprivate if addressable etc.) for OpenMP simd /
Cilk+ simd we use special arrays indexed by internal functions, which then
during vectorization are shrunk (but in theory could be expanded too) to
the right vectorization factor if vectorized, of course accesses within the
loop vectorized using SIMD, and if not vectorized, shrunk to 1 element. So
the PTX IL lowering pass could use the same arrays (the "omp simd array"
attribute) to transform the decls into thread local vars as opposed to vars
shared by the whole CTA.

Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote: On 05/28/2015 05:08 PM, Jakub Jelinek wrote: I understand it is more work, I'd just like to ask that when designing stuff for the OpenACC offloading you (plural) try to take the other offloading devices and host fallback into account. The problem is that many of the transformations we need to do are really GPU specific, and with the current structure of omplow/ompexp they are being done in the host compiler. The offloading scheme we decided on does not give us the means to write out multiple versions of an offloaded function where each target gets a different one. For that reason I think we should postpone these lowering decisions until we're in the accel compiler, where they could be controlled by target hooks, and over the last two weeks I've been doing some experiments to see how that could be achieved. Emitting PTX specific code from current ompexp is highly undesirable of course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees around for too long either, they've never meant to be used in low gimple, and even all the early optimization passes could screw them up badly, they are also very much OpenMP or OpenACC specific, rather than representing language neutral behavior, so there is a problem that you'd need M x N different expansions of those constructs, which is not really maintainable (M being number of supported offloading standards, right now 2, and N number of different offloading devices (host, XeonPhi, PTX, HSA, ...)). I wonder why struct loop flags and other info together with function attributes and/or cgraph flags and other info aren't sufficient for the OpenACC needs. Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ simd? Why can't the execution model (normal, vector-single and worker-single) be simply attributes on functions or cgraph node flags and the kind of #acc loop simply be flags on struct loop, like already OpenMP simd / Cilk+ simd is? 
I mean, you need to implement the PTX broadcasting etc. for the 3 different modes: one where each thread executes everything; another where only the first thread in a warp executes everything and the other threads only call functions with the same mode, or specially marked loops; and another where only a single thread (in the CTA) executes everything and the other threads only call functions with the same mode or specially marked loops. That is because if you have #acc routine (something) ... that is just an attribute of a function, not really some construct in the body of it. The vector level parallelism is something where on the host/host_noshm/XeonPhi (dunno about HSA) you want vectorization to happen, and that is already implemented in the vectorizer pass; implementing it again elsewhere is highly undesirable. For PTX the implementation is of course different, and the vectorizer is likely not the right pass to handle it, but why can't the same struct loop flags be used by the pass that handles the conditionalization of execution for two of the three above modes? Then there is the worker level parallelism, but I'd hope it can be handled similarly, and supposedly the pass that handles vector-single and worker-single lowering for PTX could do the same for non-PTX targets: if the OpenACC execution model is that all the (e.g. pthread based) threads are started immediately and in worker-single mode you skip the work on all but the first thread, then it needs to behave similarly to PTX, just probably using library calls rather than PTX builtins to query the thread number. Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
On 05/28/2015 05:08 PM, Jakub Jelinek wrote: I understand it is more work, I'd just like to ask that when designing stuff for the OpenACC offloading you (plural) try to take the other offloading devices and host fallback into account. The problem is that many of the transformations we need to do are really GPU specific, and with the current structure of omplow/ompexp they are being done in the host compiler. The offloading scheme we decided on does not give us the means to write out multiple versions of an offloaded function where each target gets a different one. For that reason I think we should postpone these lowering decisions until we're in the accel compiler, where they could be controlled by target hooks, and over the last two weeks I've been doing some experiments to see how that could be achieved. The basic idea is to delay expanding the inner regions of an OpenACC target region during ompexp, write out offload LTO (almost) immediately afterwards, and then have another ompexp phase which runs on the accel compiler to take the offloaded function to its final form. The first attempt really did write LTO immediately after, before moving to SSA phase. It seems that this could be made to work, but the pass manager and LTO code rather expects that what is being read in is in SSA form already. Also, some offloaded code is produced by OpenACC kernels expansion much later in the compilation, so with this approach we have an inconsistency where functions we get back from LTO are at very different levels of lowering. The next attempt was to run the into-ssa passes after ompexpand, and only then write things out. I've changed the gimple representation of some OMP statements (primarily gimple_omp_for) so that they are relatively normal statements with operands that can be transformed into SSA form. 
As far as what's easier to work with - I believe some of the transformations we have to do could benefit from being in SSA, but on the other hand the OpenACC predication code has given me some trouble. I've still not completely convinced myself that the update_ssa call I've added will actually do the right thing after we've mucked up the CFG. I'm appending a proof-of-concept patch. This is intended to show the general outline of what I have in mind, rather than pass the testsuite. It's good enough to compile some of the OpenACC testcases (let's say worker-single-3 if you need one). Let me know what you think. Bernd

Index: gcc/cgraphunit.c
===================================================================
--- gcc/cgraphunit.c	(revision 224547)
+++ gcc/cgraphunit.c	(working copy)
@@ -2171,6 +2171,23 @@ ipa_passes (void)
       execute_ipa_pass_list (passes->all_small_ipa_passes);
       if (seen_error ())
 	return;
+
+      if (g->have_offload)
+	{
+	  extern void write_offload_lto ();
+	  section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX;
+	  write_offload_lto ();
+	}
+    }
+
+  bool do_local_opts = !in_lto_p;
+#ifdef ACCEL_COMPILER
+  do_local_opts = true;
+#endif
+  if (do_local_opts)
+    {
+      execute_ipa_pass_list (passes->all_local_opt_passes);
+      if (seen_error ())
+	return;
     }
 
   /* This extra symtab_remove_unreachable_nodes pass tends to catch some
@@ -2182,7 +2199,7 @@ ipa_passes (void)
   if (symtab->state < IPA_SSA)
     symtab->state = IPA_SSA;
 
-  if (!in_lto_p)
+  if (do_local_opts)
     {
       /* Generate coverage variables and constructors.  */
       coverage_finish ();
@@ -2285,6 +2302,14 @@ symbol_table::compile (void)
   if (seen_error ())
     return;
 
+#ifdef ACCEL_COMPILER
+  {
+    cgraph_node *node;
+    FOR_EACH_DEFINED_FUNCTION (node)
+      node->get_untransformed_body ();
+  }
+#endif
+
 #ifdef ENABLE_CHECKING
   symtab_node::verify_symtab_nodes ();
 #endif
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 224547)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -1171,18 +1171,42 @@ nvptx_section_from_addr_space (addr_spac
     }
 }
 
-/* Determine whether DECL goes into .const or .global.  */
+/* Determine the address space DECL lives in.  */
 
-const char *
-nvptx_section_for_decl (const_tree decl)
+static addr_space_t
+nvptx_addr_space_for_decl (const_tree decl)
 {
+  if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL)
+    return ADDR_SPACE_GENERIC;
+
+  if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE)
+    return ADDR_SPACE_SHARED;
+
   bool is_const = (CONSTANT_CLASS_P (decl)
 		   || TREE_CODE (decl) == CONST_DECL
 		   || TREE_READONLY (decl));
   if (is_const)
-    return ".const";
+    return ADDR_SPACE_CONST;
 
-  return ".global";
+  return ADDR_SPACE_GLOBAL;
+}
+
+/* Return a ptx string representing the address space for a variable DECL.  */
+
+const char *
+nvptx_section_for_decl (const_tree decl)
+{
+  switch (nvptx_addr_space_for_decl (decl))
+    {
+    case ADDR_SPACE_CONST:
+      return ".const";
+
Re: [gomp4] Preserve NVPTX reconvergence points
On Thu, 28 May 2015 16:37:04 +0200 Richard Biener richard.guent...@gmail.com wrote: On Thu, May 28, 2015 at 4:06 PM, Julian Brown jul...@codesourcery.com wrote: For NVPTX, it is vitally important that the divergence of threads within a warp can be controlled: in particular we must be able to generate code that we know reconverges at a particular point. Unfortunately GCC's middle-end optimisers can cause this property to be violated, which causes problems for the OpenACC execution model we're planning to use for NVPTX. Hmm, I don't think adding a new edge flag is good nor necessary. It seems to me that instead the broadcast operation should have abnormal control flow and thus basic-blocks should be split either before or after it (so either incoming or outgoing edge(s) should be abnormal). I suppose splitting before the broadcast would be best (thus handle it similar to setjmp ()). Here's a version of the patch that uses abnormal edges with semantics unchanged, splitting the false/non-execution edge using a dummy block to avoid the prohibited case of both EDGE_TRUE/EDGE_FALSE and EDGE_ABNORMAL on the outgoing edges of a GIMPLE_COND. So for a fragment like this:

if (threadIdx.x == 0)  /* cond_bb */
  {
    /* work */
    p0 = ...;  /* assign */
  }
pN = broadcast (p0);
if (pN) goto T; else goto F;

Incoming edges to a broadcast operation have EDGE_ABNORMAL set:

        +---------+
        | cond_bb |
        +---------+
          |     |
 (true edge)   (false edge)
          v     v
  +--------+   +-------+
  | (work) |   | dummy |
  | assign |   +-------+
  +--------+      |
      | ABNORM    | ABNORM
      v           |
  +-------+       |
  | bcast |<------'
  +-------+
      |
  +-------+
  | cond  |
  +-------+
    /   \
   T     F

The abnormal edges actually serve two purposes, I think: as well as ensuring the broadcast operation takes place when a warp is non-diverged/coherent, they ensure that p0 is not seen as uninitialised along the false path from cond_bb, possibly leading to the broadcast operation being optimised away as partially redundant. This feels somewhat fragile though! We'll have to continue to think about warp divergence in subsequent patches.
The patch passes libgomp testing (with Bernd's recent worker-single patch also). OK for gomp4 branch (together with the previously-mentioned inline thread builtin patch)? Thanks, Julian

ChangeLog
gcc/
* omp-low.c (make_predication_test): Split false block out of cond_bb, making latter edge abnormal.
(predicate_bb): Set EDGE_ABNORMAL on edges before broadcast operations.

commit 38056ae4a29f93ce54715dfad843a233f3b0fd2a
Author: Julian Brown jul...@codesourcery.com
Date: Mon Jun 1 11:12:41 2015 -0700

    Use abnormal edges before broadcast ops

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 7048f9f..310eb72 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10555,7 +10555,16 @@ make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
   gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
 
   true_edge->flags = EDGE_TRUE_VALUE;
-  make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+
+  /* Force an abnormal edge before a broadcast operation that might be present
+     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
+     respect to the predication done by this function) -- the opposite
+     (execution) edge that reaches the broadcast operation must be made
+     abnormal also, e.g. in this function's caller.  */
+  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+  basic_block false_abnorm_bb = split_edge (e);
+  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
+  abnorm_edge->flags |= EDGE_ABNORMAL;
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10605,6 +10614,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 					mask);
       edge e = split_block (bb, splitpoint);
+      e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
       gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
@@ -10624,6 +10634,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
 					gsi_asgn, mask);
       edge e = split_block (bb, splitpoint);
+      e->flags = EDGE_ABNORMAL;
       skip_dest_bb = e->dest;
       gimple_switch_set_index (sstmt, new_var);
Re: [gomp4] Preserve NVPTX reconvergence points
On Thu, May 28, 2015 at 4:06 PM, Julian Brown jul...@codesourcery.com wrote: For NVPTX, it is vitally important that the divergence of threads within a warp can be controlled: in particular we must be able to generate code that we know reconverges at a particular point. Unfortunately GCC's middle-end optimisers can cause this property to be violated, which causes problems for the OpenACC execution model we're planning to use for NVPTX. As a brief example: code running in vector-single mode runs on a single thread of a warp, and must broadcast condition results to other threads of the warp so that they can follow along and be ready for vector-partitioned execution when necessary.

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
    {
      /* This is vector-single mode. */
      n = ...;
      switch (n)
        {
        case 1:
          #pragma acc loop vector
          for (...)
            {
              /* This is vector-partitioned mode. */
            }
          ...
        }
    }
}

Here, the calculation n = ... takes place on a single thread (of each partitioned gang of the outer loop), but the switch statement (terminating the BB) must be executed by all threads in the warp. The vector-single statements will be translated using a branch around for the idle threads:

if (threadIdx.x == 0)
  {
    n_0 = ...;
  }
n_x = broadcast (n_0);
switch (n_x)
  ...

Where broadcast is an operation that transfers values from some other thread of a warp (i.e., the zeroth) to the current thread (implemented as a shfl instruction for NVPTX). I observed a similar example to this cloning the broadcast and switch instructions (in the .dom1 dump), along the lines of:

if (threadIdx.x == 0)
  {
    n_0 = ...;
    n_x = broadcast (n_0);
    switch (n_x)
      ...
  }
else
  {
    n_x = broadcast (n_0);
    switch (n_x)
      ...
  }

This doesn't work because the broadcast operation has to be run with non-diverged warps for correct operation, and here there is divergence due to the if (threadIdx.x == 0) condition.
So, the way I have tried to handle this is by attempting to inhibit optimisation along edges which have a reconvergence point as their destination. The essential idea is to make such edges abnormal, although the existing EDGE_ABNORMAL flag is not used because that has implicit meaning built into it already, and the new edge type may need to be handled differently in some areas. One example is that at present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL set on their EDGE_TRUE or EDGE_FALSE outgoing edges. The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE), for the GIMPLE CFG only. In principle there's nothing to stop the flag being propagated to the RTL CFG also, in which case it'd probably be set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g. EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which inspect the RTL CFG can continue to only check the ABNORMAL flag. But so far (in rather limited testing!), that has not been observed to be necessary. (We can control RTL CFG manipulation indirectly by using the CANNOT_COPY_INSN_P target hook, sensitive e.g. to the broadcast instruction.) For the GIMPLE CFG (i.e. in passes operating on GIMPLE form), EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e., inhibiting certain optimisations), and so has been added to relevant conditionals largely mechanically. Places where it is treated specially are: * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on outgoing edges of a block concluding with a GIMPLE_COND statement. But, we allow EDGE_TO_RECONVERGENCE there. * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass tries to split the edge later). There are probably other optimisations that will be tripped up by the new flag along the same lines as the VRP tweak above, which we will no doubt discover in due course. 
Together with the patch, https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html This shows no regressions for the libgomp tests. OK for gomp4 branch? Hmm, I don't think adding a new edge flag is good nor necessary. It seems to me that instead the broadcast operation should have abnormal control flow and thus basic-blocks should be split either before or after it (so either incoming or outgoing edge(s) should be abnormal). I suppose splitting before the broadcast would be best (thus handle it similar to setjmp ()). Richard. Thanks, Julian ChangeLog gcc/ * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag. (bb_hash_abnorm_or_reconv_pred): New function. (hash_abnormal_or_eh_outgoing_edge_p): Consider EDGE_TO_RECONVERGENCE also. * cfg-flags.def (TO_RECONVERGENCE): Add flag. * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges leading to a reconvergence
[gomp4] Preserve NVPTX reconvergence points
For NVPTX, it is vitally important that the divergence of threads within a warp can be controlled: in particular we must be able to generate code that we know reconverges at a particular point. Unfortunately GCC's middle-end optimisers can cause this property to be violated, which causes problems for the OpenACC execution model we're planning to use for NVPTX. As a brief example: code running in vector-single mode runs on a single thread of a warp, and must broadcast condition results to other threads of the warp so that they can follow along and be ready for vector-partitioned execution when necessary.

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
    {
      /* This is vector-single mode. */
      n = ...;
      switch (n)
        {
        case 1:
          #pragma acc loop vector
          for (...)
            {
              /* This is vector-partitioned mode. */
            }
          ...
        }
    }
}

Here, the calculation n = ... takes place on a single thread (of each partitioned gang of the outer loop), but the switch statement (terminating the BB) must be executed by all threads in the warp. The vector-single statements will be translated using a branch around for the idle threads:

if (threadIdx.x == 0)
  {
    n_0 = ...;
  }
n_x = broadcast (n_0);
switch (n_x)
  ...

Where broadcast is an operation that transfers values from some other thread of a warp (i.e., the zeroth) to the current thread (implemented as a shfl instruction for NVPTX). I observed a similar example to this cloning the broadcast and switch instructions (in the .dom1 dump), along the lines of:

if (threadIdx.x == 0)
  {
    n_0 = ...;
    n_x = broadcast (n_0);
    switch (n_x)
      ...
  }
else
  {
    n_x = broadcast (n_0);
    switch (n_x)
      ...
  }

This doesn't work because the broadcast operation has to be run with non-diverged warps for correct operation, and here there is divergence due to the if (threadIdx.x == 0) condition. So, the way I have tried to handle this is by attempting to inhibit optimisation along edges which have a reconvergence point as their destination.
The essential idea is to make such edges abnormal, although the existing EDGE_ABNORMAL flag is not used because that has implicit meaning built into it already, and the new edge type may need to be handled differently in some areas. One example is that at present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL set on their EDGE_TRUE or EDGE_FALSE outgoing edges. The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE), for the GIMPLE CFG only. In principle there's nothing to stop the flag being propagated to the RTL CFG also, in which case it'd probably be set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g. EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which inspect the RTL CFG can continue to only check the ABNORMAL flag. But so far (in rather limited testing!), that has not been observed to be necessary. (We can control RTL CFG manipulation indirectly by using the CANNOT_COPY_INSN_P target hook, sensitive e.g. to the broadcast instruction.) For the GIMPLE CFG (i.e. in passes operating on GIMPLE form), EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e., inhibiting certain optimisations), and so has been added to relevant conditionals largely mechanically. Places where it is treated specially are: * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on outgoing edges of a block concluding with a GIMPLE_COND statement. But, we allow EDGE_TO_RECONVERGENCE there. * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass tries to split the edge later). There are probably other optimisations that will be tripped up by the new flag along the same lines as the VRP tweak above, which we will no doubt discover in due course. Together with the patch, https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html This shows no regressions for the libgomp tests. OK for gomp4 branch? 
Thanks, Julian ChangeLog gcc/ * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag. (bb_hash_abnorm_or_reconv_pred): New function. (hash_abnormal_or_eh_outgoing_edge_p): Consider EDGE_TO_RECONVERGENCE also. * cfg-flags.def (TO_RECONVERGENCE): Add flag. * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges leading to a reconvergence point. * cfgbuild.c (purge_dead_tablejump_edges): Consider EDGE_TO_RECONVERGENCE. * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise. * cfgexpand.c (expand_gimple_tailcall, construct_exit_block) (pass_expand::execute): Likewise. * cfghooks.c (can_copy_bbs_p): Likewise. * cfgloop.c (bb_loop_header_p): Likewise. * cfgloopmanip.c (scale_loop_profile): Likewise. * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise. * graph.c (draw_cfg_node_succ_edges): Likewise. * graphite-scope-detection.c
Re: [gomp4] Preserve NVPTX reconvergence points
On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote: For NVPTX, it is vitally important that the divergence of threads within a warp can be controlled: in particular we must be able to generate code that we know reconverges at a particular point. Unfortunately GCC's middle-end optimisers can cause this property to be violated, which causes problems for the OpenACC execution model we're planning to use for NVPTX. As a brief example: code running in vector-single mode runs on a single thread of a warp, and must broadcast condition results to other threads of the warp so that they can follow along and be ready for vector-partitioned execution when necessary. I think the lowering of this already at ompexp time is premature, I think much better would be to have a function attribute (or cgraph flag) that would be set for functions you want to compile this way (plus a targetm flag that the targets want to support it that way), plus a flag in loop structure for the acc loop vector loops (perhaps the current OpenMP simd loop flags are good enough for that), and lower it somewhere around the vectorization pass or so. Or, what exactly do you emit for the fallback code, or for other GPGPUs or XeonPhi? To me e.g. for XeonPhi or HSA this sounds like you want to implement the acc loop gang as a work-sharing loop among threads (like #pragma omp for) and #pragma acc loop vector like a loop that should be vectorized if at all possible (like #pragma omp simd). I really think it is important that OpenACC GCC support is not so strongly tied to one specific GPGPU, and similarly OpenMP should be usable for all offloading targets GCC supports. That way, it is possible to auto-vectorize the code too, decision how to expand the code of offloaded function is done already separately for each offloading target, there is a space for optimizations on much simpler cfg, etc. Jakub
Re: [gomp4] Preserve NVPTX reconvergence points
Hi! On Thu, 28 May 2015 16:20:11 +0200, Jakub Jelinek ja...@redhat.com wrote: On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote: [...] I think the lowering of this already at ompexp time is premature Yes, we're aware of this wart. :-| I think much better would be to have a function attribute (or cgraph flag) that would be set for functions you want to compile this way (plus a targetm flag that the targets want to support it that way), plus a flag in loop structure for the acc loop vector loops (perhaps the current OpenMP simd loop flags are good enough for that), and lower it somewhere around the vectorization pass or so. Moving the loop lowering/expansion later is along the same lines as we've been thinking. Figuring out how the OpenMP simd implementation works is another thing I wanted to look into. Or, what exactly do you emit for the fallback code, or for other GPGPUs or XeonPhi? To me e.g. for XeonPhi or HSA this sounds like you want to implement the acc loop gang as a work-sharing loop among threads (like #pragma omp for) and #pragma acc loop vector like a loop that should be vectorized if at all possible (like #pragma omp simd). I really think it is important that OpenACC GCC support is not so strongly tied to one specific GPGPU Not disagreeing, but: we have to start somewhere. GPU offloading and all its peculiarities are still entering unknown territory in GCC; we're still learning, and shall try to converge the emerging different implementations in the future. Doing the completely generic (agnostic of specific offloading device) implementation right now is a challenging task, hence the work on a nvptx-specific prototype first, to put it this way. That said, we of course very much welcome your continued review of our work, and your suggestions! and similarly OpenMP should be usable for all offloading targets GCC supports.
That way, it is possible to auto-vectorize the code too, decision how to expand the code of offloaded function is done already separately for each offloading target, there is a space for optimizations on much simpler cfg, etc. Grüße, Thomas
Re: [gomp4] Preserve NVPTX reconvergence points
On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote: I think much better would be to have a function attribute (or cgraph flag) that would be set for functions you want to compile this way (plus a targetm flag that the targets want to support it that way), plus a flag in loop structure for the acc loop vector loops (perhaps the current OpenMP simd loop flags are good enough for that), and lower it somewhere around the vectorization pass or so. Moving the loop lowering/expansion later is along the same lines as we've been thinking. Figuring out how the OpenMP simd implementation works is another thing I wanted to look into. The OpenMP simd expansion is actually quite a simple thing. Basically, the simd loop is in ompexp expanded as a normal loop with some flags in the loop structure (which are pretty much optimization hints). There is a flag saying that the user would really like the loop vectorized, and another field that says (from what the user told us) what vectorization factor is safe to use regardless of the compiler's analysis. There are some complications with privatization clauses, so some variables are represented in GIMPLE as arrays with the maximum vf elements, indexed by an internal function (the simd lane), which the vectorizer then either turns into a scalar again (if the loop isn't vectorized), or vectorizes, keeping addressables in arrays with the actual vf elements. I admit I don't know too much about OpenACC, but I'd think doing something similar (i.e. some loop structure hint or request that a particular loop is vectorized, and perhaps something about lexical forward/backward dependencies in the loop) could work. Then for XeonPhi or host fallback, you'd just use the normal vectorizer. And for PTX, at about the same point, you could instead of vectorizing lower the code so that a single working thread does the work, except for simd-marked loops, which would be lowered to run on all threads in the warp. Not disagreeing, but: we have to start somewhere.
GPU offloading and all its peculiarities are still entering unknown territory in GCC; we're still learning, and shall try to converge the emerging different implementations in the future. Doing the completely generic (agnostic of specific offloading device) implementation right now is a challenging task, hence the work on a nvptx-specific prototype first, to put it this way. I understand it is more work, I'd just like to ask that when designing stuff for the OpenACC offloading you (plural) try to take the other offloading devices and host fallback into account. E.g. the XeonPhi is not hard to understand, it is pretty much just a many-core x86_64 chip where offloading is a matter of how to run something on the other device, and the emulation mode emulates that well by running it in a different process. This stuff is already about what happens in offloaded code, so considerations for it are similar to those for host code (especially hosts that can vectorize). As far as OpenMP / PTX goes, I'll try to find time for it again soon (busy with OpenMP 4.1 work so far), but e.g. the above stuff (having a single thread in a warp do most of the non-vectorized work, and only using the other threads in the warp for vectorization) is definitely what OpenMP will benefit from too. Jakub