On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote:
> On 06/19/2015 03:45 PM, Jakub Jelinek wrote:
> > If the loop remains in the IL (isn't optimized away as unreachable or
> > isn't removed, e.g. as a non-loop - say if it contains a noreturn call),
> > the flags on struct loop should be still there.  For the loop clauses
> > (reduction always, and private/lastprivate if addressable etc.) for
> > OpenMP simd / Cilk+ simd we use special arrays indexed by internal
> > functions, which then during vectorization are shrunk (but in theory could
> > be expanded too) to the right vectorization factor if vectorized, of course
> > accesses within the loop vectorized using SIMD, and if not vectorized,
> > shrunk to 1 element.
>
> I'd appreciate if you could describe that mechanism in more detail.  As far
> as I can tell it is very poorly commented and documented in the code.  I
> mean, it doesn't even follow the minimal coding standards of describing
> function inputs:
>
>   /* Helper function of lower_rec_input_clauses, used for #pragma omp simd
>      privatization.  */
>
>   static bool
>   lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
>                                 tree &idx, tree &lane, tree &ivar, tree &lvar)
Here is the theory behind it:
https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html
In the end it uses internal functions instead of uglified builtins.
I'd suggest you look at some of the libgomp.c/simd*.c tests, say with
-O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized} to see how it
is lowered and expanded.  I assume #pragma omp simd roughly corresponds
to #pragma acc loop vector; maxvf for PTX vectorization is supposedly 32
(the warp size).

For SIMD vectorization, if the vectorization fails, the arrays are
shrunk to 1 element; otherwise they are shrunk to the vectorization
factor, and if they aren't really addressable, later passes (FRE and
other memory optimizations) clean them up so that they don't touch
memory unless really needed.

For the PTX style vectorization (parallelization between threads in a
warp), I'd say you would always shrink to 1 element again, but such
variables would be local to each of the threads in the warp (another
possibility is shared arrays of size 32 indexed by %tid.x & 31), while
addressable variables without such magic type would be shared among all
threads; non-addressable variables (SSA_NAMEs) would be handled
depending on where they are used.

You'd need to transform reductions (which are right now represented as
another loop, from 0 to an internal function, so they are easily
recognizable) into the PTX reductions.  Also, lastprivate is now an
access to the array using the last lane internal function; I don't know
what that corresponds to in PTX (perhaps also a reduction, where all
threads except the one executing the last iteration OR in 0 and the
remaining thread ORs in the lastprivate value).

	Jakub