Gentle ping;) Could someone please have a glance? Regards, Igor

-----Original Message-----
From: Venevtsev, Igor
Sent: Friday, July 21, 2017 2:04 PM
To: gcc-patches@gcc.gnu.org
Cc: Jakub Jelinek <ja...@redhat.com>
Subject: [libgomp] Doc update - TASKLOOP/GOMP_doacross_ description

Hi!

This patch adds an Implementing-TASKLOOP-construct section as well as a
description of the GOMP_doacross_ runtime routine family to the
Implementing-FOR-construct section.  (I checked that 'make info' and
'make html' produce correct doc output.)

2017-07-21  Igor Venevtsev  <igor.venevt...@intel.com>

	* libgomp.texi (Implementing-TASKLOOP-Construct): New section.
	(Implementing-FOR-Construct): Add documentation for the
	GOMP_doacross_ runtime routine family.

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 230720f..1f17014 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3096,6 +3096,7 @@ presented by libgomp.  Only maintainers should need them.
 * Implementing SECTIONS construct::
 * Implementing SINGLE construct::
 * Implementing OpenACC's PARALLEL construct::
+* Implementing TASKLOOP construct::
 @end menu

@@ -3354,7 +3355,7 @@ Note that while it looks like there is trickiness to propagating
 a non-constant STEP, there isn't really.  We're explicitly allowed
 to evaluate it as many times as we want, and any variables involved
 should automatically be handled as PRIVATE or SHARED like any other
-variables.  So the expression should remain evaluable in the 
+variables.  So the expression should remain evaluable in the
 subfunction.  We can also pull it into a local variable if we like,
 but since its supposed to remain unchanged, we can also not if we
 like.

@@ -3367,7 +3368,197 @@ of these routines.

 There are separate routines for handling loops with an ORDERED
 clause.  Bookkeeping for that is non-trivial...
+...Yep! But let's try!
+Loops with an ORDERED clause @strong{with a parameter}, '#pragma omp
+for ordered @strong{(N)}', present so-called DOACROSS parallelism and
+are expanded to the @strong{GOMP_doacross_...} family of runtime
+routines.
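(As an aside, not part of the patch text: here is a minimal, self-contained C sketch of the same sink/source pattern with a one-dimensional dependence. All names in it are made up for illustration. Each iteration of the ordered(1) loop needs the result of the previous one; the depend(sink:) pragma is what gets lowered to a GOMP_doacross_wait call on the 0-based iteration number of the sink, and the depend(source) pragma to a GOMP_doacross_post call. Built without OpenMP the pragmas are simply ignored and the loop runs sequentially, producing the same values.)

```c
#include <stddef.h>

/* Prefix sums with a cross-iteration dependence: s[i] needs s[i-1].
   With 'ordered(1)', iteration i waits (sink) until iteration i-1 has
   finished (source) before reading its result.  */
void prefix_sums(const int *a, long long *s, size_t n)
{
  s[0] = a[0];
  long i; /* signed type, as required for an OpenMP loop iterator */
  #pragma omp parallel for ordered(1)
  for (i = 1; i < (long) n; i++)
    {
      /* Lowered to a GOMP_doacross_wait call for iteration i-1.  */
      #pragma omp ordered depend(sink: i-1)
      s[i] = s[i-1] + a[i];
      /* Lowered to a GOMP_doacross_post call for this iteration.  */
      #pragma omp ordered depend(source)
    }
}
```

With a = {1, 2, ..., 8} the final element of s is 36 whether or not OpenMP is enabled; the doacross synchronization only constrains the order of the updates, not the values.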
+Loops with an ORDERED clause @strong{without a parameter}, '#pragma omp
+for ordered', are expanded to the @strong{GOMP_loop_ordered_...}
+routine family.
+This section describes only the @strong{GOMP_doacross_} family so far.
+
+A short introduction to do-across parallelism terminology and its
+synchronization model can be found here:
+@uref{https://en.wikipedia.org/wiki/Loop-level_parallelism#DOACROSS_parallelism}
+
+Another explanation of doacross parallelism, from libgomp author Jakub
+Jelinek, can be found here:
+@uref{https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/}
+
+OpenMP v4.5 expresses the wait() and post() operations with the
+@strong{#pragma omp ordered depend(sink: ...)} and @strong{#pragma omp
+ordered depend(source)} constructs.  These are lowered to
+GOMP_doacross_wait() and GOMP_doacross_post() respectively; see the
+example below.
+
+Here is a DOACROSS loop example from the official OpenMP v4.5 examples
+(@uref{https://github.com/OpenMP/Examples}).
+@*Note that the ordered clause has a parameter:
+'#pragma omp for ordered@strong{(2)}'.
+
+@smallexample
+/*
+* @@name: doacross.2.c
+* @@type: C
+* @@compilable: yes
+* @@linkable: no
+* @@expect: success
+*/
+float foo(int i, int j);
+float bar(float a, float b, float c);
+float baz(float b);
+
+void work( int N, int M, float **A, float **B, float **C ) @{
+  int i, j;
+
+  #pragma omp for ordered(2)
+  for (i=1; i<N; i++)
+    @{
+      for (j=1; j<M; j++)
+        @{
+          A[i][j] = foo(i, j);
+
+          #pragma omp ordered depend(sink: i-1,j) depend(sink: i,j-1)
+          B[i][j] = bar(A[i][j], B[i-1][j], B[i][j-1]);
+          #pragma omp ordered depend(source)
+
+          C[i][j] = baz(B[i][j]);
+        @}
+    @}
+@}
+@end smallexample
+
+It becomes
+
+@smallexample
+void work( int N, int M, float **A, float **B, float **C ) @{
+  long counts[2];   // counts has size N from the ORDERED(N) clause, i.e. 2 in our case
+  long orditera[2]; // iteration numbers to pass to GOMP_doacross_post
+  long istart, iend;
+  long i, j;
+
+  counts[0] = N-1;
+  counts[1] = M-1;
+  ...
+  if (GOMP_doacross_static_start(2, counts, 0, &istart, &iend)) // initialize the doacross work-share structure here
+    @{
+      do @{
+        for (i = istart + 1; i < iend + 1; ++i) @{ // only the outer loop has been parallelized
+          for (j = 1; j < M; ++j) @{
+            A[i][j] = foo(i, j);
+            GOMP_doacross_wait(i-2, j-1); // depend(sink: i-1,j), 0-based
+            GOMP_doacross_wait(i-1, j-2); // depend(sink: i,j-1), 0-based
+            B[i][j] = bar(A[i][j], B[i-1][j], B[i][j-1]);
+            orditera[0] = i-1;            // depend(source), 0-based
+            orditera[1] = j-1;
+            GOMP_doacross_post(&orditera[0]);
+            C[i][j] = baz(B[i][j]);
+          @}
+        @}
+      @} while (GOMP_loop_static_next(&istart, &iend));
+    @}
+  GOMP_loop_end();
+@}
+@end smallexample
+
+Also note that, like the GOMP_doacross_wait arguments, the
+GOMP_doacross_post array indices are 0-based, i.e. the first iteration
+of an ordered(2) loop is @{ 0, 0 @} no matter what the actual values
+of the iterators are.
+
+To understand what the ABI does it is perhaps better (but perhaps too
+complicated for documentation snippets) to work with loops whose
+increment expressions differ from 1, something like:
+
+@smallexample
+void work( int N, int M, float **A, float **B, float **C ) @{
+  int i, j;
+
+  #pragma omp for ordered(2)
+  for (i=4; i<N; i += 2)
+    @{
+      for (j=4; j<M; j += 2)
+        @{
+          A[i][j] = foo(i, j);
+          #pragma omp ordered depend(sink: i-2,j) depend(sink: i,j-2)
+          B[i][j] = bar(A[i][j], B[i-2][j], B[i][j-2]);
+          #pragma omp ordered depend(source)
+          C[i][j] = baz(B[i][j]);
+        @}
+    @}
+@}
+@end smallexample
+
+and then also try something with negative increments.
+
+@table @asis
+@item @emph{Prototypes}:
+Post and wait routines:
+@*@strong{void GOMP_doacross_post(long *count);} // array of iteration
+counts to post; the array size is N from the ordered(N) clause
+@*@strong{void GOMP_doacross_wait(long first, ...);} // vararg
+iteration counts to wait for; the number of arguments is N from the
+ordered(N) clause
+
+The *_start routines are called when first encountering a loop
+construct that is not bound directly to a parallel construct.
The
+first thread that arrives will create the work-share construct;
+subsequent threads will see that the construct exists and allocate
+work from it.
+START, END, INCR are the bounds of the loop; due to the restrictions
+of OpenMP, these values must be the same in every thread.  This is not
+verified (nor is it entirely verifiable, since START is not necessarily
+retained intact in the work-share data structure).  CHUNK_SIZE is the
+scheduling parameter; again this must be identical in all threads.
+Returns true if there is any work for this thread to perform.  If so,
+*ISTART and *IEND are filled with the bounds of the iteration block
+allocated to this thread.  Returns false if all work was assigned to
+other threads prior to this thread's arrival.
+
+The *_doacross_*_start routines are similar.  The only difference is
+that the work-share construct is initialized to expect an ORDERED(N)
+DOACROSS section, and the work-sharing loop always iterates from 0 to
+COUNTS[0] - 1, while the other COUNTS array elements tell the library
+the number of iterations of the ordered inner loops.
+
+Also, if the doacross loop is collapse(n) for n > 1, then
+GOMP_doacross_static_start still has all the collapsed loops as a
+single counts element (the 0th) and istart/iend again range from 0 to
+that count, so the compiler-generated code needs to compute all the
+collapsed loop iterators from that after
+GOMP_doacross_static_@{start,next@}.
+@*@strong{bool GOMP_loop_doacross_static_start (unsigned ncounts, long *counts,
+    long chunk_size, long *istart, long *iend);}
+
+The *_next routines are called when the thread completes processing of
+the iteration block currently assigned to it.  If the work-share
+construct is bound directly to a parallel construct, then the
+iteration bounds may have been set up before the parallel, in which
+case this may be the first iteration block for the thread.
+Returns true if there is work remaining to be performed; *ISTART and
+*IEND are filled with a new iteration block.
Returns false if all
+work has been assigned.
+@*@strong{bool GOMP_loop_doacross_static_next (long *istart, long *iend);}
+
+The following routines have the same semantics and signatures as the
+*_doacross_static_* routines described above.  The routine type
+(static, dynamic, auto, guided) is selected according to the
+SCHEDULE(modifier) clause; the default is static.
+@*@strong{bool GOMP_loop_doacross_dynamic_start (unsigned ncounts, long *counts,
+    long chunk_size, long *istart, long *iend);}
+@*@strong{bool GOMP_loop_doacross_guided_start (unsigned ncounts, long *counts,
+    long chunk_size, long *istart, long *iend);}
+
+This routine chooses the loop scheduling type (static, dynamic or
+guided) at run time according to the OMP_SCHEDULE ICV value:
+@*@strong{bool GOMP_loop_doacross_runtime_start (unsigned ncounts, long *counts,
+    long chunk_size, long *istart, long *iend);}
+
+The *_ull* variants operate on unsigned long long int:
+@*@strong{typedef unsigned long long gomp_ull;}
+@*@strong{bool GOMP_loop_ull_doacross_dynamic_start (unsigned ncounts, gomp_ull *counts,
+    gomp_ull chunk_size, gomp_ull *istart, gomp_ull *iend);}
+@*@strong{bool GOMP_loop_ull_doacross_guided_start (unsigned ncounts, gomp_ull *counts,
+    gomp_ull chunk_size, gomp_ull *istart, gomp_ull *iend);}
+@*@strong{bool GOMP_loop_ull_doacross_runtime_start (unsigned ncounts, gomp_ull *counts,
+    gomp_ull chunk_size, gomp_ull *istart, gomp_ull *iend);}
+@*@strong{bool GOMP_loop_ull_doacross_static_start (unsigned ncounts, gomp_ull *counts,
+    gomp_ull chunk_size, gomp_ull *istart, gomp_ull *iend);}
+
+The *_nonmonotonic_* loop routines are for now just aliases of the
+corresponding routines that are told the schedule does not have to be
+monotonic (so OpenMP 5.0 programs will use them by default rather than
+the old ones, unless the monotonic clause is used):
+@*@strong{GOMP_loop_nonmonotonic_dynamic_start ==> GOMP_loop_dynamic_start}
+@*@strong{GOMP_loop_nonmonotonic_dynamic_next ==> GOMP_loop_dynamic_next}
+@*@strong{GOMP_loop_nonmonotonic_guided_start ==> GOMP_loop_guided_start}
@*@strong{GOMP_loop_nonmonotonic_guided_next ==> GOMP_loop_guided_next}
+
+@end table

 @node Implementing ORDERED construct
 @section Implementing ORDERED construct

@@ -3469,6 +3660,83 @@ becomes

+@node Implementing TASKLOOP construct
+@section Implementing TASKLOOP construct
+
+The code below is part of the official OpenMP v4.5 examples
+(@uref{https://github.com/OpenMP/Examples}).
+
+@smallexample
+/*
+ * @@name: taskloop.c
+ * @@type: C
+ * @@compilable: yes
+ * @@linkable: no
+ * @@expect: success
+ */
+void long_running_task(void);
+void loop_body(int i, int j);
+
+void parallel_work(void) @{
+  int i, j;
+  #pragma omp taskgroup
+  @{
+    #pragma omp task
+    long_running_task(); // can execute concurrently
+
+    #pragma omp taskloop private(j) grainsize(500) nogroup
+    for (i = 0; i < 10000; i++) @{ // can execute concurrently
+      for (j = 0; j < i; j++) @{
+        loop_body(i, j);
+      @}
+    @}
+  @}
+@}
+@end smallexample
+
+becomes
+
+@smallexample
+void subfunction(void *data) @{
+  use data
+  loop_body();
+@}
+
+void parallel_work() @{
+  struct omp_data;
+  long int _1;
+  long int _2;
+
+  _1 = 0;
+  _2 = 10000;
+  GOMP_taskloop (subfunction, &omp_data, 0B, 16, 8, 3840, 500, 0,
+                 _1, _2, 1);
+@}
+@end smallexample
+
+@table @asis
+@item @emph{Prototype}:
+void
+GOMP_taskloop (void (*fn) (void *), void *data,
+               void (*cpyfn) (void *, void *),
+               long arg_size, long arg_align, unsigned flags,
+               unsigned long num_tasks, int priority,
+               long start, long end, long step)
+@end table
+
+FLAGS can be any combination of
+
+@smallexample
+/* GOMP_task/GOMP_taskloop* flags argument.
+   */
+#define GOMP_TASK_FLAG_UNTIED    (1 << 0)
+#define GOMP_TASK_FLAG_FINAL     (1 << 1)
+#define GOMP_TASK_FLAG_MERGEABLE (1 << 2)
+#define GOMP_TASK_FLAG_DEPEND    (1 << 3)
+#define GOMP_TASK_FLAG_PRIORITY  (1 << 4)
+#define GOMP_TASK_FLAG_UP        (1 << 8)
+#define GOMP_TASK_FLAG_GRAINSIZE (1 << 9)
+#define GOMP_TASK_FLAG_IF        (1 << 10)
+#define GOMP_TASK_FLAG_NOGROUP   (1 << 11)
+@end smallexample
+
 @c ---------------------------------------------------------------------
 @c Reporting Bugs
 @c ---------------------------------------------------------------------

Is it OK for trunk?

Regards, Igor