https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
On the coarray library interaction, we have:
* On startup:
  - the coarray library is initialized
  - both coarrays are registered with the library
    i.e. the library allocates the memory for the program
* At the end: the coarray library is told to finish


Otherwise, the program itself just calls _gfortran_matmul_r4.

The memory for variable "c" is in static memory while "a" and "b" are allocated
by the coarray run-time library.

If "b" (and "a") are allocated in some slower memory part, it matters how one
sums over the variables in the matmul loop.

For instance, for A = BC, one can calculate it as:
   a(i,j) = sum_k b(i,k)*c(k,j)
In that case, c(:,j) is contiguous in memory (which is faster) and b(i,:) is
not (which is slower), because Fortran stores arrays in column-major order.
[I have not checked in which order _gfortran_matmul_r4 runs the loops.]
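
To make the effect of the summation order concrete, here is a minimal sketch in
C that emulates Fortran's column-major layout by hand. The names IDX,
matmul_ijk and matmul_jki are made up for this illustration, and this is not
how _gfortran_matmul_r4 is implemented; it only shows that the same product can
be computed with different access patterns on "b".

```c
#include <stddef.h>

/* Column-major (Fortran-style) offset of element (i,j) in a matrix with
   n rows.  Hypothetical helper, indices are zero-based. */
#define IDX(i, j, n) ((size_t)(j) * (size_t)(n) + (size_t)(i))

/* a = b * c for n x n matrices, innermost loop over k.
   c(:,j) is read with unit stride, b(i,:) with stride n. */
void matmul_ijk(size_t n, float *a, const float *b, const float *c)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += b[IDX(i, k, n)] * c[IDX(k, j, n)];
            a[IDX(i, j, n)] = sum;
        }
    }
}

/* Same product, innermost loop over i.
   Both a(:,j) and b(:,k) are now walked with unit stride. */
void matmul_jki(size_t n, float *a, const float *b, const float *c)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++)
            a[IDX(i, j, n)] = 0.0f;
        for (size_t k = 0; k < n; k++) {
            const float ckj = c[IDX(k, j, n)];
            for (size_t i = 0; i < n; i++)
                a[IDX(i, j, n)] += b[IDX(i, k, n)] * ckj;
        }
    }
}
```

Both routines produce the same "a"; they differ only in how often a new cache
line of "b" has to be fetched, which is where slower memory for "b" would show
up.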

In the simplest case, the memory of "B" is slower because part of the page is
accessed by two processors and it keeps getting kicked out of the level 1 or
level 2 cache - and, hence, needs to be loaded from the L3/L4 cache or the
normal RAM.

Besides those reasons, there is also an overhead of generating multiple jobs
and synchronizing them - i.e. waiting until all processes have called "init"
and all processes have called "finished" in the library. With many processes,
this overall barrier (all waiting for all) might take a very sizable amount of
the total run time.

For 1 vs. 2 jobs, it should be negligible, but some libraries still have an
explicit shortcut for 1 job.
