https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87622
Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |burnus at gcc dot gnu.org

--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
On the coarray library interaction, we have:
* On startup:
  - the coarray library is initialized
  - both coarrays are registered with the library, i.e. the library
    allocates the memory for the program
* At the end: the coarray library is told to finish

Otherwise, the program itself just calls _gfortran_matmul_r4.

The memory for variable "c" is in static memory while "a" and "b" are
allocated by the coarray run-time library.

If "b" (and "a") are allocated in some slower memory part, it matters how
one sums over the variables in the matmul loop. For instance, for A = BC,
one can calculate it as:
  a(i,j) = sum_k b(i,k)c(k,j)
In that case, with Fortran's column-major storage, c(:,j) is contiguous in
memory (which is faster) and b(i,:) is strided (which is slower). [I have
not checked in which order _gfortran_matmul_r4 runs that loop.]

In the simplest case, the memory of "b" is slower because part of the page
is accessed by two processors and it keeps getting kicked out of the level 1
or level 2 cache and, hence, needs to be loaded from the L3/L4 cache or the
normal RAM.

Besides those reasons, there is also an overhead of generating multiple jobs
and synchronizing them, i.e. waiting until all processes have called "init"
and all processes have called "finished" in the library.

With many processes, this overall barrier (all waiting for all) might take a
sizable share of the total run time. For 1 vs. 2 jobs, it should be
negligible, but some libraries still have an explicit shortcut for 1 job.
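
To make the memory-layout point concrete, here is a small sketch (in Python, purely for illustration) of where the elements touched by a(i,j) = sum_k b(i,k)c(k,j) sit in memory. It assumes Fortran's column-major element order and square n-by-n arrays; the helper name column_major_offset and the size n = 4 are made up for this example and say nothing about how _gfortran_matmul_r4 actually orders its loops.

```python
def column_major_offset(i, j, nrows):
    """Linear offset (in elements, 0-based) of element (i, j), given
    1-based indices and Fortran column-major storage."""
    return (i - 1) + (j - 1) * nrows


n = 4  # hypothetical matrix size, chosen only for the illustration

# Elements read along k for one a(i,j), here with i = 1 and j = 1.
row_of_b = [column_major_offset(1, k, n) for k in range(1, n + 1)]  # b(1,:)
col_of_c = [column_major_offset(k, 1, n) for k in range(1, n + 1)]  # c(:,1)

print("b(1,:) offsets:", row_of_b)  # step of n elements between accesses
print("c(:,1) offsets:", col_of_c)  # step of 1 element, i.e. contiguous
```

Under these assumptions, consecutive k steps through b(i,:) are n elements apart, while steps through c(:,j) touch neighbouring elements, so a row-wise walk over an array that also lives in slower memory pays twice.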