https://bugs.llvm.org/show_bug.cgi?id=43998
Bug ID: 43998
Summary: Poor performance of OpenMP distribute construct
Product: OpenMP
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Clang Compiler Support
Assignee: unassignedclangb...@nondot.org
Reporter: csda...@lbl.gov
CC: llvm-bugs@lists.llvm.org
The OpenMP distribute construct performs significantly worse than manually
dividing loop iterations between thread teams. The test program below measures
both methods on a system with Intel Skylake CPUs and NVIDIA V100 GPUs. The
distribute version is ~700x slower (2.087 s vs. 0.003 s for 100 iterations). I
am using LLVM/Clang from Nov 11 2019, and the same issue occurs with LLVM/Clang
from Aug 28 2019.
$ make
clang++ -std=c++11 -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o test.exe test.cpp
$ srun -n 1 ./test.exe
Number of sites = 1048576
Executing 100 iterations
Time w/distribute = 2.087 seconds
Time workaround = 0.003 seconds
$ cat test.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <chrono>

typedef std::chrono::system_clock Clock;

#define ITERATIONS 100
#define TOTAL_SITES 1048576

int main(int argc, char *argv[])
{
  int total_sites = TOTAL_SITES;
  printf("Number of sites = %d\n", total_sites);
  printf("Executing %d iterations\n", ITERATIONS);

  auto tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
#pragma omp target teams distribute
    for(int i=0; i<total_sites; ++i) {
      ;
    }
  }
  double sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time w/distribute = %.3f seconds\n", sec);

  tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
#pragma omp target teams
    {
      int total_teams = omp_get_num_teams();
      int team_id = omp_get_team_num();
      int sites_per_team = (total_sites + total_teams - 1) / total_teams;
      int istart = team_id * sites_per_team;
      if (istart > total_sites) istart = total_sites;
      int iend = istart + sites_per_team;
      if (iend > total_sites) iend = total_sites;
      /* This is the total_sites loop manually chopped up */
      for (int i = istart; i < iend; ++i) {
        ;
      }
    }
  }
  sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time workaround = %.3f seconds\n", sec);
}
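
For reference, the index arithmetic of the workaround can be checked on the
host on its own. The sketch below is not part of test.cpp: team_range and
covered_sites are hypothetical helper names introduced only for illustration.
It applies the same ceiling division and clamping as the second timed loop and
sums the chunk sizes to confirm that every site is assigned to exactly one team.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper (not in the original test program): returns the
// half-open range [istart, iend) of iterations owned by team_id, using the
// same ceiling-division scheme as the manual workaround.
std::pair<int, int> team_range(int total_sites, int total_teams, int team_id)
{
  int sites_per_team = (total_sites + total_teams - 1) / total_teams;
  int istart = std::min(team_id * sites_per_team, total_sites);
  int iend = std::min(istart + sites_per_team, total_sites);
  return std::make_pair(istart, iend);
}

// Hypothetical host-side check: sums the chunk sizes over all teams. Because
// consecutive chunks are contiguous and clamped to total_sites, the sum equals
// total_sites when every site is covered exactly once.
int covered_sites(int total_sites, int total_teams)
{
  int covered = 0;
  for (int t = 0; t < total_teams; ++t) {
    std::pair<int, int> r = team_range(total_sites, total_teams, t);
    covered += r.second - r.first;
  }
  return covered;
}
```

When total_teams does not divide total_sites evenly, the trailing teams get a
short or empty range, which is what the two clamps in the workaround handle.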
The performance of the distribute construct can be improved by reducing the
number of teams using the num_teams clause. However, it is never competitive
with manually dividing loop iterations between thread teams.
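
A minimal sketch of the num_teams variant, for anyone who wants to reproduce
it. This is not the exact code I timed: the function name
count_sites_with_num_teams and the team_count parameter are hypothetical, and
the reduction (with its map clause) is added only so the result can be checked,
which the empty loop body in test.cpp does not allow. The reduction will also
affect timing, so use it to verify the construct, not to benchmark it.

```cpp
// Hypothetical variant of the first timed loop with an explicit team count.
// The value passed as team_count is an assumption to be tuned per device.
long count_sites_with_num_teams(int total_sites, int team_count)
{
  long visited = 0;
  // Without -fopenmp the pragma is ignored and the loop runs serially on the
  // host, which still yields the same count.
#pragma omp target teams distribute num_teams(team_count) \
    map(tofrom: visited) reduction(+: visited)
  for (int i = 0; i < total_sites; ++i) {
    visited += 1;
  }
  return visited;
}
```

Calling count_sites_with_num_teams(TOTAL_SITES, n) for a few values of n inside
the timing loop is one way to see how the team count changes the result.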
Thanks,
Chris