On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> The performance I saw was lower by a factor of 80 or so compared to their CUDA
> version, and even lower than OpenMP on the host.
The currently published OpenMP version of LULESH simply doesn't use the OpenMP
simd construct anywhere. That alone should make it obvious that it won't come
anywhere near a reasonable CUDA implementation, and that it is also bound to
fall below host performance.
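To illustrate what the published sources leave on the table, here is a
hypothetical sketch of annotating an inner loop with omp simd so the compiler
can vectorize it (the function and variable names are made up, not taken from
LULESH; without the pragma, as in the published OpenMP version, the loop stays
scalar):

```c
#include <stddef.h>

/* Hypothetical sketch: an omp simd annotation on a LULESH-style
   stream loop.  The pragma asks the compiler to vectorize; without
   it, each lane executes scalar code.  */
static void
scale_add (double *restrict out, const double *restrict a,
           const double *restrict b, double s, size_t n)
{
  #pragma omp simd
  for (size_t i = 0; i < n; i++)
    out[i] = a[i] + s * b[i];
}
```

(Compiled without -fopenmp the pragma is simply ignored, so the code remains
correct either way; with it, the loop is a vectorization candidate.)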
Besides, it's common for such benchmark suites to have very different levels of
hand-tuning in the native-CUDA implementation versus the OpenMP one, sometimes
to the point of significant algorithmic differences. So you're making an
invalid comparison here.
Internally at ISP RAS we used a small set of microbenchmarks implemented in
CUDA/OpenACC/OpenMP specifically to evaluate the exact same computations
expressed in terms of different APIs, and we observed comparable performance
across all three. The biggest issue shows up on short-running OpenMP target
regions: the startup cost (going through libgomp) is non-trivial. That can be
improved with further changes in the libgomp port, notably avoiding malloc,
shaving off more code, and perhaps inlining more (e.g. via LTO eventually).
There are also avoidable cuMemAlloc/cuMemFree calls on the libgomp plugin side.
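On that last point, the per-region alloc/free pair could be avoided by caching
the device buffer between launches. A minimal sketch of the idea in plain C
(hypothetical names, not the actual libgomp plugin code; malloc/free stand in
for cuMemAlloc/cuMemFree):

```c
#include <stdlib.h>

/* Hypothetical one-slot buffer cache: reuse the previous allocation
   when the requested size still fits, instead of paying for an
   alloc/free pair on every target-region launch.  */
static void *cached_buf;
static size_t cached_size;

void *
region_alloc (size_t size)
{
  if (cached_buf && size <= cached_size)
    {
      void *p = cached_buf;
      cached_buf = NULL;
      return p;
    }
  free (cached_buf);
  cached_buf = NULL;
  void *p = malloc (size);
  if (p)
    cached_size = size;
  return p;
}

void
region_free (void *p)
{
  /* Keep the buffer for the next region instead of freeing it.  */
  cached_buf = p;
}
```

A real implementation would need to handle multiple concurrent regions and
flush the cache on device shutdown, but the principle is the same.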
For example, there's this patch on the branch:

    libgomp: avoid malloc calls in gomp_nvptx_main

    Avoid calling malloc where it's easy to use stack storage instead:
    device malloc is very slow in CUDA.  This cuts about 60-80
    microseconds from target region entry/exit time, slimming down
    empty target regions from ~95 to ~17 microseconds (as measured on
    a GTX Titan).

(An empty CUDA kernel takes ~5 microseconds; all figures were taken via
nvprof.)
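The pattern behind that patch is simply replacing heap allocation with
automatic storage wherever the size is small and statically bounded. A
hypothetical sketch of the transformation (not the actual gomp_nvptx_main
code; names are made up):

```c
#include <string.h>

/* Hypothetical sketch: a small, bounded scratch buffer moves from
   device malloc (very slow under CUDA) to the stack.  */
enum { MAX_ARGS = 16 };

/* Before: void **buf = malloc (nargs * sizeof (void *)); ... free (buf);
   After:  */
int
launch_with_args (void **args, int nargs)
{
  if (nargs > MAX_ARGS)
    return -1;                  /* error / fallback path */
  void *buf[MAX_ARGS];          /* automatic storage, no malloc */
  memcpy (buf, args, nargs * sizeof (void *));
  /* ... pass buf to the kernel launch ... */
  return 0;
}
```

The bound makes the fast path allocation-free; anything exceeding it can fall
back to the slow path.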
> To me this kind of performance doesn't look like something that will be fixed
> by fine-tuning; it leaves me undecided whether the chosen approach (what you
> call the fundamentals) is viable at all.
If you draw conclusions just from the performance you got on LULESH without
looking at the benchmark's source (had you looked, you would have acknowledged
the lack of OpenMP simd and the significant source-level differences between
the CUDA and OpenMP implementations, such as the use of __shared__ in the CUDA
algorithms), then, I am sorry to say, that is just ridiculous. The
implementation on the branch is far from ideal, but that method of evaluation
is nonsensical.
> Performance is still better than the OpenACC version of the benchmark, but
> then I think we shouldn't repeat the mistakes we made with OpenACC and avoid
> merging something until we're sure it's ready and of benefit to users.
Would you kindly try to keep your commentary constructive? It's frustrating to
have to put up with hostilities: an ad hominem attack, ignored
nvptx-backend-related questions, and so on. How can the work get ready if all
you do is push back passively? Please trust me: I have experience with GPUs and
GCC.
There should be a process for getting this reviewed gradually, with fundamental
design decisions acked and patches reviewed before all the tweaks and
optimizations are in place. If you're suggesting that the work should proceed
on the branch without any kind of interim review, and then be reviewed in one
go once it satisfies some unspecified criteria of being "ready and of benefit",
that doesn't sound right to me.