On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their 
> > CUDA
> > version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use 
> openmp-simd
> anywhere. This should make it obvious that it won't be anywhere near any
> reasonable CUDA implementation, and also bound to be below host performance.

Yeah, perhaps just changing some or all #pragma omp distribute parallel for
into #pragma omp distribute parallel for simd could do something (of course,
one should actually analyze what it does, but if it is valid for distribute
without dist_schedule clause, then the loops ought to be without forward or
backward lexical dependencies (teams can't really synchronize, though they
can use some atomics).
That said, the OpenMP port of LULESH doesn't seem to be done very carefully,
e.g. in CalcHourglassControlForElems I see:
      /* Do a check for negative volumes */
      if ( v[i] <= Real_t(0.0) ) {
        vol_error = i;
      }
There is not any kind of explicit mapping of vol_error nor reduction of it,
so while in OpenMP 4.0 it would be just a possible data race (the var would
be map(tofrom: vol_error) by default and shared between teams/threads, so if
more than one thread decides to write it, it is a data race, in OpenMP 4.5
it is implicitly firstprivate(vol_error) and thus the changes to the var
(still racy) would just never be propagated back to the caller.

For the missing simd regions, it might be helpful if we were able to
"autovectorize" into the SIMT, but I guess that might be quite a lot of
work.

        Jakub

Reply via email to