https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125638
--- Comment #2 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Well, ideally one would not need to add #pragma omp simd at all, and the automatic vectorizer would choose the best code on its own. It clearly produces OK code for the CPU. But it should then also vectorize the GPU code automatically, which unfortunately is currently only fast enough with an explicit omp simd directive.

Furthermore, another surprise happens when you run that benchmark with clang, especially on the GPU. Clang has no support for OpenMP simd, but they had help from Nvidia to handle these automatically... I must run it again, but if I remember correctly, last time I got 5 ms on the GPU with clang and without optimisations for that benchmark, versus 28 ms with GCC at -O3 with simd, or 132 ms with GCC without simd... So there is clearly something wrong in the GPU branch.
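For illustration, the two variants being compared might look roughly like the sketch below. This is not the benchmark from this report: the kernel, the function names scale_add_simd and scale_add_nosimd, and the map clauses are hypothetical and only show where the simd keyword goes in an offloaded loop.

```c
#include <stddef.h>

/* Hypothetical kernel (assumption, not the code from this report):
   y[i] = a * x[i] + y[i], offloaded with an explicit simd clause. */
void scale_add_simd(float *restrict y, const float *restrict x,
                    float a, size_t n)
{
    /* Combined construct: offload to the device, spread the iterations
       over teams and threads, and explicitly request simd. */
    #pragma omp target teams distribute parallel for simd \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Same loop without the simd keyword; any vectorization here is left
   entirely to the compiler. */
void scale_add_nosimd(float *restrict y, const float *restrict x,
                      float a, size_t n)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; only the generated code may differ. With GCC this needs -fopenmp and an offload-enabled build to actually run on the GPU. Otherwise the loops fall back to the host, and without -fopenmp the pragmas are simply ignored.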
