Hi,

Great to see someone stepping up to actively optimize the performance of pocl!
Unfortunately our group has been very busy with other tasks (some of them pocl-related, though) and hasn't had the time to focus on kernel compilation much at all. There are several pieces of low-hanging fruit that would improve the implicit vectorization performance and might not require too much effort. Hopefully we can allocate some work time for this soon.

Anyway, I'm glad to help and provide guidance in the optimization effort. If you prefer real-time guidance, please join #pocl at irc.oftc.net, but this mailing list is fine too.

Some things to do to improve the autovectorization performance:

- Some time back there was an effort in LLVM to define a sort of vectorized library mechanism that would also work with LLVM's vectorizers. That is, it could autovectorize scalar built-in calls into their vectorized versions automatically. It would be good to check the status of that to avoid possible rework later. Currently Erik's optimized vectorized builtins are not used when autovectorizing, so they improve performance only for kernels that use explicit vector data types. (A rough registration sketch is appended at the end of this mail.)

- An easy one: I noted that pocl currently leaves the calls to the pseudo barriers intact, and they actually propagate down to the final binary. This is wasteful and likely blocks some optimizations because of the calls sitting between parallel (loop) regions. (A minimal cleanup-pass sketch is appended at the end of this mail.)

- The annoying "target address space" (TAS) pass should be removed, if possible. We should try to utilize the new kernel attributes that record the original OpenCL address spaces for whatever we need them for, and use the target's address spaces from the start (instead of the fake address spaces) in the IR. This alone possibly blocks some optimizations, as the vectorizers might get confused by the OpenCL AS IDs. A workaround is to rerun the optimizations after TAS, but a cleaner solution would be to not need this nasty pass at all. (See the metadata-reading sketch at the end of this mail.)

- In the longer run, the parallel region formation phase should be reworked to reduce the code duplication in tricky cases. I started on this some two years ago with some good results, but got distracted and the branch bit-rotted.

There are plenty more that I cannot remember right now.

On 08/17/2016 09:45 PM, Matthias Noack wrote:
> Currently, I get messages like:
>
>   remark: <unknown>:0:0: loop not vectorized: value that could not be
>   identified as reduction is used outside the loop
>   remark: <unknown>:0:0: loop not vectorized: use
>   -Rpass-analysis=loop-vectorize for more info
>
> so it seems that LLVM has trouble vectorising the kernel (while Intel
> OpenCL does).
>
> Any hint on how I can pass through that "-Rpass-analysis=loop-vectorize"?

I added this when I last looked at optimizing the kernel compiler:

  POCL_VECTORIZER_REMARKS
  When set to 1, prints out the remarks produced by LLVM's loop vectorizer
  during kernel compilation.

(http://portablecl.org/docs/html/env_variables.html)

Hopefully it still works.

BTW, autotools is deprecated and is likely to be removed at the beginning of the next release cycle, so please update the CMake build files only.

Also, Michal's distro mode is a very nice feature; here's the wiki page where we discussed the problem it tries to solve:
https://github.com/pocl/pocl/wiki/Install-time-or-run-time-built-kernel-builtin-libraries

BR,
-- Pekka
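
Sketch for the first bullet (vectorized builtin library): one possible direction is the vectorizable-functions table in TargetLibraryInfo, which the loop vectorizer consults when it meets a scalar call. This is only a rough sketch against the LLVM 3.x-era API (the VecDesc layout has changed in later releases), and the "_cl_*_float4" vector function names are hypothetical placeholders, not pocl's actual symbols:

```cpp
// Rough sketch: register scalar->vector builtin mappings with
// TargetLibraryInfo so the loop vectorizer may replace scalar calls with
// calls to the vectorized versions. LLVM 3.x-era API; VecDesc has changed
// in newer releases. The vector function names below are hypothetical.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/TargetLibraryInfo.h"

using namespace llvm;

static const VecDesc PoclVectorBuiltins[] = {
    // scalar name, vectorized counterpart (hypothetical), vectorization factor
    {"expf", "_cl_exp_float4", 4},
    {"logf", "_cl_log_float4", 4},
};

// Call this when setting up the TargetLibraryInfoImpl used by the kernel
// compiler's pass pipeline.
void registerPoclVectorBuiltins(TargetLibraryInfoImpl &TLII) {
  TLII.addVectorizableFunctions(PoclVectorBuiltins);
  // Afterwards a TargetLibraryInfo built on TLII answers
  // isFunctionVectorizable("expf") with true, and the vectorizer can emit
  // calls to "_cl_exp_float4" when it vectorizes by 4.
}
```

Whether this mechanism or the newer declare-simd-style work in LLVM is the better fit is exactly the status check mentioned in the bullet above.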
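Sketch for the pseudo-barrier bullet: a late cleanup could be as simple as a function pass that erases the leftover calls once the parallel regions have been formed. This is a sketch only, not pocl's actual pass, and the callee name "pocl.barrier" is an assumption about the marker function pocl emits; adjust it to the real symbol:

```cpp
// Sketch of a late cleanup pass (legacy pass manager) that erases leftover
// pseudo-barrier calls so they neither reach the final binary nor block
// optimizations between parallel regions. "pocl.barrier" is an assumed name.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct StripPseudoBarriers : public FunctionPass {
  static char ID;
  StripPseudoBarriers() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    SmallVector<CallInst *, 8> Dead;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (auto *CI = dyn_cast<CallInst>(&I))
          if (Function *Callee = CI->getCalledFunction())
            if (Callee->getName() == "pocl.barrier") // assumed marker name
              Dead.push_back(CI);
    for (CallInst *CI : Dead)
      CI->eraseFromParent();
    return !Dead.empty();
  }
};
} // namespace

char StripPseudoBarriers::ID = 0;
static RegisterPass<StripPseudoBarriers>
    X("strip-pseudo-barriers", "Erase leftover pocl pseudo-barrier calls");
```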
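Sketch for the TAS bullet: the original OpenCL address space of each kernel argument can be read back from the metadata clang attaches to OpenCL kernels, which is presumably what "the new kernel attributes" refers to. This assumes function-level kernel_arg_addr_space metadata; older toolchains hang the same node off the opencl.kernels named metadata instead:

```cpp
// Sketch: recover the original OpenCL address space of each kernel argument
// from the kernel_arg_addr_space metadata, instead of keeping fake address
// spaces alive in the IR. Assumes function-level metadata.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Returns one address-space ID per kernel argument (0 = private, 1 = global,
// 2 = constant, 3 = local in the SPIR convention), or an empty vector if the
// metadata is missing.
SmallVector<unsigned, 8> getOpenCLArgAddrSpaces(const Function &F) {
  SmallVector<unsigned, 8> Result;
  if (MDNode *MD = F.getMetadata("kernel_arg_addr_space"))
    for (const MDOperand &Op : MD->operands())
      Result.push_back(mdconst::extract<ConstantInt>(Op)->getZExtValue());
  return Result;
}
```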
