Hi,

Great to see someone stepping up to actively optimize the performance of pocl!
Unfortunately our group has been very busy with other tasks (some of them pocl-related, though) and hasn't had the time to focus on kernel compilation much at all. There are several pieces of low-hanging fruit that would improve the implicit vectorization performance and might not require too much effort. Hopefully we can allocate some work time for this soon.

Anyway, I'm glad to help and provide guidance in the optimization effort. If you prefer real-time guidance, please join #pocl at irc.oftc.net, but this mailing list is fine too.

Some things to do to improve the autovectorization performance:

- Some time back there was an effort in LLVM to define a sort of vectorized library mechanism that would also work with LLVM's vectorizers. That is, it could autovectorize scalar built-in calls into their vectorized versions automatically. It would be good to check the status of that to avoid possible rework later. Currently Erik's optimized vectorized builtins are not used when autovectorizing, so they improve performance only for kernels that use explicit vector data types. (A rough registration sketch is appended at the end of this mail.)

- An easy one: I noted that pocl currently leaves the calls to the pseudo barriers intact, and they actually propagate down to the final binary. This is wasteful and likely blocks some optimizations because of the calls sitting between parallel (loop) regions. (A minimal cleanup-pass sketch is appended at the end of this mail.)

- The annoying "target address space" (TAS) pass should be removed, if possible. We should try to utilize the new kernel attributes that record the original OpenCL address spaces for whatever we need them for, and use the target's address spaces from the start (instead of the fake address spaces) in the IR. This alone possibly blocks some optimizations, as the vectorizers might get confused by the OpenCL AS IDs. A workaround is to rerun the optimizations after TAS, but a cleaner solution would be to not need this nasty pass at all. (See the metadata-reading sketch at the end of this mail.)

- In the longer run, the parallel region formation phase should be reworked to reduce the code duplication in tricky cases. I started on this some two years ago with some good results, but got distracted and the branch bit-rotted.

There are plenty more that I cannot remember right now.

On 08/17/2016 09:45 PM, Matthias Noack wrote:
> Currently, I get messages like:
>
>   remark: <unknown>:0:0: loop not vectorized: value that could not be
>   identified as reduction is used outside the loop
>   remark: <unknown>:0:0: loop not vectorized: use
>   -Rpass-analysis=loop-vectorize for more info
>
> so it seems that LLVM has trouble vectorising the kernel (while Intel
> OpenCL does).
>
> Any hint on how I can pass through that "-Rpass-analysis=loop-vectorize"?

I added this when I last looked at optimizing the kernel compiler:

  POCL_VECTORIZER_REMARKS
  When set to 1, prints out the remarks produced by LLVM's loop vectorizer
  during kernel compilation.

(http://portablecl.org/docs/html/env_variables.html)

Hopefully it still works.

BTW, autotools is deprecated and is likely to be removed at the beginning of the next release cycle, so please update the CMake build files only.

Also, Michal's distro mode is a very nice feature; here's the wiki page where we discussed the problem it tries to solve:
https://github.com/pocl/pocl/wiki/Install-time-or-run-time-built-kernel-builtin-libraries

BR,
-- Pekka
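
Sketch for the first bullet (vectorized builtin library): one possible direction is the vectorizable-functions table in TargetLibraryInfo, which the loop vectorizer consults when it meets a scalar call. This is only a rough sketch against the LLVM 3.x-era API (the VecDesc layout has changed in later releases), and the "_cl_*_float4" vector function names are hypothetical placeholders, not pocl's actual symbols:

```cpp
// Rough sketch: register scalar->vector builtin mappings with
// TargetLibraryInfo so the loop vectorizer may replace scalar calls with
// calls to the vectorized versions. LLVM 3.x-era API; VecDesc has changed
// in newer releases. The vector function names below are hypothetical.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/TargetLibraryInfo.h"

using namespace llvm;

static const VecDesc PoclVectorBuiltins[] = {
    // scalar name, vectorized counterpart (hypothetical), vectorization factor
    {"expf", "_cl_exp_float4", 4},
    {"logf", "_cl_log_float4", 4},
};

// Call this when setting up the TargetLibraryInfoImpl used by the kernel
// compiler's pass pipeline.
void registerPoclVectorBuiltins(TargetLibraryInfoImpl &TLII) {
  TLII.addVectorizableFunctions(PoclVectorBuiltins);
  // Afterwards a TargetLibraryInfo built on TLII answers
  // isFunctionVectorizable("expf") with true, and the vectorizer can emit
  // calls to "_cl_exp_float4" when it vectorizes by 4.
}
```

Whether this mechanism or the newer declare-simd-style work in LLVM is the better fit is exactly the status check mentioned in the bullet above.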
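Sketch for the pseudo-barrier bullet: a late cleanup could be as simple as a function pass that erases the leftover calls once the parallel regions have been formed. This is a sketch only, not pocl's actual pass, and the callee name "pocl.barrier" is an assumption about the marker function pocl emits; adjust it to the real symbol:

```cpp
// Sketch of a late cleanup pass (legacy pass manager) that erases leftover
// pseudo-barrier calls so they neither reach the final binary nor block
// optimizations between parallel regions. "pocl.barrier" is an assumed name.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct StripPseudoBarriers : public FunctionPass {
  static char ID;
  StripPseudoBarriers() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    SmallVector<CallInst *, 8> Dead;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (auto *CI = dyn_cast<CallInst>(&I))
          if (Function *Callee = CI->getCalledFunction())
            if (Callee->getName() == "pocl.barrier") // assumed marker name
              Dead.push_back(CI);
    for (CallInst *CI : Dead)
      CI->eraseFromParent();
    return !Dead.empty();
  }
};
} // namespace

char StripPseudoBarriers::ID = 0;
static RegisterPass<StripPseudoBarriers>
    X("strip-pseudo-barriers", "Erase leftover pocl pseudo-barrier calls");
```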
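Sketch for the TAS bullet: the original OpenCL address space of each kernel argument can be read back from the metadata clang attaches to OpenCL kernels, which is presumably what "the new kernel attributes" refers to. This assumes function-level kernel_arg_addr_space metadata; older toolchains hang the same node off the opencl.kernels named metadata instead:

```cpp
// Sketch: recover the original OpenCL address space of each kernel argument
// from the kernel_arg_addr_space metadata, instead of keeping fake address
// spaces alive in the IR. Assumes function-level metadata.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Returns one address-space ID per kernel argument (0 = private, 1 = global,
// 2 = constant, 3 = local in the SPIR convention), or an empty vector if the
// metadata is missing.
SmallVector<unsigned, 8> getOpenCLArgAddrSpaces(const Function &F) {
  SmallVector<unsigned, 8> Result;
  if (MDNode *MD = F.getMetadata("kernel_arg_addr_space"))
    for (const MDOperand &Op : MD->operands())
      Result.push_back(mdconst::extract<ConstantInt>(Op)->getZExtValue());
  return Result;
}
```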
