Hello Timo,
I'm glad to hear you are willing to contribute to the cause of
open and performance-portable OpenCL.
Beware, though: parts of the kernel compiler need major rewrites for
clarity, and unfortunately only a few people are working on the kernel
compiler. But hopefully we can soon count you in as one :)
This reminds me that I should really write the "how to tune and hack the
pocl kernel compiler" document.
Maybe this is a starter for that:
There are several useful environment variables for debugging and analyzing
the kernel compiler optimizations:
http://portablecl.org/docs/html/env_variables.html
First, you can make pocl dump more debug output from LLVM and its vectorizer:
* POCL_DEBUG_LLVM_PASSES
When set to 1, enables debug output from LLVM passes during optimization.
* POCL_VECTORIZER_REMARKS
When set to 1, prints out remarks produced by the loop vectorizer of LLVM
during kernel compilation.
To take a closer look at the kernel compiler's intermediate results,
you can instruct pocl to keep the temporary LLVM bitcode files (normally it
deletes them once they are no longer needed):
* POCL_CACHE_DIR
It's useful to set this to a local temp dir which you can clear between
trials.
* POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES
Set this to 1 to keep the temporary files.
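Since you run the kernels under PyOpenCL, here is a rough sketch of how the variables above could be set from inside the Python script itself. The helper names (pocl_debug_env, enable_pocl_debug) are made up for illustration and are not part of pocl or PyOpenCL. I'm also assuming the variables need to be in the environment before pocl is first loaded, so call it before importing pyopencl.

```python
import os
import tempfile


def pocl_debug_env(cache_dir=None):
    """Return the pocl debug settings from this mail as a dict.

    If no cache_dir is given, a fresh temp dir is created so that each
    trial starts from an empty kernel cache.
    """
    if cache_dir is None:
        cache_dir = tempfile.mkdtemp(prefix="pocl-cache-")
    return {
        "POCL_DEBUG_LLVM_PASSES": "1",
        "POCL_VECTORIZER_REMARKS": "1",
        "POCL_CACHE_DIR": cache_dir,
        "POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES": "1",
    }


def enable_pocl_debug(cache_dir=None):
    """Export the settings into this process and return the cache dir."""
    env = pocl_debug_env(cache_dir)
    os.environ.update(env)
    return env["POCL_CACHE_DIR"]
```

Typical use would be to call enable_pocl_debug() at the very top of the script, then import pyopencl and build the kernel as usual. Exporting the same variables in the shell before starting Python works just as well.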
Then, after executing your OpenCL app, you will find .bc files under
your temp dir. The most interesting one is parallel.bc, which is
the final IR produced by pocl and LLVM before codegen. If you don't
see vector LLVM IR there, it is unlikely to appear in your final
binary either.
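A minimal sketch of that check, assuming you have turned the bitcode into text with LLVM's llvm-dis tool (e.g. "llvm-dis parallel.bc -o parallel.ll"). The function names are hypothetical, and the search makes no assumption about the directory layout inside the cache dir. Note that the pattern matches any LLVM vector type such as <8 x float>, so it will also fire on vector types you wrote explicitly in the kernel source (float4 etc.), not only on code produced by the vectorizer.

```python
import os
import re

# LLVM IR spells vector types as "<N x type>", e.g. "<8 x float>".
# Array types use square brackets, so they do not match.
VECTOR_TYPE = re.compile(r"<\d+ x [^<>]+>")


def find_parallel_bc(cache_dir):
    """Return all parallel.bc files found below cache_dir, sorted."""
    hits = []
    for root, _dirs, files in os.walk(cache_dir):
        if "parallel.bc" in files:
            hits.append(os.path.join(root, "parallel.bc"))
    return sorted(hits)


def vector_lines(ir_text):
    """Return the lines of textual LLVM IR that mention a vector type."""
    return [
        line.strip()
        for line in ir_text.splitlines()
        if VECTOR_TYPE.search(line)
    ]
```

If vector_lines() comes back empty for the disassembled parallel.bc of your kernel, the work-item loops were most likely not vectorized, and the vectorizer remarks are the next place to look.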
To start hacking:
http://portablecl.org/docs/html/kernel_compiler.html
Our pocl paper might also provide additional help, but the link above should
give a good overview, although it might be outdated (I've added updating it
to my task list).
The LLVM passes are under lib/llvmopencl. The layer between OpenCL
runtime and the kernel compiler is in files lib/CL/pocl_llvm*.c
Please don't hesitate to ask for further instructions here or on IRC.
BR,
Pekka
On 02/07/2018 02:20 AM, Timo Betcke wrote:
Hi,
we noticed for one of our OpenCL kernels that pocl is over 4 times slower
than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the
auto vectorizer. How can I debug this and figure out if vectorization across
work items is being performed with pocl? The kernels are running under
PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0.
We are planning to distribute our software and would prefer to have good
performance on pocl and not have to rely on the Intel environment.
Best wishes
Timo
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel