Hi Erik,
thanks for your reply.
On 17.08.2016 19:37, Erik Schnetter wrote:
> On Wed, Aug 17, 2016 at 11:16 AM, Matthias Noack
> <[email protected] <mailto:[email protected]>> wrote:
> pocl will compile the OpenCL kernel library at build time. This is the
> support library containing the definitions of functions such as sin,
> cos, sqrt, etc., including their vector counterparts. For best
> performance, you need an architecture-optimized version of this library.
I understand. I was wondering about the origin of the math built-ins as
the library mentions some external library. The documentation mentions
Vecmathlib, which seems to be written by you. Is it always used or do I
need to activate it somehow?
A colleague of mine and I recently looked into different SIMD coding
techniques using gcc, clang and the Intel compiler. Intel has its
libsvml for vectorised math functions, GNU comes with libmvec in newer
glibc versions, but LLVM/clang seems to lack an equivalent. Manual
vectorisation with intrinsic-wrapping C++ class libraries like Vc, which
comes with its own math function implementations, were the only way to
get good performance with LLVM/clang.
> On a side note, pocl's support for AVX-512 was implemented targeting
> the Intel MIC architecture found e.g. on Stampede.
Just to be sure we are on the same page, and for everyone else who might
read this, some terminology and facts:
- MIC (Many Integrated Cores) is the architecture on which the Intel
Xeon Phi product line is based
- KNC (Knights Corner) was the first Intel Xeon Phi (71xx) product line
- coprocessor only, almost x86-64, but own binary format ('k1om',
because no SSE2 registers as needed by x86-64 calling convention)
- 512-bit SIMD units, but *not* AVX-512
  - cross compilation with the Intel compiler for applications only (and
    a patched gcc for its Linux-based OS)
- Intel OpenCL with KNC-SIMD instructions is available but was
discontinued in newer releases
- KNL (Knights Landing), is the current Intel Xeon Phi 72xx product line
(officially released at ISC'16, in June)
  - bootable CPU, fully x86-64 compatible
  - AVX-512, and everything before it
- all x86-64 compilers and frameworks work, but performance depends
on AVX-512 support and platform-specific optimisations
=> lots of stuff to try
  - no official Intel OpenCL support, but the x86-64 SDK kind of works
    with AVX2 (not competitive with OpenMP performance)
Here are some very early numbers comparing AVX2 and AVX-512 using
basically the same benchmarks with mostly OpenMP, which I now use for
OpenCL:
https://drive.google.com/open?id=0B9D5EnxRqcaZaU1vbWJHUklMSWs
> I don't know whether the compiler intrinsics are identical to the
> current version of AVX-512. If not, and if the kernel library is
> important to you, then I will be happy to assist updating the
> respective parts of pocl.
There are slight differences, but it's not much. Sadly, I don't know of
any publicly available document listing them, only the Intel intrinsics
guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Also, there is "AVX-512 Common", the portable portion, "AVX-512 MIC" with
some extensions for numerical workloads, and some others.
For PoCL on KNL, the first question is: Is the KNC/MIC implementation of
the kernel library usable on KNL, and is it already used? My best guess
is that PoCL's build system won't use it if built natively on a KNL
system. So maybe we should try to enforce it, see what happens, and
fix it if necessary.
> Since I do not (yet!) have access to a KNL system, this might involve
> some trial and error.
Can't provide you with direct access. ;-) But I guess we could work
together in a PoCL fork on GitHub and I can run tests as needed.
Currently, I get messages like:
remark: <unknown>:0:0: loop not vectorized: value that could not be
identified as reduction is used outside the loop
remark: <unknown>:0:0: loop not vectorized: use
-Rpass-analysis=loop-vectorize for more info
so it seems that LLVM has trouble vectorising the kernel (while Intel
OpenCL does).
Any hint on how I can pass through that "-Rpass-analysis=loop-vectorize"?
Performance for basic arithmetic operators and FMA is off by 4 to 5x
(i.e. slower than Intel OpenCL with AVX2) - hopefully that's the 4x
vectorisation advantage of Intel OpenCL. Run times for built-ins are
close for e.g. exp(), but off by > 10x for log().
Well, any input is welcome. :-)
Cheers,
Matthias
------------------------------------------------------------------------------
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel