Hi Erik,

thanks for your reply.

On 17.08.2016 19:37, Erik Schnetter wrote:
> On Wed, Aug 17, 2016 at 11:16 AM, Matthias Noack <[email protected]> wrote:

> pocl will compile the OpenCL kernel library at build time. This is the support library containing the definitions of functions such as sin, cos, sqrt, etc., including their vector counterparts. For best performance, you need an architecture-optimized version of this library.

I understand. I was wondering about the origin of the math built-ins, since the code refers to an external library. The documentation mentions Vecmathlib, which seems to be written by you. Is it always used, or do I need to enable it somehow?

A colleague of mine and I recently looked into different SIMD coding techniques using gcc, clang and the Intel compiler. Intel has its libsvml for vectorised math functions, GNU comes with libmvec in newer glibc versions, but LLVM/clang seems to lack an equivalent. Manual vectorisation with intrinsic-wrapping C++ class libraries like Vc, which comes with its own math function implementations, was the only way to get good performance with LLVM/clang.

> On a side note, pocl's support for AVX-512 was implemented targeting the Intel MIC architecture found e.g. on Stampede.
Just to be sure we are on the same page, and for everyone else who might read this, some terminology and facts:

- MIC (Many Integrated Cores) is the architecture on which the Intel Xeon Phi product line is based

- KNC (Knights Corner) was the first Intel Xeon Phi product line (71xx)
    - coprocessor only; almost x86-64, but with its own binary format ('k1om', since it lacks the SSE2 registers required by the x86-64 calling convention)
    - 512-bit SIMD units, but *not* AVX-512
    - cross-compilation with the Intel compiler for applications only (plus a patched gcc for its Linux-based OS)
    - Intel OpenCL with KNC SIMD instructions is available, but was discontinued in newer releases

- KNL (Knights Landing) is the current Intel Xeon Phi product line (72xx, officially released at ISC'16 in June)
    - bootable CPU, fully x86-64 compatible (i.e. runs standard x86-64 binaries)
    - AVX-512, plus everything before it
    - all x86-64 compilers and frameworks work, but performance depends on AVX-512 support and platform-specific optimisations
      => lots of stuff to try
    - no official Intel OpenCL support, but the x86-64 SDK kind of works with AVX2 (not competitive with OpenMP performance)

Here are some very early numbers comparing AVX2 and AVX-512, using essentially the same benchmarks (mostly OpenMP) that I now use for OpenCL:
https://drive.google.com/open?id=0B9D5EnxRqcaZaU1vbWJHUklMSWs

> I don't know whether the compiler intrinsics are identical to the current version of AVX-512. If not, and if the kernel library is important to you, then I will be happy to assist updating the respective parts of pocl.

There are slight differences, but they are minor. Sadly, I don't know of any publicly available document listing them, only the Intel intrinsics guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

There is also "AVX-512 Common", the portable portion, "AVX-512 MIC" with some extensions for numerical workloads, and some others.

For PoCL on KNL, the first question is: is the KNC/MIC implementation of the kernel library usable on KNL, and is it already used? My best guess is that PoCL's build system won't use it when built natively on a KNL system. So maybe we should try to enforce it, see what happens, and fix things if necessary.
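One place to start forcing this could be the build configuration. A sketch only: the CMake cache variable name below is an assumption from memory of PoCL's build options at the time (used to override the kernel compiler's target CPU instead of the autodetected host) and should be checked against the PoCL install documentation before use.

```shell
# Hypothetical configure step: override the target CPU for the kernel
# library build, so the MIC-oriented code paths are exercised on KNL
# rather than whatever the build system autodetects natively.
# LLC_HOST_CPU is an assumption; verify against PoCL's docs.
cmake -DLLC_HOST_CPU=knl \
      -DCMAKE_BUILD_TYPE=Release \
      /path/to/pocl
```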

> Since I do not (yet!) have access to a KNL system, this might involve some trial and error.
I can't provide you with direct access. ;-) But I guess we could work together in a PoCL fork on GitHub, and I can run tests as needed.

Currently, I get messages like:

remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: <unknown>:0:0: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info

so it seems that LLVM has trouble vectorising the kernels (while Intel OpenCL manages to).

Any hint on how I can pass that "-Rpass-analysis=loop-vectorize" through to the kernel compiler?

Performance for basic arithmetic operators and FMA is off by 4 to 5x (i.e. slower than Intel OpenCL with AVX2) - hopefully that's just the 4x vectorisation advantage of Intel OpenCL. Runtimes for built-ins are close for e.g. exp(), but off by more than 10x for log().

Well, any input is welcome. :-)

Cheers,
Matthias
------------------------------------------------------------------------------
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel