Hi Erik,
thanks for your reply.
On 17.08.2016 19:37, Erik Schnetter wrote:
> On Wed, Aug 17, 2016 at 11:16 AM, Matthias Noack
> <[email protected] <mailto:[email protected]>> wrote:
> pocl will compile the OpenCL kernel library at build time. This is the
> support library containing the definitions of functions such as sin,
> cos, sqrt, etc., including their vector counterparts. For best
> performance, you need an architecture-optimized version of this library.
I understand. I was wondering about the origin of the math built-ins as
the library mentions some external library. The documentation mentions
Vecmathlib, which seems to be written by you. Is it always used or do I
need to activate it somehow?
A colleague of mine and I recently looked into different SIMD coding
techniques using gcc, clang and the Intel compiler. Intel has its
libsvml for vectorised math functions, GNU comes with libmvec in newer
glibc versions, but LLVM/clang seems to lack an equivalent. Manual
vectorisation with intrinsic-wrapping C++ class libraries like Vc, which
comes with its own math function implementations, were the only way to
get good performance with LLVM/clang.
> On a side note, pocl's support for AVX-512 was implemented targeting
> the Intel MIC architecture found e.g. on Stampede.
Just to be sure we are on the same page, and for everyone else who might
read this, some terminology and facts:
- MIC (Many Integrated Cores) is the architecture on which the Intel
Xeon Phi product line is based
- KNC (Knights Corner) was the first Intel Xeon Phi (71xx) product line
- coprocessor only, almost x86-64, but own binary format ('k1om',
because no SSE2 registers as needed by x86-64 calling convention)
- 512-bit SIMD units, but *not* AVX-512
  - cross compilation with the Intel compiler for applications only (and
    a patched gcc for its Linux-based OS)
- Intel OpenCL with KNC-SIMD instructions is available but was
discontinued in newer releases
- KNL (Knights Landing), is the current Intel Xeon Phi 72xx product line
(officially released at ISC'16, in June)
  - bootable CPU, fully x86-64 compatible
  - AVX-512, and everything before it
- all x86-64 compilers and frameworks work, but performance depends
on AVX-512 support and platform-specific optimisations
=> lots of stuff to try
  - no official Intel OpenCL support, but the x86-64 SDK kind of works
    with AVX2 (not competitive with OpenMP performance)
Here are some very early numbers comparing AVX2 and AVX-512 using
basically the same benchmarks with mostly OpenMP, which I now use for
OpenCL:
https://drive.google.com/open?id=0B9D5EnxRqcaZaU1vbWJHUklMSWs
> I don't know whether the compiler intrinsics are identical to the
> current version of AVX-512. If not, and if the kernel library is
> important to you, then I will be happy to assist updating the
> respective parts of pocl.
There are slight differences, but it's not much. Sadly, I don't know of
any publicly available document listing them, only the Intel intrinsics
guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Also, there is "AVX-512 Common", the portable portion, "AVX-512 MIC" with
some extensions for numerical workloads, and some others.
For PoCL on KNL, the first question is: Is the KNC/MIC implementation of
the kernel library usable on KNL, and is it already used? My best guess
is that PoCL's build system won't use it if built natively on a KNL
system. So maybe we should try to enforce it, see what happens, and
fix it if necessary.
> Since I do not (yet!) have access to a KNL system, this might involve
> some trial and error.
Can't provide you with direct access. ;-) But I guess we could work
together in a PoCL fork on GitHub and I can run tests as needed.
Currently, I get messages like:
remark: <unknown>:0:0: loop not vectorized: value that could not be
identified as reduction is used outside the loop
remark: <unknown>:0:0: loop not vectorized: use
-Rpass-analysis=loop-vectorize for more info
so it seems that LLVM has trouble vectorising the kernel (while Intel
OpenCL does).
Any hint on how I can pass through that "-Rpass-analysis=loop-vectorize"?
Performance for basic arithmetic operators and FMA is off by 4 to 5x
(i.e. slower than Intel OpenCL with AVX2) - hopefully that's the 4x
vectorisation advantage of Intel OpenCL. Run times for built-ins are
close for e.g. exp(), but off by > 10x for log().
Well, any input is welcome. :-)
Cheers,
Matthias
------------------------------------------------------------------------------
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel