On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]>
wrote:

> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
> is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
> found; I don't have a machine to test it anymore).
>


Indeed, Xeon W [1] is a sibling of the Xeon Scalable and Core X-series
processors of the Skylake generation, which I'll refer to as SKX since they
are microarchitecturally the same.  All of these support AVX-512, which I'm
going to call AVX3 below, because, as we'll see, the instruction set is not
tied to 512b registers.

An important detail when evaluating vectorization on these processors is
that the core frequency drops when transitioning from scalar/SSE2 code to
AVX2 code to AVX3 (i.e. AVX-512) code [2], corresponding to the use of xmm
(128b), ymm (256b), and zmm (512b) registers, respectively.  AVX3
instructions operating on ymm registers should run at the AVX2 frequency.
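
To make the register widths concrete, here is a sketch of the same vector
add at each width using compiler intrinsics (the function names and
pointers are illustrative; compile with something like
-march=skylake-avx512):

    #include <immintrin.h>

    // The same add at each register width; which width the vectorizer
    // picks determines the frequency level the core runs at.
    void add_xmm(float* a, const float* b) {  // 128b: scalar/SSE2 frequency
        _mm_storeu_ps(a, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }
    void add_ymm(float* a, const float* b) {  // 256b: AVX2 frequency
        _mm256_storeu_ps(a,
            _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
    }
    void add_zmm(float* a, const float* b) {  // 512b: AVX3 frequency
        _mm512_storeu_ps(a,
            _mm512_add_ps(_mm512_loadu_ps(a), _mm512_loadu_ps(b)));
    }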

While most (but not all - see [3]) parts have 2 VPUs, the first of these is
implemented via port fusion [4].  This means the core can dispatch 2 512b
AVX3 instructions on ports 0+1 and 5, or it can dispatch 3 256b
instructions (AVX2 or AVX3) on ports 0, 1 and 5.  Thus, one can get 1024b
of throughput per cycle at one frequency or 768b per cycle at a slightly
higher frequency.  The upshot is that 512b vectorization pays off for code
that is thoroughly compute-bound and heavily vectorized (e.g. dense linear
algebra and molecular dynamics), but 256b vectorization is likely better
for code that is more memory-bound or doesn't vectorize as well.
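
To put numbers on that trade-off: zmm code retires up to 2 x 512b = 1024b
of SIMD work per cycle at the AVX3 frequency f_zmm, while ymm code retires
up to 3 x 256b = 768b per cycle at the AVX2 frequency f_ymm.  For fully
vectorized, compute-bound code, zmm therefore wins whenever
f_zmm / f_ymm > 768 / 1024 = 0.75, i.e. whenever the AVX3 frequency
penalty is less than 25%.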

The Intel C/C++ compiler has a flag, -qopt-zmm-usage={low,high}, to address
this: "-xCORE-AVX512 -qopt-zmm-usage=low" takes advantage of all the AVX3
instructions but favors 256b ymm registers, which behave exactly like AVX2
in cases where the AVX3-only instruction features aren't used.
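
For example (assuming a version of ICC recent enough to have this flag):

    icc -O3 -xCORE-AVX512 -qopt-zmm-usage=low  foo.c   # AVX3, favor ymm
    icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high foo.c   # AVX3, favor zmm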

Anyway, the short version of this story is that you should not assume 512b
SIMD code generation is the reason for a performance benefit from the Intel
OpenCL compiler, since it may in fact not generate those instructions if it
thinks that 256b is better.  It would be useful to pin both POCL and Intel
OpenCL to the same vector ISA in experiments - first SSE2, then AVX2 - to
see how they compare when generating the same class of instructions.  This
sort of comparison would also be helpful in resolving an older bug report
of a similar nature [5].
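
One way to check what a runtime actually generated - a sketch, assuming
you can get at the JIT-compiled kernel binary (POCL caches its compiled
kernels on disk, with POCL_CACHE_DIR controlling the location, if I
remember correctly) - is to count register classes in the disassembly:

    objdump -d kernel.so | grep -c 'zmm'   # hits => 512b code was emitted
    objdump -d kernel.so | grep -c 'ymm'   # 256b

(kernel.so here is a placeholder for whatever the cached object is called.)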

What I wrote here is one engineer's attempt to summarize a large amount of
information in a user-friendly format.  I apologize for any errors - they
are certainly not intentional.

[1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
[2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
[3] https://github.com/jeffhammond/vpu-count
[4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
[5] https://github.com/pocl/pocl/issues/292


> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA? If so, you'll likely see very bad performance, as
> pocl has no NUMA tuning currently. Also, I've occasionally seen pocl
> unroll too much and overflow the L1 cache (you could try experimenting
> with various local WG sizes passed to clEnqueueNDRangeKernel).
> Unfortunately this part of pocl has received little attention lately...
>
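
To experiment with local work-group sizes as suggested above, one can sweep
the local size argument of clEnqueueNDRangeKernel.  A minimal host-side
sketch (the queue/kernel setup and the problem size are placeholders):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Sweep local work-group sizes for a 1-D kernel; queue and kernel
     * are assumed to be created and set up elsewhere. */
    void sweep_wg_sizes(cl_command_queue queue, cl_kernel kernel)
    {
        const size_t global = 1024 * 1024;   /* placeholder problem size */
        const size_t locals[] = { 16, 32, 64, 128, 256 };
        for (int i = 0; i < 5; ++i) {
            size_t local = locals[i];
            cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                                &global, &local,
                                                0, NULL, NULL);
            clFinish(queue);                 /* time this region externally */
            if (err != CL_SUCCESS)
                printf("local=%zu failed: %d\n", local, err);
        }
    }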

I don't know what POCL uses for threading, but Intel OpenCL uses the TBB
runtime [6].  The TBB runtime has some very smart features for
load-balancing and automatic cache blocking that are not implemented in
OpenMP and are hard to implement by hand in Pthreads.
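
For reference, the kind of thing TBB makes easy - this is a generic
sketch, not POCL or Intel OpenCL internals - is a work-stealing parallel
loop where the runtime picks and adapts the chunk size:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    // Work-stealing parallel loop; TBB's default auto_partitioner chooses
    // chunk sizes dynamically, which is where the load balancing and
    // cache-friendly blocking mentioned above come from.
    void scale_add(float* y, const float* x, float a, size_t n) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [=](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    y[i] += a * x[i];
            });
    }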

[6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611

Jeff

-- 
Jeff Hammond
[email protected]
http://jeffhammond.github.io/