Dear Jeff,
Thanks for the explanations. I have now installed pocl on my Xeon W
workstation, and the benchmarks are as follows
(pure kernel runtime via event timers this time, to exclude Python overhead):
1) Intel OpenCL driver: 0.0965 s
2) pocl: 0.937 s
3) AMD CPU OpenCL driver: 0.64 s
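
For reference, the timing was measured along these lines (a minimal
PyOpenCL sketch; the kernel source and sizes below are placeholders, not
our actual kernel):

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # Enable profiling so kernel runtimes come from device-side timestamps.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    src = "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }"
    prg = cl.Program(ctx, src).build()

    x = np.ones(1 << 20, dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=x)

    evt = prg.scale(queue, x.shape, None, buf)
    evt.wait()
    # Pure kernel time in seconds, excluding Python/host overhead.
    print((evt.profile.end - evt.profile.start) * 1e-9, "s")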
The CPU is a Xeon W-2155 at 3.3 GHz with 10 cores. I have not had time to
investigate the LLVM IR code as suggested,
but will do so as soon as possible. AMD is included because I have a Radeon
Pro card, whose driver package also installed an OpenCL CPU driver.
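
For the IR investigation, my plan, if I read the pocl documentation
correctly, is to keep the kernel compiler's temporary files (which should
include the LLVM IR) via an environment variable, roughly like this (the
cache location may differ on other setups):

    import os
    # Ask pocl to keep the kernel compiler's temporary files (.ll/.bc etc.)
    # instead of deleting them after the build. Must be set before the
    # pocl ICD is loaded, i.e. before importing pyopencl.
    os.environ["POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES"] = "1"

    import pyopencl as cl
    # ... build and run the kernel as usual, then inspect the IR files
    # left under pocl's kernel cache (by default under ~/.cache/pocl/).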
Best wishes
Timo
On 7 February 2018 at 16:03, Jeff Hammond <[email protected]> wrote:
>
>
> On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]>
> wrote:
>
>> Hi,
>>
>> > we noticed for one of our OpenCL kernels that pocl is over 4 times
>> > slower than the Intel OpenCL runtime on a Xeon W processor.
>>
>> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
>> is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
>> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
>> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
>> found; I don't have a machine to test it anymore).
>>
>
>
> Indeed, Xeon W [1] is a sibling of Xeon Scalable and Core X-series of the
> Skylake generation, which I'll refer to as SKX since they are
> microarchitecturally the same. All of these support AVX-512, which I'm
> going to refer to as AVX3 in the following, for reasons that will become
> clear.
>
> An important detail when evaluating vectorization on these processors is
> that the frequency drops when transitioning from scalar/SSE2 code to AVX2
> code to AVX3 (i.e. AVX-512) code [2]; these classes correspond to the use
> of xmm (128b), ymm (256b), and zmm (512b) registers, respectively. AVX3
> instructions with ymm registers should run at AVX2 frequency.
>
> While most (but not all - see [3]) parts have 2 VPUs, the first of these
> is implemented via port fusion [4]. What this means is that the core can
> dispatch 2 512b AVX3 instructions on ports 0+1 and 5, or it can dispatch 3
> 256b instructions (AVX2 or AVX3) on ports 0, 1 and 5. Thus, one can get
> 1024b throughput at one frequency or 768b throughput at a slightly higher
> frequency. The upshot is that 512b vectorization pays off for code
> that is thoroughly compute-bound and heavily vectorized (e.g. dense linear
> algebra and molecular dynamics), but that 256b vectorization is likely
> better for code that is more memory-bound or doesn't vectorize as well.
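>
> To make that trade-off concrete, a back-of-the-envelope comparison (the
> frequencies below are made up for illustration; the real license
> frequencies are part-specific, see [2]):
>
>     # Effective throughput ~ SIMD bits dispatched per cycle * frequency.
>     f_avx3 = 2.4e9  # hypothetical frequency (Hz) while running zmm code
>     f_avx2 = 2.8e9  # hypothetical frequency (Hz) while running ymm code
>
>     zmm = 1024 * f_avx3  # 2 x 512b per cycle at AVX3 frequency
>     ymm = 768 * f_avx2   # 3 x 256b per cycle at AVX2 frequency
>
>     # 512b wins only if 1024*f_avx3 > 768*f_avx2, i.e. f_avx3/f_avx2 > 0.75,
>     # and only for code that actually keeps the vector ports busy.
>     print("zmm wins" if zmm > ymm else "ymm wins")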
>
> The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high} to address
> this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is going to take advantage
> of all the AVX3 instructions but favor 256b ymm registers, which will
> behave exactly like AVX2 in some cases (i.e. ones where the AVX3
> instruction features aren't used).
>
> Anyway, the short version of this story is that you should not assume
> 512b SIMD code generation is the reason for a performance benefit from the
> Intel OpenCL compiler, since it may in fact not generate those instructions
> if it thinks that 256b is better. It would be useful to force both POCL
> and Intel OpenCL to the same vector ISA (SSE2 in one experiment, AVX2 in
> another) to see how they compare when targeting identical instructions.
> This sort of comparison would also be helpful to resolve an older bug
> report of a similar nature [5].
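>
> One way to check what a runtime actually generated is to dump the program
> "binary" and inspect it for zmm/ymm usage (a PyOpenCL sketch; note the
> binary format is implementation-defined, so disassembly with standard
> tools may or may not work depending on the runtime):
>
>     import pyopencl as cl
>
>     ctx = cl.create_some_context()
>     src = "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }"
>     prg = cl.Program(ctx, src).build()
>
>     # Dump whatever the runtime reports via CL_PROGRAM_BINARIES; this may
>     # be a native object file, LLVM bitcode, or a vendor container.
>     for i, binary in enumerate(prg.get_info(cl.program_info.BINARIES)):
>         with open("kernel_%d.bin" % i, "wb") as f:
>             f.write(binary)
>
>     # If the dump is a native object file, something like
>     #   objdump -d kernel_0.bin | grep zmm
>     # will show whether 512b instructions were emitted.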
>
> What I wrote here is one engineer's attempt to summarize a large amount of
> information in a user-friendly format. I apologize for any errors - they
> are certainly not intentional.
>
> [1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
> [2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
> [3] https://github.com/jeffhammond/vpu-count
> [4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
> [5] https://github.com/pocl/pocl/issues/292
>
>
>> 2) It could be the autovectorizer, or it could be something else. Are
>> your machines NUMA? If so, you'll likely see very bad performance, as
>> pocl has no NUMA tuning currently. Also, I've occasionally seen pocl
>> unroll too much and overflow the L1 cache (you could try experimenting
>> with various local WG sizes passed to clEnqueueNDRangeKernel).
>> Unfortunately this part of pocl has received little attention lately...
>>
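>
> Regarding the local work-group size experiments, a quick sweep could look
> like this (PyOpenCL sketch; the kernel and sizes are placeholders):
>
>     import pyopencl as cl
>
>     ctx = cl.create_some_context()
>     queue = cl.CommandQueue(
>         ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
>
>     src = "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }"
>     prg = cl.Program(ctx, src).build()
>
>     n = 1 << 20
>     buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=4 * n)
>
>     # Compare pure kernel runtimes across local work-group sizes.
>     for local_size in (16, 32, 64, 128, 256):
>         evt = prg.scale(queue, (n,), (local_size,), buf)
>         evt.wait()
>         print(local_size, (evt.profile.end - evt.profile.start) * 1e-9, "s")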
>
> I don't know what POCL uses for threading, but Intel OpenCL uses the TBB
> runtime [6]. The TBB runtime has some very smart features for
> load-balancing and automatic cache blocking that are not implemented in
> OpenMP and are hard to implement by hand in Pthreads.
>
> [6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611
>
> Jeff
>
> --
> Jeff Hammond
> [email protected]
> http://jeffhammond.github.io/
>
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519