Hi,

I have now dived a bit deeper into the code, using Pekka's and Jeff's
hints. Analyzing with VTune showed that no AVX2 code is generated by
POCL, which I had already suspected. I tried POCL_VECTORIZER_REMARKS=1
to activate vectorizer remarks, but it does not produce any output.
However, I could obtain the LLVM-generated code using
POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1.
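
(For completeness, how I enable these: I set the variables in the
environment before the host program starts. The snippet below is an
equivalent sketch in C host code; that POCL picks the variables up at
kernel-build time is my assumption.)

    #include <stdlib.h>

    /* Sketch: enable the POCL diagnostics from within the host program.
       Assumption: this runs before the first clBuildProgram() call, so
       POCL sees the variables when it compiles the kernel. */
    setenv("POCL_VECTORIZER_REMARKS", "1", 1);
    setenv("POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES", "1", 1);
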
I am not experienced with LLVM IR, but it seems that no vectorized code
is generated. I have uploaded a gist with the disassembled output here:

https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b

The question is what prevents the auto-vectorizer from working at all.
The code seems quite straightforward, with very simple for-loops with
hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the
experiments); a schematic sketch follows below. I would be grateful for
any pointers on how to proceed to figure out what is going on with the
vectorizer.
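
(To show the loop structure I mean, here is a simplified, hypothetical
sketch; the names and the loop body are made up, the real kernel is in
the gist above:)

    /* numQuadPoints is normally passed as a compiler macro, e.g.
       -DnumQuadPoints=3; hard-coded here to keep the sketch
       self-contained. */
    #define numQuadPoints 3

    __kernel void assemble(__global const float *points,
                           __global float *result)
    {
        size_t gid = get_global_id(0);
        float acc = 0.0f;
        /* very simple for-loops with hard-coded bounds */
        for (int i = 0; i < numQuadPoints; ++i)
            for (int j = 0; j < numQuadPoints; ++j)
                acc += points[gid * numQuadPoints + i]
                     * points[gid * numQuadPoints + j];
        result[gid] = acc;
    }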

By the way, I have recompiled POCL with LLVM 6. There was no change in
behavior compared to versions 4 and 5.

Best wishes

Timo

On 7 February 2018 at 16:37, Timo Betcke <[email protected]> wrote:

> Dear Jeff,
>
> thanks for the explanations. I have now installed POCL on my Xeon W
> workstation, and the benchmarks are as follows
> (pure kernel runtime via event timers this time, to exclude Python
> overhead; a sketch of the measurement follows the list):
>
> 1.) Intel OpenCL Driver: 0.0965s
> 2.) POCL: 0.937s
> 3.) AMD CPU OpenCL Driver: 0.64s
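>
> (For reference, a minimal sketch of the event-timer measurement; queue
> and kernel stand in for my actual host objects, and the queue is created
> with CL_QUEUE_PROFILING_ENABLE. This is a fragment, not a complete
> program:)
>
>     cl_event ev;
>     size_t gsize = 1024;                  /* placeholder global size */
>     clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
>                            0, NULL, &ev);
>     clWaitForEvents(1, &ev);
>     cl_ulong t0, t1;
>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
>                             sizeof(t0), &t0, NULL);
>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
>                             sizeof(t1), &t1, NULL);
>     double seconds = (double)(t1 - t0) * 1e-9;  /* timestamps in ns */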
>
> The CPU is a Xeon W-2155 with 3.3 GHz and 10 cores. I have not had time
> to investigate the LLVM IR code as suggested, but will do so as soon as
> possible. AMD is included because I have a Radeon Pro card, whose driver
> also installed AMD's OpenCL CPU driver.
>
> Best wishes
>
> Timo
>
>
> On 7 February 2018 at 16:03, Jeff Hammond <[email protected]> wrote:
>
>>
>>
>> On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> > we noticed for one of our OpenCL kernels that pocl is over 4 times
>>> > slower than the Intel OpenCL runtime on a Xeon W processor.
>>>
>>> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
>>> is likely using to the full. LLVM 4 has absolutely horrible AVX-512
>>> support, LLVM 5 is better but there are still bugs, and you'll want
>>> LLVM 6 for AVX-512 to work (at least I know they fixed the few AVX-512
>>> bugs I found; I no longer have a machine to test it on).
>>>
>>
>>
>> Indeed, Xeon W [1] is a sibling of the Xeon Scalable and Core X-series
>> parts of the Skylake generation, which I'll refer to as SKX since they
>> are microarchitecturally the same.  All of these support AVX-512, which
>> I'm going to refer to as AVX3 in the following, for reasons that will
>> become clear.
>>
>> An important detail when evaluating vectorization on these processors is
>> that the frequency drops when transitioning from scalar/SSE2 code to AVX2
>> code to AVX3 (i.e. AVX-512) code [2], which corresponds to the use of xmm
>> (128b), ymm (256b), and zmm (512b) registers respectively.  AVX3
>> instructions with ymm registers should run at AVX2 frequency.
>>
>> While most (but not all - see [3]) parts have 2 VPUs, the first of these
>> is implemented via port fusion [4].  This means the core can dispatch 2
>> 512b AVX3 instructions on ports 0+1 and 5, or it can dispatch 3 256b
>> instructions (AVX2 or AVX3) on ports 0, 1 and 5.  Thus, one can get
>> 1024b throughput at one frequency or 768b throughput at a slightly
>> higher frequency.  The upshot is that 512b vectorization pays off for
>> code that is thoroughly compute-bound and heavily vectorized (e.g. dense
>> linear algebra and molecular dynamics), but that 256b vectorization is
>> likely better for code that is more memory-bound or doesn't vectorize as
>> well.
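>>
>> (Back-of-the-envelope, using the numbers above: peak vector throughput
>> is width times frequency, so the 3x256b configuration only wins on raw
>> throughput if the AVX2 frequency exceeds the AVX3 frequency by more
>> than 1024/768, i.e. about 1.33x.  Actual frequency ratios are smaller
>> than that, which is why 512b still wins for thoroughly compute-bound
>> code; for memory-bound code the higher frequency also helps the
>> non-vector parts, shifting the balance toward 256b.)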
>>
>> The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high} to address
>> this: "-xCORE-AVX512 -qopt-zmm-usage=low" will take advantage of all the
>> AVX3 instructions but favor 256b ymm registers, which behave exactly like
>> AVX2 in some cases (i.e. those where the AVX3 instruction features aren't
>> used).
>>
>> Anyway, the short version of this story is that you should not assume
>> 512b SIMD code generation is the reason for a performance benefit from
>> the Intel OpenCL compiler, since it may in fact not generate those
>> instructions if it thinks that 256b is better.  It would be useful to
>> force both POCL and Intel OpenCL to the same vector ISA (first SSE2,
>> then AVX2) in experiments, to see how they compare when targeting the
>> same instructions.  This sort of comparison would also be helpful in
>> resolving an older bug report of a similar nature [5].
>>
>> What I wrote here is one engineer's attempt to summarize a large amount
>> of information in a user-friendly format.  I apologize for any errors -
>> they are certainly not intentional.
>>
>> [1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
>> [2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
>> [3] https://github.com/jeffhammond/vpu-count
>> [4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
>> [5] https://github.com/pocl/pocl/issues/292
>>
>>
>>> 2) It could be the autovectorizer, or it could be something else. Are
>>> your machines NUMA? If so, you'll likely see very bad performance, as
>>> pocl currently has no NUMA tuning. Also, I've occasionally seen pocl
>>> unroll too much and overflow the L1 caches (you could try experimenting
>>> with various local WG sizes passed to clEnqueueNDRangeKernel, as in the
>>> sketch below). Unfortunately this part of pocl has received little
>>> attention lately...
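>>>
>>> (A minimal sketch of passing an explicit local size; the value 8 is
>>> just a placeholder to experiment with, and queue/kernel stand in for
>>> the real host objects:)
>>>
>>>     size_t global = 1024;   /* total work-items, divisible by local */
>>>     size_t local  = 8;      /* try 8, 16, 32, ... and time each     */
>>>     clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
>>>                            &global, &local, 0, NULL, NULL);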
>>>
>>
>> I don't know what POCL uses for threading, but Intel OpenCL uses the TBB
>> runtime [6].  The TBB runtime has some very smart features for
>> load-balancing and automatic cache blocking that are not implemented in
>> OpenMP and are hard to implement by hand in Pthreads.
>>
>> [6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> [email protected]
>> http://jeffhammond.github.io/
>>
>>
>>
>
>
> --
> Dr. Timo Betcke
> Reader in Mathematics
> University College London
> Department of Mathematics
> E-Mail: [email protected]
> Tel.: +44 (0) 20-3108-4068
> Fax.: +44 (0) 20-7383-5519
>



-- 
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
