Hi,

One more hint: I followed Pekka's suggestion to enable debug output in
ImplicitLoopBarriers.cc and ImplicitConditionalBarriers.cc. Some
interesting output is generated:

### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
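
If I read this correctly, "uniform" means that every work-item in the
work-group enters the loop in the same way. A minimal sketch of the kind
of loop I would expect to trip this check (a hypothetical kernel, not our
actual code):

    __kernel void sketch(__global float *out, const int n) {
        int gid = get_global_id(0);
        float acc = 0.0f;
        /* The trip count depends on the work-item ID, so the loop
           entry is not uniform across the work-group. */
        for (int i = 0; i < gid % n; ++i) {
            acc += (float)i;
        }
        out[gid] = acc;
    }

Our loops only have compile-time constant bounds, though, so I do not see
why the entry would be non-uniform.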

What does this mean, and does it prevent work-group-level parallelization?

Best wishes

Timo

On 7 February 2018 at 23:41, Timo Betcke <[email protected]> wrote:

> Hi,
>
> I have now tried to dive a bit more into the code, using Pekka's and
> Jeff's hints. Analyzing with VTune showed that no AVX2 code is generated
> by pocl, which I already suspected. I tried POCL_VECTORIZER_REMARKS=1 to
> activate vectorizer remarks, but it does not produce any output. However,
> I could obtain the LLVM-generated code using
> POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1. I am not experienced with LLVM
> IR, but it seems that no vectorized code is produced. I have uploaded a
> gist with the disassembled output here:
>
> https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b
>
> The question is what prevents the auto-vectorizer from working at all.
> The code seems quite straightforward, with very simple for-loops with
> hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the
> experiments). I would be grateful for any pointers on how to figure out
> what is going on with the vectorizer.
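>
> For reference, the loop structure is roughly of the following form (a
> simplified sketch with made-up names, not the actual kernel):
>
>     #define numQuadPoints 3  /* passed as a compiler macro in practice */
>
>     __kernel void sketch(__global const float *weights,
>                          __global float *result) {
>         int gid = get_global_id(0);
>         float sum = 0.0f;
>         /* A simple for-loop with a hard-coded bound. */
>         for (int q = 0; q < numQuadPoints; ++q) {
>             sum += weights[gid * numQuadPoints + q];
>         }
>         result[gid] = sum;
>     }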
>
> By the way, I have recompiled pocl with LLVM 6. There was no change in
> behavior compared to versions 4 and 5.
>
> Best wishes
>
> Timo
>
> On 7 February 2018 at 16:37, Timo Betcke <[email protected]> wrote:
>
>> Dear Jeff,
>>
>> thanks for the explanations. I have now installed pocl on my Xeon W
>> workstation. The benchmarks are as follows (pure kernel runtime via
>> event timers this time, to exclude Python overhead):
>>
>> 1.) Intel OpenCL Driver: 0.0965s
>> 2.) POCL: 0.937s
>> 3.) AMD CPU OpenCL Driver: 0.64s
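>>
>> The kernel-only timings come from OpenCL event profiling; I drive this
>> from Python, but the measurement is equivalent to the following C sketch
>> (queue and kernel are assumed to be valid handles):
>>
>>     /* The queue is created with profiling enabled, e.g.
>>        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err). */
>>     cl_event ev;
>>     const size_t global[1] = { 65536 };  /* placeholder problem size */
>>     clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL,
>>                            0, NULL, &ev);
>>     clWaitForEvents(1, &ev);
>>     cl_ulong t0, t1;
>>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
>>                             sizeof(t0), &t0, NULL);
>>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
>>                             sizeof(t1), &t1, NULL);
>>     double seconds = (double)(t1 - t0) * 1e-9;  /* timestamps are in ns */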
>>
>> The CPU is a Xeon W-2155 with 3.3 GHz and 10 cores. I have not yet had
>> time to investigate the LLVM IR code as suggested, but will do so as
>> soon as possible. AMD is included because I have a Radeon Pro card,
>> which automatically also installed an OpenCL CPU driver.
>>
>> Best wishes
>>
>> Timo
>>
>>
>> On 7 February 2018 at 16:03, Jeff Hammond <[email protected]> wrote:
>>
>>>
>>>
>>> On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> > we noticed for one of our OpenCL kernels that pocl is over 4 times
>>>> > slower than the Intel OpenCL runtime on a Xeon W processor.
>>>>
>>>> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
>>>> is likely using fully. LLVM 4 has absolutely horrible AVX-512 support,
>>>> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
>>>> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
>>>> found; I no longer have a machine to test it on).
>>>>
>>>
>>>
>>> Indeed, Xeon W [1] is a sibling of Xeon Scalable and Core X-series of
>>> the Skylake generation, which I'll refer to as SKX since they are
>>> microarchitecturally the same.  All of these support AVX-512, which I'm
>>> going to refer to as AVX3 in the following, for reasons that will become
>>> clear.
>>>
>>> An important detail when evaluating vectorization on these processors is
>>> that the frequency drops when transitioning from scalar/SSE2 code to AVX2
>>> code to AVX3 (i.e. AVX-512) code [2], which corresponds to the use of xmm
>>> (128b), ymm (256b), and zmm (512b) registers respectively.  AVX3
>>> instructions with ymm registers should run at AVX2 frequency.
>>>
>>> While most (but not all - see [3]) parts have 2 VPUs, the first of these
>>> is implemented via port fusion [4].  This means the core can dispatch 2
>>> 512b AVX3 instructions on ports 0+1 and 5, or it can dispatch 3 256b
>>> instructions (AVX2 or AVX3) on ports 0, 1 and 5.  Thus, one can get
>>> 1024b of throughput per cycle at one frequency, or 768b at a slightly
>>> higher frequency.  The upshot is that 512b vectorization pays off for
>>> code that is thoroughly compute-bound and heavily vectorized (e.g. dense
>>> linear algebra and molecular dynamics), but 256b vectorization is likely
>>> better for code that is more memory-bound or doesn't vectorize as well.
>>>
>>> The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high} to
>>> address this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is going to take
>>> advantage of all the AVX3 instructions but favor 256b ymm registers, which
>>> will behave exactly like AVX2 in some cases (i.e. ones where the AVX3
>>> instruction features aren't used).
>>>
>>> Anyways, the short version of this story is that you should not assume
>>> 512b SIMD code generation is the reason for a performance benefit from
>>> the Intel OpenCL compiler, since it may in fact not generate those
>>> instructions if it thinks that 256b is better.  It would be useful to
>>> force both POCL and Intel OpenCL to the same vector ISA (e.g. SSE2 or
>>> AVX2) in experiments, to see how they compare when targeting the same
>>> instructions.  This sort of comparison would also be helpful to resolve
>>> an older bug report of a similar nature [5].
>>>
>>> What I wrote here is one engineer's attempt to summarize a large amount
>>> of information in a user-friendly format.  I apologize for any errors -
>>> they are certainly not intentional.
>>>
>>> [1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
>>> [2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
>>> [3] https://github.com/jeffhammond/vpu-count
>>> [4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
>>> [5] https://github.com/pocl/pocl/issues/292
>>>
>>>
>>>> 2) It could be the autovectorizer, or it could be something else. Are
>>>> your machines NUMA? If so, you'll likely see very bad performance, as
>>>> pocl currently has no NUMA tuning. I've also occasionally seen pocl
>>>> unroll too much and overflow the L1 cache (you could try experimenting
>>>> with various local WG sizes passed to clEnqueueNDRangeKernel, as in
>>>> the sketch below). Unfortunately this part of pocl has received little
>>>> attention lately...
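>>>>
>>>> For example, something along these lines on the host side (an untested
>>>> sketch; queue and kernel are assumed to be valid handles, and the sizes
>>>> are placeholders to experiment with):
>>>>
>>>>     const size_t global[1] = { 65536 };
>>>>     const size_t local[1]  = { 64 };  /* try e.g. 16, 32, 64, 128 */
>>>>     /* Pass an explicit local work-group size instead of letting
>>>>        the runtime pick one (NULL). */
>>>>     cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
>>>>                                         global, local, 0, NULL, NULL);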
>>>>
>>>
>>> I don't know what POCL uses for threading, but Intel OpenCL uses the TBB
>>> runtime [6].  The TBB runtime has some very smart features for
>>> load-balancing and automatic cache blocking that are not implemented in
>>> OpenMP and are hard to implement by hand in Pthreads.
>>>
>>> [6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611
>>>
>>> Jeff
>>>
>>> --
>>> Jeff Hammond
>>> [email protected]
>>> http://jeffhammond.github.io/
>>>
>>>
>>>
>>
>>
>>
>
>
>
>



-- 
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519