Dear Timo,
Michal's advice to test with LLVM 6.0 is also a good idea. Given that
you don't seem to have barriers in your kernel, you might want to check
whether any of the implicit barriers we inject confuse the vectorization.
These are inserted at least in the passes:
ImplicitConditionalBarriers.cc
ImplicitLoopBarriers.cc
They have some debug output which you can enable via a macro define,
which can point you in the right direction.
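For illustration, enabling such a define might look like the sketch
below; the exact macro name in the pocl sources may differ, so treat it
as an assumption:

    /* Hypothetical: near the top of lib/llvmopencl/ImplicitLoopBarriers.cc,
       enable the pass's debug define and rebuild pocl so the pass reports
       what it injects into each kernel. The macro name is illustrative. */
    #define DEBUG_ILOOP_BARRIERS 1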
BR,
Pekka
On 02/07/2018 01:34 PM, Timo Betcke wrote:
Dear All,
thanks for the responses. I am posting the kernel below. You can also find it
at
https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
It is the first kernel (evaluate_regular). The second one is less
performance critical. First of all, regarding the processor: I mistyped.
I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but
will redo them today on my Xeon W workstation. I already recompiled
yesterday with LLVM 5, but the results are similar.
Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for
this setup.
I will follow Jeff's and Pekka's advice today to get some more information
on the compiled kernels (it might be slow going, as this is the first time
I am diving into OpenCL profiling).
I just also tried setting the work-group sizes manually, but this did not
change the pocl performance at all; a sketch of the kind of launch call I
mean follows below. The actual kernel is further down and consists of
fairly simple for loops.
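A minimal host-side sketch of forcing an explicit local work-group size
(the names queue, kernel, numTest and numTrialPadded are illustrative,
and our real host code uses PyOpenCL rather than the C API):

    /* Launch one work item per (test, trial) pair with an explicit local
       size. numTrialPadded is numTrial rounded up to a multiple of the
       local size; the kernel's nTrial guard discards the padding items. */
    size_t global[2] = {numTest, numTrialPadded};
    size_t local[2]  = {1, 16};   /* e.g. block the trial dimension by 16 */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                        global, local, 0, NULL, NULL);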
The parameter REALTYPE is controlled by a macro in the header file and set
to 'float' for the benchmark tests.
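For context, a minimal sketch of what those definitions might look like
(an assumption for illustration; the actual bempp-cl header may differ):

    /* Hypothetical header sketch; the real bempp-cl definitions may differ. */
    #define REALTYPE  float
    #define REALTYPE2 float2
    #define REALTYPE3 float3
    #define N_QUAD_POINTS 4                 /* quadrature points per triangle */
    #define M_INV_4PI 0.07957747154594767f  /* 1 / (4 * pi) */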
A little explanation is in order. Each work item takes two triangles (a
test and a trial triangle), computes some geometric properties, and
integrates the 3D Laplace potential operator 1/r across them, where r is
the distance between a quadrature point on the test triangle and one on
the trial triangle. So it is a fairly simple operation that accelerates
very well on the Intel runtime.
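In formulas, each work item computes the quadrature approximation (my
notation, matching the code below):

    I = \frac{|J_\mathrm{test}|\,|J_\mathrm{trial}|}{4\pi}
        \sum_{i=1}^{Q} \sum_{j=1}^{Q} \frac{w_i w_j}{|x_i - y_j|}

where x_i and y_j are the quadrature points mapped to the test and trial
triangles, w_i and w_j the corresponding weights, Q = N_QUAD_POINTS, and
|J| denotes the integration elements (testIntElem, trialIntElem).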
I am quite motivated to get this issue sorted. We are rewriting an
open-source software package (Bempp, www.bempp.com) to be based on
PyOpenCL with OpenCL kernels. The first performance benchmarks gave very
good speed-ups compared to our old C++ code (for both Intel and pocl).
But for later deployment we want to be able to target pocl with minimal
performance penalties compared to the Intel runtime (if possible), as it
is available by default in Ubuntu and can also be simply installed via
conda-forge.
Best wishes
Timo
---------------------------------------------------------------------------------------------
__kernel void evaluate_regular(__constant REALTYPE3 *testElements,
                               __constant REALTYPE3 *trialElements,
                               __constant REALTYPE2 *quadPoints,
                               __constant REALTYPE *quadWeights,
                               __global REALTYPE *globalResult,
                               int nTrial)
{
    /* Variable declarations */
    const int numQuadPoints = N_QUAD_POINTS;
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);
    size_t trialOffset = get_global_offset(1);
    size_t testQuadIndex;
    size_t trialQuadIndex;
    size_t i;

    REALTYPE3 testGlobalPoint;
    REALTYPE3 trialGlobalPoint;
    REALTYPE3 testCorners[3];
    REALTYPE3 trialCorners[3];
    REALTYPE3 testJac[2];
    REALTYPE3 trialJac[2];
    REALTYPE2 testPoint;
    REALTYPE2 trialPoint;
    REALTYPE dist;
    REALTYPE testIntElem;
    REALTYPE trialIntElem;
    REALTYPE shapeIntegral = 0;

    if (myTrialIndex - trialOffset >= nTrial) return;

    for (i = 0; i < 3; ++i) {
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * myTrialIndex + i];
    }

    testJac[0] = testCorners[1] - testCorners[0];
    testJac[1] = testCorners[2] - testCorners[0];
    trialJac[0] = trialCorners[1] - trialCorners[0];
    trialJac[1] = trialCorners[2] - trialCorners[0];

    testIntElem = length(cross(testJac[0], testJac[1]));
    trialIntElem = length(cross(trialJac[0], trialJac[1]));

    for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex) {
        testPoint = quadPoints[testQuadIndex];
        testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x +
            testJac[1] * testPoint.y;
        // testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
        //     testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
        for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints;
             ++trialQuadIndex) {
            trialPoint = quadPoints[trialQuadIndex];
            trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x +
                trialJac[1] * trialPoint.y;
            // trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
            //     trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
            dist = distance(testGlobalPoint, trialGlobalPoint);
            shapeIntegral += quadWeights[testQuadIndex] *
                quadWeights[trialQuadIndex] / dist;
        }
    }

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] =
        shapeIntegral;
}
On 7 February 2018 at 10:41, Michal Babej <[email protected]> wrote:
>
> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
> is likely using fully. LLVM 4 has absolutely horrible AVX-512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
> found; I don't have a machine to test it anymore).
>
> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA? If so, you'll likely see very bad performance, as
> pocl currently has no NUMA tuning. Also, I've occasionally seen that pocl
> unrolls too much and overflows the L1 caches (you could try experimenting
> with various local WG sizes passed to clEnqueueNDRangeKernel).
> Unfortunately this part of pocl has received little attention lately...
>
> Cheers,
> -- mb
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
--
Pekka