On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <[email protected]> wrote:

> Dear All,
>
> thanks for the responses. I am posting the kernel below. You can also find it at
>
> https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
>
>
I get an "access denied" error.

Jeff


> It is the first kernel (evaluate_regular); the second one is less performance critical.
>
> First of all, regarding the processor: I mistyped. I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but I will redo them today on my Xeon W workstation. I already recompiled yesterday with LLVM 5, but the results are similar: Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for this setup.
>
> I will follow Jeff's and Pekka's advice today to get some more information on the compiled kernels (it might be slow going, as this is the first time I am diving into OpenCL profiling).
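> As a starting point I will probably just time the kernel with OpenCL events; roughly, that is the following sketch (plain C only for illustration -- the real host code goes through PyOpenCL -- and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that offset/global/local describe the 2D launch):
>
> #include <CL/cl.h>
> #include <stdio.h>
>
> /* Time a single launch of the kernel with OpenCL profiling events.
>    Error checking omitted to keep the sketch short. */
> void time_kernel(cl_command_queue queue, cl_kernel kernel,
>                  const size_t *offset, const size_t *global,
>                  const size_t *local)
> {
>     cl_event ev;
>     cl_ulong start = 0, end = 0;
>
>     clEnqueueNDRangeKernel(queue, kernel, 2, offset, global, local,
>                            0, NULL, &ev);
>     clWaitForEvents(1, &ev);
>
>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
>                             sizeof(start), &start, NULL);
>     clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
>                             sizeof(end), &end, NULL);
>     clReleaseEvent(ev);
>
>     printf("kernel time: %g ms\n", (end - start) * 1e-6);
> }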
>
> I also just tried setting the work-group sizes manually, but this did not change pocl's performance at all. The actual kernel is below; it consists of fairly simple for loops. The parameter REALTYPE is controlled by a macro in the header file and is set to 'float' for the benchmark tests.
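> (In plain OpenCL C API terms -- the real host code goes through PyOpenCL -- forcing a work-group size amounts to something like the sketch below. The (8, 8) local size is only an example value to experiment with, and the helper name and explicit trial offset are illustrative, not the actual Bempp code.)
>
> #include <CL/cl.h>
>
> /* Sketch: enqueue evaluate_regular with an explicit 2D local size.
>    The kernel only guards the trial dimension, so nTest is assumed to be
>    a multiple of local[0]; the trial dimension is rounded up and the
>    padded work items return early inside the kernel. */
> cl_int enqueue_regular(cl_command_queue queue, cl_kernel kernel,
>                        size_t nTest, size_t nTrial, size_t trialOffset)
> {
>     const size_t offset[2] = { 0, trialOffset };  /* read via get_global_offset(1) */
>     const size_t local[2]  = { 8, 8 };            /* example local size */
>     const size_t global[2] = { nTest,
>                                ((nTrial + local[1] - 1) / local[1]) * local[1] };
>
>     return clEnqueueNDRangeKernel(queue, kernel, 2, offset,
>                                   global, local, 0, NULL, NULL);
> }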
>
> A little explanation is in order. Each work item takes two triangles (a test triangle and a trial triangle), computes some geometric properties, and integrates the 3D Laplace potential operator 1/r across them, where r is the distance between a quadrature point on the test triangle and one on the trial triangle. So it is a fairly simple operation, and it accelerates very well on the Intel runtime.
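> Written out, each work item just evaluates the tensor-product quadrature rule (LaTeX notation for clarity; the symbols correspond to the variables in the kernel below):
>
>     globalResult[i, j] \approx \frac{\mu_i \, \nu_j}{4\pi}
>         \sum_{p=0}^{Q-1} \sum_{q=0}^{Q-1} \frac{w_p \, w_q}{|x_i(p) - y_j(q)|}
>
> where x_i(p) and y_j(q) are the quadrature points mapped onto the test and trial triangles, w_p and w_q are the quadrature weights, Q = N_QUAD_POINTS, and \mu_i, \nu_j are the surface measures (testIntElem and trialIntElem in the code).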
>
> I am quite motivated to get this issue sorted. We are rewriting an open-source software package (Bempp, www.bempp.com) to be based on PyOpenCL with OpenCL kernels. The first performance benchmarks gave very good speed-ups over our old C++ code (for both Intel and pocl). But for later deployment we want to be able to target pocl with minimal performance penalties compared to the Intel runtime (if possible), as it is available in Ubuntu by default and can also be installed easily via conda-forge.
>
> Best wishes
>
> Timo
>
>
> ---------------------------------------------------------------------------------------------
>
> __kernel void evaluate_regular(__constant REALTYPE3 *testElements,
>                                __constant REALTYPE3 *trialElements,
>                                __constant REALTYPE2 *quadPoints,
>                                __constant REALTYPE *quadWeights,
>                                __global REALTYPE *globalResult,
>                                int nTrial)
> {
>     /* Variable declarations */
>
>     const int numQuadPoints = N_QUAD_POINTS;
>
>     size_t myTestIndex = get_global_id(0);
>     size_t myTrialIndex = get_global_id(1);
>
>     size_t trialOffset = get_global_offset(1);
>
>     size_t testQuadIndex;
>     size_t trialQuadIndex;
>     size_t i;
>
>     REALTYPE3 testGlobalPoint;
>     REALTYPE3 trialGlobalPoint;
>
>     REALTYPE3 testCorners[3];
>     REALTYPE3 trialCorners[3];
>
>     REALTYPE3 testJac[2];
>     REALTYPE3 trialJac[2];
>
>     REALTYPE2 testPoint;
>     REALTYPE2 trialPoint;
>
>     REALTYPE dist;
>     REALTYPE testIntElem;
>     REALTYPE trialIntElem;
>
>     REALTYPE shapeIntegral = 0;
>
>     /* Work items padded beyond the current trial chunk do nothing. */
>     if (myTrialIndex - trialOffset >= nTrial) return;
>
>     /* Load the corners of the test and trial triangles. */
>     for (i = 0; i < 3; ++i){
>         testCorners[i] = testElements[3 * myTestIndex + i];
>         trialCorners[i] = trialElements[3 * myTrialIndex + i];
>     }
>
>     /* Jacobians of the maps from the reference triangle. */
>     testJac[0] = testCorners[1] - testCorners[0];
>     testJac[1] = testCorners[2] - testCorners[0];
>
>     trialJac[0] = trialCorners[1] - trialCorners[0];
>     trialJac[1] = trialCorners[2] - trialCorners[0];
>
>     /* Integration elements (twice the triangle areas). */
>     testIntElem = length(cross(testJac[0], testJac[1]));
>     trialIntElem = length(cross(trialJac[0], trialJac[1]));
>
>     /* Tensor quadrature over the test/trial triangle pair. */
>     for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){
>
>         testPoint = quadPoints[testQuadIndex];
>         testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + testJac[1] * testPoint.y;
>         //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
>         //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
>
>         for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){
>
>             trialPoint = quadPoints[trialQuadIndex];
>             trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x + trialJac[1] * trialPoint.y;
>             //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
>             //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
>
>             dist = distance(testGlobalPoint, trialGlobalPoint);
>             shapeIntegral += quadWeights[testQuadIndex] * quadWeights[trialQuadIndex] / dist;
>         }
>     }
>
>     shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
>     globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;
> }
>
>
>
>
>
> On 7 February 2018 at 10:41, Michal Babej <[email protected]> wrote:
> >
> > Hi,
> >
> > > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > > slower than the Intel OpenCL runtime on a Xeon W processor.
> >
> > 1) If I googled correctly, the Xeon W has AVX-512, which the Intel runtime is likely using fully. LLVM 4 has absolutely horrible AVX-512 support; LLVM 5 is better, but there are still bugs, and you'll want LLVM 6 for AVX-512 to work (at least I know they fixed the few AVX-512 bugs I found; I no longer have a machine to test it on).
> >
> > 2) It could be the autovectorizer, or it could be something else. Are your machines NUMA? If so, you'll likely see very bad performance, as pocl currently has no NUMA tuning. Also, I've occasionally seen pocl unroll too much and overflow the L1 cache (you could try experimenting with various local WG sizes passed to clEnqueueNDRangeKernel). Unfortunately this part of pocl has received little attention lately...
> >
> > Cheers,
> > -- mb
> >
>
>
>
>
> --
> Dr. Timo Betcke
> Reader in Mathematics
> University College London
> Department of Mathematics
> E-Mail: [email protected]
> Tel.: +44 (0) 20-3108-4068
> Fax.: +44 (0) 20-7383-5519
>


-- 
Jeff Hammond
[email protected]
http://jeffhammond.github.io/
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
