Sorry. It was still set to private. I have set it to public now.
Timo
On 7 February 2018 at 15:52, Jeff Hammond <[email protected]> wrote:
>
>
> On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <[email protected]> wrote:
>
>> Dear All,
>>
>> Thanks for the responses. I am posting the kernel below. You can also
>> find it at:
>>
>> https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
>>
>>
> I get an "access denied" error.
>
> Jeff
>
>
>> It is the first kernel (evaluate_regular); the second one is less
>> performance critical. First of all, regarding the processor: I mistyped.
>> I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but
>> will redo them today on my Xeon W workstation. I already recompiled
>> yesterday with LLVM 5, but the results are similar: Intel OpenCL takes
>> around 0.9 seconds and pocl around 5 seconds for this setup.
>>
>> I will follow Jeff's and Pekka's advice today to get some more information
>> on the compiled kernels (it might be slow going, as this is the first time
>> I am diving into OpenCL profiling).
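>>
>> In case it is useful to others, this is roughly how I plan to inspect
>> what pocl generates (a sketch in PyOpenCL; the environment variable
>> names are from my reading of the pocl documentation, so please treat
>> them as assumptions):
>>
>> import os
>>
>> # Keep pocl's intermediate compiler files on disk for inspection,
>> # print LLVM loop-vectorizer remarks, and enable verbose debug output
>> # (variable names as I understand the pocl docs).
>> os.environ["POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES"] = "1"
>> os.environ["POCL_VECTORIZER_REMARKS"] = "1"
>> os.environ["POCL_DEBUG"] = "all"
>>
>> import pyopencl as cl  # import only after setting the variables
>>
>> ctx = cl.create_some_context()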
>>
>> I also just tried setting the workgroup sizes manually, but this did not
>> change anything in the pocl performance. The actual kernel is below; it
>> consists of fairly simple for loops. The parameter REALTYPE is controlled
>> by a macro in the header file and is set to 'float' for the benchmark
>> tests.
>>
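>> For reference, this is roughly how the macros are injected from the
>> Python side (a minimal PyOpenCL sketch; the file name and the exact
>> macro set are simplifications of our actual code, so treat them as
>> assumptions):
>>
>> import pyopencl as cl
>>
>> ctx = cl.create_some_context()
>> queue = cl.CommandQueue(ctx)
>>
>> # The type macros and quadrature order are passed as build options;
>> # in Bempp itself they live in a header file, which is analogous.
>> source = open("laplace_bnd_slp_0.cl").read()
>> program = cl.Program(ctx, source).build(options=[
>>     "-DREALTYPE=float", "-DREALTYPE2=float2", "-DREALTYPE3=float3",
>>     "-DN_QUAD_POINTS=6", "-DM_INV_4PI=0.07957747154594767f"])
>>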
>> A little explanation is in order. Each work item takes two triangles (a
>> test and a trial triangle), computes some geometric properties, and
>> integrates the 3D Laplace potential operator 1/r across them, where r is
>> the distance between a quadrature point on the test triangle and one on
>> the trial triangle. So it is a fairly simple operation that accelerates
>> very well on the Intel runtime.
>>
>> I am quite motivated to get this issue sorted. We are rewriting an
>> open-source software package (Bempp, www.bempp.com) to be based on
>> PyOpenCL with OpenCL kernels. The first performance benchmarks gave very
>> good speed-ups compared to our old C++ code (for both Intel and pocl).
>> But for later deployment we want to be able to target pocl with minimal
>> performance penalties compared to the Intel runtime (if possible), as it
>> is available by default in Ubuntu and can also be installed simply via
>> conda-forge.
>>
>> Best wishes
>>
>> Timo
>>
>>
>> ---------------------------------------------------------------------------------------------
>>
>> __kernel void evaluate_regular(__constant REALTYPE3 *testElements,
>>                                __constant REALTYPE3 *trialElements,
>>                                __constant REALTYPE2 *quadPoints,
>>                                __constant REALTYPE *quadWeights,
>>                                __global REALTYPE *globalResult,
>>                                int nTrial)
>> {
>>     /* Variable declarations */
>>
>>     const int numQuadPoints = N_QUAD_POINTS;
>>
>>     size_t myTestIndex = get_global_id(0);
>>     size_t myTrialIndex = get_global_id(1);
>>
>>     size_t trialOffset = get_global_offset(1);
>>
>>     size_t testQuadIndex;
>>     size_t trialQuadIndex;
>>     size_t i;
>>
>>     REALTYPE3 testGlobalPoint;
>>     REALTYPE3 trialGlobalPoint;
>>
>>     REALTYPE3 testCorners[3];
>>     REALTYPE3 trialCorners[3];
>>
>>     REALTYPE3 testJac[2];
>>     REALTYPE3 trialJac[2];
>>
>>     REALTYPE2 testPoint;
>>     REALTYPE2 trialPoint;
>>
>>     REALTYPE dist;
>>     REALTYPE testIntElem;
>>     REALTYPE trialIntElem;
>>
>>     REALTYPE shapeIntegral = 0;
>>
>>     /* Exit for work items beyond the trial range. */
>>     if (myTrialIndex - trialOffset >= nTrial) return;
>>
>>     /* Load the corners of the test and trial triangles. */
>>     for (i = 0; i < 3; ++i){
>>         testCorners[i] = testElements[3 * myTestIndex + i];
>>         trialCorners[i] = trialElements[3 * myTrialIndex + i];
>>     }
>>
>>     /* Jacobians of the maps from the reference triangle. */
>>     testJac[0] = testCorners[1] - testCorners[0];
>>     testJac[1] = testCorners[2] - testCorners[0];
>>
>>     trialJac[0] = trialCorners[1] - trialCorners[0];
>>     trialJac[1] = trialCorners[2] - trialCorners[0];
>>
>>     /* Surface measures (twice the triangle areas). */
>>     testIntElem = length(cross(testJac[0], testJac[1]));
>>     trialIntElem = length(cross(trialJac[0], trialJac[1]));
>>
>>     for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){
>>
>>         testPoint = quadPoints[testQuadIndex];
>>         testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x +
>>             testJac[1] * testPoint.y;
>>         //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
>>         //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
>>
>>         for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){
>>
>>             trialPoint = quadPoints[trialQuadIndex];
>>             trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x +
>>                 trialJac[1] * trialPoint.y;
>>             //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
>>             //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
>>
>>             dist = distance(testGlobalPoint, trialGlobalPoint);
>>             shapeIntegral += quadWeights[testQuadIndex] *
>>                 quadWeights[trialQuadIndex] / dist;
>>         }
>>     }
>>
>>     shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
>>     globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] =
>>         shapeIntegral;
>> }
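>>
>> For completeness, the host side is driven along these lines (a
>> simplified sketch continuing the build snippet above; the buffer
>> contents and sizes are illustrative, not our actual code):
>>
>> import numpy as np
>>
>> nTest, nTrial, nQuad = 1024, 1024, 6
>> mf = cl.mem_flags
>>
>> # float3 has 16-byte alignment in OpenCL, hence the padding to 4 floats.
>> test_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>>     hostbuf=np.random.rand(3 * nTest, 4).astype(np.float32))
>> trial_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>>     hostbuf=np.random.rand(3 * nTrial, 4).astype(np.float32))
>> points_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>>     hostbuf=np.random.rand(nQuad, 2).astype(np.float32))
>> weights_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>>     hostbuf=np.random.rand(nQuad).astype(np.float32))
>> result_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=4 * nTest * nTrial)
>>
>> # One work item per (test, trial) pair; local size None lets the
>> # runtime choose (this is where I also tried explicit local sizes).
>> program.evaluate_regular(queue, (nTest, nTrial), None,
>>     test_buf, trial_buf, points_buf, weights_buf,
>>     result_buf, np.int32(nTrial))
>> queue.finish()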
>>
>>
>>
>>
>>
>> On 7 February 2018 at 10:41, Michal Babej <[email protected]>
>> wrote:
>> >
>> > Hi,
>> >
>> > > we noticed for one of our OpenCL kernels that pocl is over 4 times
>> > > slower than the Intel OpenCL runtime on a Xeon W processor.
>> >
>> > 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
>> > is likely using fully. LLVM 4 has absolutely horrible AVX-512 support,
>> > LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
>> > AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
>> > found; I don't have a machine to test it anymore).
>> >
>> > 2) It could be the autovectorizer, or it could be something else. Are
>> > your machines NUMA? If so, you'll likely see very bad performance, as
>> > pocl has no NUMA tuning currently. Also, I've occasionally seen pocl
>> > unroll too much and overflow the L1 caches (you could try experimenting
>> > with various local WG sizes passed to clEnqueueNDRangeKernel; see the
>> > sketch below). Unfortunately this part of pocl has received little
>> > attention lately...
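>> >
>> > For concreteness, the kind of sweep I mean, as a PyOpenCL sketch
>> > (kernel, global_size and kernel_args are placeholders for whatever
>> > your setup defines; local sizes must evenly divide the global size):
>> >
>> > import time
>> >
>> > for ls in [None, (1, 8), (1, 16), (8, 8), (16, 16)]:
>> >     t0 = time.time()
>> >     kernel(queue, global_size, ls, *kernel_args)
>> >     queue.finish()
>> >     print(ls, time.time() - t0)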
>> >
>> > Cheers,
>> > -- mb
>> >
>>
>>
>>
>>
>> --
>> Dr. Timo Betcke
>> Reader in Mathematics
>> University College London
>> Department of Mathematics
>> E-Mail: [email protected]
>> Tel.: +44 (0) 20-3108-4068
>> Fax.: +44 (0) 20-7383-5519
>>
>>
>>
>
>
> --
> Jeff Hammond
> [email protected]
> http://jeffhammond.github.io/
>
>
>
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel