Dear Timo,
Michal's advice to test with LLVM 6.0 is also a good idea. Given that
you don't seem to have barriers in your kernel, you might want to check
whether any of the implicit barriers we inject confuse the vectorization.
These are inserted at least in the passes:
ImplicitConditionalBarriers.cc
ImplicitLoopBarriers.cc
They have some debug output which you can enable via a macro define,
which can point you in the right direction.
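For illustration, enabling such a define might look like the sketch
below; the exact macro name in the pocl sources may differ, so treat it
as an assumption:

    /* Hypothetical: near the top of lib/llvmopencl/ImplicitLoopBarriers.cc,
       enable the pass's debug define and rebuild pocl so the pass reports
       what it injects into each kernel. The macro name is illustrative. */
    #define DEBUG_ILOOP_BARRIERS 1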
BR,
Pekka
On 02/07/2018 01:34 PM, Timo Betcke wrote:
Dear All,
thanks for the responses. I am posting the kernel below. You can also find it
at
https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
It is the first kernel (evaluate_regular). The second one is less
performance critical. First of all, regarding the processor: I mistyped.
I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but
will redo them today on my Xeon W workstation. I already recompiled
yesterday with LLVM 5, but the results are similar.
Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for
this setup.
I will follow Jeff's and Pekka's advice today to get some more information
on the compiled kernels (it might be slow going, as this is the first time
I am diving into OpenCL profiling).
I just also tried setting the work-group sizes manually, but this did not
change the pocl performance at all; a sketch of the kind of launch call I
mean follows below. The actual kernel is further down and consists of
fairly simple for loops.
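A minimal host-side sketch of forcing an explicit local work-group size
(the names queue, kernel, numTest and numTrialPadded are illustrative,
and our real host code uses PyOpenCL rather than the C API):

    /* Launch one work item per (test, trial) pair with an explicit local
       size. numTrialPadded is numTrial rounded up to a multiple of the
       local size; the kernel's nTrial guard discards the padding items. */
    size_t global[2] = {numTest, numTrialPadded};
    size_t local[2]  = {1, 16};   /* e.g. block the trial dimension by 16 */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                        global, local, 0, NULL, NULL);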
The parameter REALTYPE is controlled by a macro in the header file and set
to 'float' for the benchmark tests.
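For context, a minimal sketch of what those definitions might look like
(an assumption for illustration; the actual bempp-cl header may differ):

    /* Hypothetical header sketch; the real bempp-cl definitions may differ. */
    #define REALTYPE  float
    #define REALTYPE2 float2
    #define REALTYPE3 float3
    #define N_QUAD_POINTS 4                 /* quadrature points per triangle */
    #define M_INV_4PI 0.07957747154594767f  /* 1 / (4 * pi) */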
A little explanation is in order. Each work item takes two triangles (a
test and a trial triangle), computes some geometric properties, and
integrates the 3D Laplace potential operator 1/r across them, where r is
the distance between a quadrature point on the test triangle and one on
the trial triangle. So it is a fairly simple operation that accelerates
very well on the Intel runtime.
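In formulas, each work item computes the quadrature approximation (my
notation, matching the code below):

    I = \frac{|J_\mathrm{test}|\,|J_\mathrm{trial}|}{4\pi}
        \sum_{i=1}^{Q} \sum_{j=1}^{Q} \frac{w_i w_j}{|x_i - y_j|}

where x_i and y_j are the quadrature points mapped to the test and trial
triangles, w_i and w_j the corresponding weights, Q = N_QUAD_POINTS, and
|J| denotes the integration elements (testIntElem, trialIntElem).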
I am quite motivated to get this issue sorted. We are rewriting an
open-source software package (Bempp, www.bempp.com) to be based on
PyOpenCL with OpenCL kernels. The first performance benchmarks gave very
good speed-ups compared to our old C++ code (for both Intel and pocl).
But for later deployment we want to be able to target pocl with minimal
performance penalties compared to the Intel runtime (if possible), as it
is available by default in Ubuntu and can also be simply installed via
conda-forge.
Best wishes
Timo
---------------------------------------------------------------------------------------------
__kernel void evaluate_regular(__constant REALTYPE3 *testElements,
                               __constant REALTYPE3 *trialElements,
                               __constant REALTYPE2 *quadPoints,
                               __constant REALTYPE *quadWeights,
                               __global REALTYPE *globalResult,
                               int nTrial)
{
    /* Variable declarations */
    const int numQuadPoints = N_QUAD_POINTS;
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);
    size_t trialOffset = get_global_offset(1);
    size_t testQuadIndex;
    size_t trialQuadIndex;
    size_t i;

    REALTYPE3 testGlobalPoint;
    REALTYPE3 trialGlobalPoint;
    REALTYPE3 testCorners[3];
    REALTYPE3 trialCorners[3];
    REALTYPE3 testJac[2];
    REALTYPE3 trialJac[2];
    REALTYPE2 testPoint;
    REALTYPE2 trialPoint;
    REALTYPE dist;
    REALTYPE testIntElem;
    REALTYPE trialIntElem;
    REALTYPE shapeIntegral = 0;

    if (myTrialIndex - trialOffset >= nTrial) return;

    for (i = 0; i < 3; ++i) {
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * myTrialIndex + i];
    }

    testJac[0] = testCorners[1] - testCorners[0];
    testJac[1] = testCorners[2] - testCorners[0];
    trialJac[0] = trialCorners[1] - trialCorners[0];
    trialJac[1] = trialCorners[2] - trialCorners[0];

    testIntElem = length(cross(testJac[0], testJac[1]));
    trialIntElem = length(cross(trialJac[0], trialJac[1]));

    for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex) {
        testPoint = quadPoints[testQuadIndex];
        testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x +
            testJac[1] * testPoint.y;
        // testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
        //     testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
        for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints;
             ++trialQuadIndex) {
            trialPoint = quadPoints[trialQuadIndex];
            trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x +
                trialJac[1] * trialPoint.y;
            // trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
            //     trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
            dist = distance(testGlobalPoint, trialGlobalPoint);
            shapeIntegral += quadWeights[testQuadIndex] *
                quadWeights[trialQuadIndex] / dist;
        }
    }

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] =
        shapeIntegral;
}
On 7 February 2018 at 10:41, Michal Babej <[email protected]> wrote:
>
> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
> is likely using fully. LLVM 4 has absolutely horrible AVX-512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
> found; I don't have a machine to test it anymore).
>
> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA? If so, you'll likely see very bad performance, as
> pocl currently has no NUMA tuning. Also, I've occasionally seen that pocl
> unrolls too much and overflows the L1 caches (you could try experimenting
> with various local WG sizes passed to clEnqueueNDRangeKernel).
> Unfortunately this part of pocl has received little attention lately...
>
> Cheers,
> -- mb
--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: [email protected]
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
--
Pekka