> I think this may lead to optimizing for a special input range.
I agree - my idea with the benchmark was just to look at how fast the
function is on an interval.
I've looked at one function - say sin - and saw that there are 3 paths the code
may take, depending on the input value. To properly evaluate the performance of
each path I would do x = sin(x + a), then x = x * 0x1p-16, where a is the
minimum value of an interval and x stays close to 0, so every iteration feeds
sin() a value pinned inside that interval. At the end I would know that the
performance is T on the (a, b) interval, 2T on the (b, c) interval, etc.
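As a minimal sketch of that pinning idea (it mirrors the bench_math_sin kernel
in the patch below; the offset 2.f is just an example value that keeps the
argument on one reduction path):

    kernel void bench_sin_interval(global float *src,
                                   global float *dst,
                                   uint loop)
    {
      float x = src[get_global_id(0)];

      for(; loop > 0; loop--)
      {
        x = sin(2.f + x);  /* 2.f + x stays inside one interval */
        x *= 0x1p-16f;     /* shrink sin's [-1, 1] output back near 0 */
      }

      dst[get_global_id(0)] = x;
    }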
Now this won't tell me how sin actually performs overall, since I don't know
how often sin is called with values in (a, b) vs (b, c) or elsewhere - but it
would tell me, for instance, that internal_1 on (a, b) is 6 times faster than
internal_2 on (b, c) and 9 times faster than internal_3 on (c, d), etc.
Coming back to your observation: if I run a test (e.g. Luxmark), see that (b,
c) is called most of the time, optimize that path, and Luxmark doubles in
performance (let's say), but another test, Mandelbulb, which mainly uses sin on
(c, d), doesn't see any increase, then of course I just optimized for one
interval, and it would be wrong to base a general sin performance evaluation on
Luxmark alone - in this example case. So I agree with your observation, but I
am not doing that, yet anyway :)
> I don't quite understand what you mean by "iterating through an interval
> would not offer real world performance"?
> I don't have any good idea, but from my understanding, a large input-value
> range is ok. Any comments?
Say I want to test sin. As I said above, sin has 3 code-path intervals, all
internal, nothing native: two reduction methods (one for normal numbers, one
for very high numbers) and no reduction on a very low interval. If I iterate
over the whole range of numbers (-inf, inf) I actually measure

  Overall_performance =
      Performance_interval_1 * (fraction of the sampled range in interval 1)
    + Performance_interval_2 * (fraction of the sampled range in interval 2)
    + Performance_interval_3 * (fraction of the sampled range in interval 3)

And this gives a general view of the function, I agree. But it doesn't take
into account the general, real-world use of sin. For instance, if interval 3
is the largest and offers the lowest performance (let's assume) but is almost
never hit by tests that use sin, then knowing the overall all-path performance
of sin wouldn't help (because people who use sin might already know that
argument reduction is a big performance penalty on large numbers).
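(Purely illustrative numbers: if 70% of sampled inputs hit path 1 at 1 ns/call,
25% hit path 2 at 2 ns/call, and 5% hit path 3 at 10 ns/call, the full-range
sweep reports 0.70*1 + 0.25*2 + 0.05*10 = 1.7 ns/call on average - a figure
dominated by path 1 even if a given workload lives entirely on path 3.)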
I believe the only way to evaluate whether a change in math code is relevant is
with real-world tests. We thus must have a diverse set of tests that use most
math functions. Ideally one should document which functions each test uses and
in what proportion. I have started doing this, but it's taking a lot of time
due to the complexity of some tests (e.g. Luxmark).
For instance it would be nice to know that Luxmark is impacted "high" {log2},
"normal" {sin, cos, pow, exp}, "low" {sqrt, tan}. Then if I optimize tan I
should not expect Luxmark to change much - but if I target log2 it should
change. Again, if some math functions have intervals, the problem is even more
difficult, since maybe Luxmark only ever uses the (a, b) interval of sin :)
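One hypothetical way to collect those proportions would be a profiling pass
that buckets every sin() input by path before calling it - the kernel name and
the bounds below are placeholders, not Beignet's actual thresholds, and the
host must zero hist[] beforehand:

    kernel void classify_sin_inputs(global const float *src,
                                    global float *dst,
                                    volatile global uint *hist)
    {
      float x = src[get_global_id(0)];
      float ax = fabs(x);

      if(ax < 0.5f)            /* placeholder bound: path (1), no reduction */
        atomic_inc(&hist[0]);
      else if(ax < 4000.0f)    /* placeholder bound: path (2), fast reduction */
        atomic_inc(&hist[1]);
      else                     /* path (3), slow reduction */
        atomic_inc(&hist[2]);

      dst[get_global_id(0)] = sin(x);
    }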
> I think we should separate the benchmark test from the real implementation.
Yes, agree - these tests are only meant to help improve the current Beignet
math implementation. They would be useful for other implementations too, if one
knows the underlying interval sets.
-------------------------------------------------------------
So I see the following optimization flow for Beignet - but it may apply to any
other math implementation for OpenCL:
1. (done) Measure the performance of each interval for a given function (e.g.
sin). We would know perf1 on (a, b), perf2 on (b, c), perf3 on (c, d).
2. (in progress) Run several math tests relevant to sin. Try to identify in
what circumstances sin is called. Maybe all tests call it on (a, b) and (b, c);
then we should target (a, b) and (b, c), because that is what is actually used.
This assumes the math tests are well chosen and diverse.
3. (in progress) Optimize intervals (a, b) and (b, c). Observe how much each
improved, since we can measure performance per interval. Re-run the real-world
math tests.
Any thoughts on this?
I did some optimizations (calls to native and polynomial reduction) and
obtained an increase of at least 5% in about 8-10 of the math tests provided by
Mengmeng. It's quite difficult to target the general case for all math
functions, but I think these changes are relevant to some extent.
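To illustrate the kind of change I mean by polynomial reduction (a sketch only
- the actual patch may use a different degree and coefficients), a low-degree
polynomial can replace the full sin path on a small interval around 0:

    /* degree-7 Taylor-style approximation, valid only for small |x|;
       a minimax fit would give better accuracy for the same cost */
    float sin_small(float x)
    {
      float x2 = x * x;
      return x * (1.0f + x2 * (-1.0f/6.0f
                 + x2 * (1.0f/120.0f
                 + x2 * (-1.0f/5040.0f))));
    }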
-----Original Message-----
From: Song, Ruiling
Sent: Tuesday, May 3, 2016 2:07 PM
To: Lupescu, Grigore <[email protected]>; [email protected]
Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on
intervals
> -----Original Message-----
> From: Lupescu, Grigore
> Sent: Monday, May 2, 2016 12:32 PM
> To: Song, Ruiling <[email protected]>;
> [email protected]
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
> Regarding the first question - for math functions I made the
> benchmarks to evaluate the performance gaps between native and the
> different internal paths.
> So I would understand where I should maybe focus optimization efforts.
I think this may lead to optimizing for a special input range.
But optimizing for a special input range may be harmful unless the input
NORMALLY lies in that range on the GPU.
If the input data is in a different range, the runtime instruction count will
increase.
I think we should try to optimize for a wider input range and minimize if-else
checks.
>
> I never meant to make a general all-purpose benchmark for any driver -
> I find that quite difficult, since I don't think just iterating
> through an interval would offer real world performance. If you have
> any ideas here though, that would be great :)
I don't quite understand what you mean by "iterating through an interval
would not offer real world performance"?
I think a benchmark using a large input-value range is just enough when doing a
comparison with the native_version or with another OpenCL implementation.
I don't have any good idea, but from my understanding, a large input-value
range is ok. Any comments?
>
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, May 2, 2016 5:10 AM
> To: Lupescu, Grigore <[email protected]>;
> [email protected]
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
>
>
> > -----Original Message-----
> > From: Beignet [mailto:[email protected]] On
> > Behalf Of Grigore Lupescu
> > Sent: Monday, May 2, 2016 3:04 AM
> > To: [email protected]
> > Subject: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance
> > on intervals
> >
> > From: Grigore Lupescu <grigore.lupescu at intel.com>
> >
> > Functions to benchmark math functions on intervals.
> > Tests: sin, cos, exp2, exp, exp10, log2, log, log10
> >
> > Signed-off-by: Grigore Lupescu <grigore.lupescu at intel.com>
> > ---
> > benchmark/CMakeLists.txt | 3 +-
> > benchmark/benchmark_math.cpp | 126 ++++++++++++++++++++
> > kernels/bench_math.cl | 272 +++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 400 insertions(+), 1 deletion(-)
> > create mode 100644 benchmark/benchmark_math.cpp
> > create mode 100644 kernels/bench_math.cl
> >
> > diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
> > index
> > dd33829..4c3c933 100644
> > --- a/benchmark/CMakeLists.txt
> > +++ b/benchmark/CMakeLists.txt
> > @@ -18,7 +18,8 @@ set (benchmark_sources
> > benchmark_copy_buffer_to_image.cpp
> > benchmark_copy_image_to_buffer.cpp
> > benchmark_copy_buffer.cpp
> > - benchmark_copy_image.cpp)
> > + benchmark_copy_image.cpp
> > + benchmark_math.cpp)
> >
> > +/* calls internal fast (native) if (x > -0x1.6p1 && x < 0x1.6p1) */
> > +kernel void bench_math_exp(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_exp(-0x1.6p1 - result); /* calls native */
> > +#elif defined(BENCHMARK_INTERNAL_FAST)
> > + result = exp(-0x1.6p1 + result); /* calls internal fast */
> > +#else
> > + result = exp(-0x1.6p1 - result); /* calls internal slow */
> > +#endif
>
> I think we should separate the benchmark test from the real implementation.
> Then we can make an easy comparison with other driver implementations, and
> also the implementation in Beignet may change in the future.
> What's your idea on this?
>
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +
>
> > +/* benchmark sin performance */
> > +kernel void bench_math_sin(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_sin(result); /* calls native */
> > +#else
> > + result = sin(result); /* calls internal, random complexity */
>
> What's the range of 'result'? It seems very small? I think we need to
> make sure the input argument to sin() covers a large range,
> as we should try to optimize for the general case.
>
> Thanks!
> Ruiling
> > + //result = sin(0.1f + result); /* calls internal, (1) no reduction */
> > + //result = sin(2.f + result); /* calls internal, (2) fast reduction */
> > + //result = sin(4001 + result); /* calls internal, (3) slow reduction */
> > + result *= 0x1p-16;
> > +#endif
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +
_______________________________________________
Beignet mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/beignet