> I think this may lead to optimizing for a special input range.
I agree - my idea with the benchmark was just to look at how fast the
function is on an interval.
I've looked at one function - say sin - and saw that there are 3 paths the code
may take, depending on the input value. To properly evaluate the performance of
each path I would do x = sin(x + a), then x = x * 0x1p-16, where a is the
minimum value of an interval and x stays close to 0, so every iteration feeds
sin() a value pinned inside that interval. At the end I would know that the
performance is T on the (a, b) interval, 2T on the (b, c) interval, etc.
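As a minimal sketch of that pinning idea (it mirrors the bench_math_sin kernel
in the patch below; the offset 2.f is just an example value that keeps the
argument on one reduction path):

    kernel void bench_sin_interval(global float *src,
                                   global float *dst,
                                   uint loop)
    {
      float x = src[get_global_id(0)];

      for(; loop > 0; loop--)
      {
        x = sin(2.f + x);  /* 2.f + x stays inside one interval */
        x *= 0x1p-16f;     /* shrink sin's [-1, 1] output back near 0 */
      }

      dst[get_global_id(0)] = x;
    }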
Now this won't tell me how sin actually performs overall, since I don't know
how often sin is called with values in (a, b) vs (b, c) or elsewhere - but it
would tell me, for instance, that internal_1 on (a, b) is 6 times faster than
internal_2 on (b, c) and 9 times faster than internal_3 on (c, d), etc.
Coming back to your observation: if I run a test (e.g. Luxmark), see that (b,
c) is called most of the time, optimize that path, and Luxmark doubles in
performance (let's say), but another test, Mandelbulb, which mainly uses sin on
(c, d), doesn't see any increase, then of course I just optimized for one
interval, and it would be wrong to base a general sin performance evaluation on
Luxmark alone - in this example case. So I agree with your observation, but I
am not doing that, yet anyway :)
> I don't quite understand what you mean by "iterating through an interval
> would not offer real world performance"?
> I don't have any good idea, but from my understanding, a large input-value
> range is ok. Any comments?
Say I want to test sin. As I said above, sin has 3 code-path intervals, all
internal, nothing native: two reduction methods (one for normal numbers, one
for very high numbers) and no reduction on a very low interval. If I iterate
over the whole range of numbers (-inf, inf) I actually measure

  Overall_performance =
      Performance_interval_1 * (fraction of the sampled range in interval 1)
    + Performance_interval_2 * (fraction of the sampled range in interval 2)
    + Performance_interval_3 * (fraction of the sampled range in interval 3)

And this gives a general view of the function, I agree. But it doesn't take
into account the general, real-world use of sin. For instance, if interval 3
is the largest and offers the lowest performance (let's assume) but is almost
never hit by tests that use sin, then knowing the overall all-path performance
of sin wouldn't help (because people who use sin might already know that
argument reduction is a big performance penalty on large numbers).
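(Purely illustrative numbers: if 70% of sampled inputs hit path 1 at 1 ns/call,
25% hit path 2 at 2 ns/call, and 5% hit path 3 at 10 ns/call, the full-range
sweep reports 0.70*1 + 0.25*2 + 0.05*10 = 1.7 ns/call on average - a figure
dominated by path 1 even if a given workload lives entirely on path 3.)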
I believe the only way to evaluate whether a change in math code is relevant is
with real-world tests. We thus must have a diverse set of tests that use most
math functions. Ideally one should document which functions each test uses and
in what proportion. I have started doing this, but it's taking a lot of time
due to the complexity of some tests (e.g. Luxmark).
For instance it would be nice to know that Luxmark is impacted "high" {log2},
"normal" {sin, cos, pow, exp}, "low" {sqrt, tan}. Then if I optimize tan I
should not expect Luxmark to change much - but if I target log2 it should
change. Again, if some math functions have intervals, the problem is even more
difficult, since maybe Luxmark only ever uses the (a, b) interval of sin :)
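One hypothetical way to collect those proportions would be a profiling pass
that buckets every sin() input by path before calling it - the kernel name and
the bounds below are placeholders, not Beignet's actual thresholds, and the
host must zero hist[] beforehand:

    kernel void classify_sin_inputs(global const float *src,
                                    global float *dst,
                                    volatile global uint *hist)
    {
      float x = src[get_global_id(0)];
      float ax = fabs(x);

      if(ax < 0.5f)            /* placeholder bound: path (1), no reduction */
        atomic_inc(&hist[0]);
      else if(ax < 4000.0f)    /* placeholder bound: path (2), fast reduction */
        atomic_inc(&hist[1]);
      else                     /* path (3), slow reduction */
        atomic_inc(&hist[2]);

      dst[get_global_id(0)] = sin(x);
    }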
> I think we should separate the benchmark test from the real implementation.
Yes, agree - these tests are only meant to help improve the current Beignet
math implementation. They would be useful for other implementations too, if one
knows the underlying interval sets.
-------------------------------------------------------------
So I see the following optimization flow for Beignet - but it may apply to any
other math implementation for OpenCL:
1. (done) Measure the performance of each interval for a given function (e.g.
sin). We would know perf1 on (a, b), perf2 on (b, c), perf3 on (c, d).
2. (in progress) Run several math tests relevant to sin. Try to identify in
what circumstances sin is called. Maybe all tests call it on (a, b) and (b, c);
then we should target (a, b) and (b, c), because that is what is actually used.
This assumes the math tests are well chosen and diverse.
3. (in progress) Optimize intervals (a, b) and (b, c). Observe how much each
improved, since we can measure performance per interval. Re-run the real-world
math tests.
Any thoughts on this?
I did some optimizations (calls to native and polynomial reduction) and
obtained an increase of at least 5% in about 8-10 of the math tests provided by
Mengmeng. It's quite difficult to target the general case for all math
functions, but I think these changes are relevant to some extent.
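To illustrate the kind of change I mean by polynomial reduction (a sketch only
- the actual patch may use a different degree and coefficients), a low-degree
polynomial can replace the full sin path on a small interval around 0:

    /* degree-7 Taylor-style approximation, valid only for small |x|;
       a minimax fit would give better accuracy for the same cost */
    float sin_small(float x)
    {
      float x2 = x * x;
      return x * (1.0f + x2 * (-1.0f/6.0f
                 + x2 * (1.0f/120.0f
                 + x2 * (-1.0f/5040.0f))));
    }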
-----Original Message-----
From: Song, Ruiling
Sent: Tuesday, May 3, 2016 2:07 PM
To: Lupescu, Grigore <[email protected]>; [email protected]
Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance on
intervals
> -----Original Message-----
> From: Lupescu, Grigore
> Sent: Monday, May 2, 2016 12:32 PM
> To: Song, Ruiling <[email protected]>;
> [email protected]
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
> Regarding the first question - for math functions I made the
> benchmarks to evaluate the performance gaps between native and the
> different internal paths.
> So I would understand where I should maybe focus optimization efforts.
I think this may lead to optimizing for a special input range.
But optimizing for a special input range may be harmful unless the input
NORMALLY lies in that range on the GPU.
If the input data is in a different range, the runtime instruction count will
increase.
I think we should try to optimize for a wider input range and minimize if-else
checks.
>
> I never meant to make a general all-purpose benchmark for any driver -
> I find that quite difficult, since I don't think just iterating
> through an interval would offer real world performance. If you have
> any ideas here though, that would be great :)
I don't quite understand what you mean by "iterating through an interval
would not offer real world performance"?
I think a benchmark using a large input-value range is just enough when doing a
comparison with the native_version or with another OpenCL implementation.
I don't have any good idea, but from my understanding, a large input-value
range is ok. Any comments?
>
> -----Original Message-----
> From: Song, Ruiling
> Sent: Monday, May 2, 2016 5:10 AM
> To: Lupescu, Grigore <[email protected]>;
> [email protected]
> Subject: RE: [Beignet] [PATCH 1/3] Benchmark: Evaluate math
> performance on intervals
>
>
>
> > -----Original Message-----
> > From: Beignet [mailto:[email protected]] On
> > Behalf Of Grigore Lupescu
> > Sent: Monday, May 2, 2016 3:04 AM
> > To: [email protected]
> > Subject: [Beignet] [PATCH 1/3] Benchmark: Evaluate math performance
> > on intervals
> >
> > From: Grigore Lupescu <grigore.lupescu at intel.com>
> >
> > Functions to benchmark math functions on intervals.
> > Tests: sin, cos, exp2, exp, exp10, log2, log, log10
> >
> > Signed-off-by: Grigore Lupescu <grigore.lupescu at intel.com>
> > ---
> > benchmark/CMakeLists.txt | 3 +-
> > benchmark/benchmark_math.cpp | 126 ++++++++++++++++++++
> > kernels/bench_math.cl | 272 +++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 400 insertions(+), 1 deletion(-)
> > create mode 100644 benchmark/benchmark_math.cpp
> > create mode 100644 kernels/bench_math.cl
> >
> > diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
> > index
> > dd33829..4c3c933 100644
> > --- a/benchmark/CMakeLists.txt
> > +++ b/benchmark/CMakeLists.txt
> > @@ -18,7 +18,8 @@ set (benchmark_sources
> > benchmark_copy_buffer_to_image.cpp
> > benchmark_copy_image_to_buffer.cpp
> > benchmark_copy_buffer.cpp
> > - benchmark_copy_image.cpp)
> > + benchmark_copy_image.cpp
> > + benchmark_math.cpp)
> >
> > +/* calls internal fast (native) if (x > -0x1.6p1 && x < 0x1.6p1) */
> > +kernel void bench_math_exp(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_exp(-0x1.6p1 - result); /* calls native */
> > +#elif defined(BENCHMARK_INTERNAL_FAST)
> > + result = exp(-0x1.6p1 + result); /* calls internal fast */
> > +#else
> > + result = exp(-0x1.6p1 - result); /* calls internal slow */
> > +#endif
>
> I think we should separate the benchmark test from the real implementation.
> Then we can make an easy comparison with other driver implementations, and
> also the implementation in Beignet may change in the future.
> What's your idea on this?
>
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +
>
> > +/* benchmark sin performance */
> > +kernel void bench_math_sin(
> > + global float *src,
> > + global float *dst,
> > + float pwr,
> > + uint loop)
> > +{
> > + float result = src[get_global_id(0)];
> > +
> > + for(; loop > 0; loop--)
> > + {
> > +#if defined(BENCHMARK_NATIVE)
> > + result = native_sin(result); /* calls native */
> > +#else
> > + result = sin(result); /* calls internal, random complexity */
>
> What's the range of 'result'? It seems very small? I think we need to
> make sure the input argument to sin() covers a large range,
> as we should try to optimize for the general case.
>
> Thanks!
> Ruiling
> > + //result = sin(0.1f + result); /* calls internal, (1) no reduction */
> > + //result = sin(2.f + result); /* calls internal, (2) fast reduction */
> > + //result = sin(4001 + result); /* calls internal, (3) slow reduction */
> > + result *= 0x1p-16;
> > +#endif
> > + }
> > +
> > + dst[get_global_id(0)] = result;
> > +}
> > +
_______________________________________________
Beignet mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/beignet