OK, this is a Chapel version of the Deriche image-processing kernel from
the PolyBench benchmark. It involves some loops that look like this:
  for i in 0..w-1 {
    ym1 = 0.0;
    ym2 = 0.0;
    xm1 = 0.0;
    for j in 0..h-1 {
      y1[i,j] = a1*imgIn[i,j] + a2*xm1 + b1*ym1 + b2*ym2;
      xm1 = imgIn[i,j];
      ym2 = ym1;
      ym1 = y1[i,j];
    }
  }
and which we've abstracted with a class that captures the idea of "keep
track of the previously-written value" like this:
  class ourArray {
    const W: int;
    const H: int;
    const dom: domain(2); // maybe shouldn't be a const
    var Vals: [dom] real;
    var mostRecentWrite: real; // ym1
    var previousWrite: real;   // ym2
    proc init(width: int, height: int) {
      W = width; H = height; dom = {0..W-1, 0..H-1}; }
    proc set(i: int, j: int, value: real) {
      previousWrite = mostRecentWrite; // not used in try1 or try2
      mostRecentWrite = value;         // not used in try1 or try2
      Vals[i,j] = mostRecentWrite;     // set to 'value' in try1 and try2
    }
    proc get(i: int, j: int) { return Vals[i,j]; }
    proc resetScalars() { mostRecentWrite = 0.0; previousWrite = 0.0; }
    proc get_mRW() { return mostRecentWrite; }
  // ...
  for i in 0..w-1 {
    xm1 = 0.0;
    y1.resetScalars();
    for j in 0..h-1 {
      y1.set(i,j, a1*imgIn[i,j] + a2*xm1 + b1*y1.get_mRW() + b2*y1.get_pW());
      xm1 = imgIn[i,j];
    }
  }
When we leave the original scalars in place and use get/set only on the
*array* elements, we don't lose much performance, even if we're also
setting redundant scalar values in the class as well as in the main loop.
However, if we use the 'get_mRW' and 'get_pW' methods to access the
scalars, we get a much more significant drop. I can attach a larger
collection of files if you need more than just the basic idea.
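
For reference, the fragments above can be assembled into one self-contained
program. This is only a minimal sketch under several assumptions: the class
name 'Tracked', the 'get_pW' accessor body, the coefficient values, the
4x4 problem size, and the constant input image are all placeholders that are
not taken from the real code. It also marks the accessors with Chapel's
'inline' keyword on procedures, purely as one variant that could be measured;
whether that changes the timings reported here has not been tested.

```chapel
// Sketch only: 'Tracked', the coefficients, and the driver are assumptions.
class Tracked {
  const W: int;
  const H: int;
  const dom: domain(2);
  var Vals: [dom] real;
  var mostRecentWrite: real;  // plays the role of ym1
  var previousWrite: real;    // plays the role of ym2

  proc init(width: int, height: int) {
    this.W = width;
    this.H = height;
    this.dom = {0..width-1, 0..height-1};
  }

  // 'inline' asks the Chapel compiler to inline these small methods.
  inline proc set(i: int, j: int, value: real) {
    previousWrite = mostRecentWrite;
    mostRecentWrite = value;
    Vals[i,j] = value;
  }
  inline proc get(i: int, j: int) { return Vals[i,j]; }
  inline proc resetScalars() { mostRecentWrite = 0.0; previousWrite = 0.0; }
  inline proc get_mRW() { return mostRecentWrite; }
  inline proc get_pW() { return previousWrite; }  // assumed by analogy with get_mRW
}

config const w = 4, h = 4;                            // placeholder sizes
const a1 = 0.5, a2 = 0.25, b1 = 0.125, b2 = 0.0625;   // placeholder coefficients

var imgIn: [0..w-1, 0..h-1] real = 1.0;               // placeholder input
var y1 = new Tracked(w, h);

for i in 0..w-1 {
  var xm1 = 0.0;
  y1.resetScalars();
  for j in 0..h-1 {
    y1.set(i, j, a1*imgIn[i,j] + a2*xm1 + b1*y1.get_mRW() + b2*y1.get_pW());
    xm1 = imgIn[i,j];
  }
}
writeln(y1.get(0, 0));
```

Removing the 'inline' keywords gives the plain-method variant, so the two can
be timed against each other on the same input.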
I've attached a graph of performance (higher=faster) as a function of data
set size (number of bytes per array), with various experiments. There are 3
runs for each code, so there is also a minor x-offset so that things don't
end up on top of each other and you can see the data; in other words, what
look sort of like two columns on the left really all use the same problem
size. Original, try1, and try2 are just variants on how we access the
*array* data; try3 involves storing the scalars in the 'set' method, and
try4, try5, and concise are various approaches to *using* the scalars
from the class... try4 is the code shown in the loop above.
Dave W
On Wed, Mar 21, 2018 at 4:34 PM, Brad Chamberlain <[email protected]> wrote:
>
> Hi David --
>
> I think the tools you'd use to optimize cases like this depend heavily on
> the idioms in the code. Can you share a simplified program that
> demonstrates the pattern you're wrestling with as a basis for further
> conversation?
>
> Thanks,
> -Brad
>
>
> On Wed, 21 Mar 2018, David G. Wonnacott wrote:
>
>> I've done some experiments and found that the performance of some code I'm
>> writing seems to be limited by the use of 'get' methods to access some
>> scalars. In C++, I'd use the 'inline' keyword to try to optimize these ...
>> is there an equivalent for Chapel? Should I be changing the class to a
>> record? That would be slightly inconvenient but not really a problem.
>>
>> Dave W
>>
>>
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users