I hate making passes through memory as this can often cripple performance.
In fact, in something like a well-written computation of the Euclidean norm,
the memory access time and indexing are the only operations that affect the
time to compute the result. The floating point multiply and addition
operations do not impact the run time (at least in the C/C++ work I have
done) as they run concurrently with the memory traffic.
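To make that concrete, here is a minimal sketch of the sort of memory-bound
kernel I mean, a single-pass 2-norm over an assumed 1-D real array 'x'
(illustrative only, not part of the code below):

config const n = 1_000_000;
var x : [1..n] real;
x = 1.0;    // fill with something so the result is non-trivial

// one pass over 'x'; the multiply and add hide behind the memory traffic
const norm2 = sqrt(+ reduce [i in 1..n] x[i] * x[i]);
writeln(norm2);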
I am looking for an approach or discipline to writing clean Chapel code that
is good enough for now and in the long term will produce the fastest running
code once a compiler produced native code. If anybody has time to comment
on the following, especially where I write
Q!!
I would be interested in opinions.
Consider the following chunk of code from the Crout reduction of the j'th
column of
var a : [1..m, 1..n] real(w);
Assume that 'm' and 'n' could vary in the range 10..1000.
Yes, tiled Gaussian elimination may be a better approach to the problem but
that is not the point.
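For the snippets below I assume surrounding declarations along these lines
(my guess at the context the fragments leave implicit):

param w = 64;                    // assumed width of the reals
config const m = 100, n = 100;   // anywhere in 10..1000
var a : [1..m, 1..n] real(w);
var aj : [1..m] real(w);         // working copy of column 'j'
var ajj : real(w);               // pivot magnitude (the last variant declares its own)
var p : int;                     // pivot row (likewise)
const j = 1;                     // the column currently being reduced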
So if we have 'aj' as a real vector of length 'm', one can write
// localize a copy of the data in column 'j'
aj[1..m] = a[1..m, j];
// reduce the j'th column
//
// because the i'th element of the column depends on the elements 1..i-1
// that come before it, the next loop can be neither a forall loop nor
// the equivalent forall expression:
// [i in 1..m] aj[i] -= + reduce [k in 1..min(i, j)-1] a[i, k] * aj[k];
for i in 1..m
{
  aj[i] -= + reduce [k in 1..min(i, j)-1] a[i, k] * aj[k];
}
// stick aj[:] back in the original matrix
a[1..m, j] = aj[1..m];
// deduce the maximum pivotal value and its row
(ajj, p) = maxloc reduce zip(abs(aj[j..m]), j..m);
This reads well but has a lot of passes through memory.
Q!! Am I paranoid, cautious, or silly in worrying about passes through memory?
Now, because of the way aj[i] is computed, that first read/copy to localise
the column storage can be eliminated and the write-back done in the 'for'
loop, so the above can be simplified to
for i in 1..m
{
  a[i, j] -= + reduce [k in 1..min(i, j)-1] a[i, k] * aj[k];
  aj[i] = a[i, j];
}
(ajj, p) = maxloc reduce zip(abs(aj[j..m]), j..m);
Q!! In doing this, am I shooting Chapel's internal smarts in the foot?
Continuing, ...
Now even this simplification still has work to do in (I assume) creating a
new vector to store abs(aj[j..m]) and then making a pass through it during
the reduce. In fact it also has to pass through aj[:] on its way to
computing those absolute values.
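One way I can think of to express the pivot search without naming a
temporary is to feed a forall expression straight into the reduce; whether
this actually avoids the intermediate array is part of what I am asking:

// sketch: compute the absolute values on the fly inside the maxloc reduce
(ajj, p) = maxloc reduce zip([i in j..m] abs(aj[i]), j..m);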
Digressing slightly, note that I assume that eventually
a[i, j] -= ....
aj[i] = a[i, j]
can be optimized as if it were written like (in C)
aj[i] = a[i, j] -= .....
and does not incur the addressing overhead for a[i, j] twice.
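In the meantime, the double addressing can be avoided by hand with a 'ref'
to the element inside the loop body; a sketch of what I mean, using a name
'aijRef' of my own choosing:

ref aijRef = a[i, j];   // address a[i, j] once
aijRef -= + reduce [k in 1..min(i, j)-1] a[i, k] * aj[k];
aj[i] = aijRef;         // reuse the same element through the reference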
Getting back on track, an attempt to do this logic in one pass results in
var ajj = -1.0, aij = 0.0;
var p = j - 1;
for i in 1..m
{
  aij = a[i, j] - (+ reduce [k in 1..min(i, j)-1] a[i, k] * aj[k]);
  aj[i] = aij;
  a[i, j] = aij;
  if i > p              // i.e. i >= j: only rows j..m are pivot candidates
  {
    aij = abs(aij);
    if aij > ajj
    {
      ajj = aij;
      p = i;            // remember the row of the new pivot candidate
    }
  }
}
Sadly, the clarity of the first attempt is lost but unnecessary passes
through memory are avoided.
Q!! Will the compiler produce good quality code for all that scalar stuff?
Q!! Which variant of the above should be considered the best Chapel solution?
Everybody's opinion of what 'best' or 'optimal' means may vary.
Q!! Is it Chapel's aim that the first attempt will ultimately be the fastest?
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer