Hi -
See replies inline below
I am comparing a linear algebra algorithm written in Fortran with my own
Chapel implementation of the same algorithm. About 90% of the computation
time of (slightly more than) half of my algorithm is an operation which
can be written as
inline proc tpxyT(ref t : [?tD] ?R, const ref x : [] R, const ref y : [] R)
{
    const (rows, columns) = tD.dims();
    [ (i, j) in tD ] t[i, j] += x[i] * y[j];
}
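For what it's worth, that loop is a rank-1 (outer-product) update, essentially what BLAS's ?ger routines compute. A minimal C sketch of the same computation, with the element type fixed to double and a row-major layout assumed (the function name and layout are illustrative only, not the Fortran code being compared against):

```c
#include <stddef.h>

/* Rank-1 update t += x * y^T for a row-major n-by-m matrix t,
   mirroring the Chapel loop above.  Name and layout are
   illustrative assumptions, not the actual Fortran reference. */
static void tpxyT_c(size_t n, size_t m, double *t,
                    const double *x, const double *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            t[i * m + j] += x[i] * y[j];
}
```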
I will come back to the above shortly.
I want to compare my code against Fortran.
I want this comparison to be beyond fair! That is the bottom line of this
posting, but do feel free to tell me the above code is less than perfect.
Looking at the last line, it is highly parallelizable. But I assume that when
it is called from somewhere within a routine that is itself called from within
a serial block, and with the program also fed
--dataParTasksPerLocale=1
then no parallelization will happen, i.e. that construct is just a
whopping great hint to the optimizer.
There is some per-loop overhead from using a forall expression in this
way as compared to a `for` loop. Actually
for (i,j) in vectorizeOnly(tD) {
    t[i,j] += x[i] * y[j];
}
would equivalently hint vectorizability (but be aware that
these hints don't do anything today with the C backend).
Is my guess correct?
Would people agree that using this Chapel feature is still scrupulously
fair (in my comparison against Fortran)?
If Chapel vectorized remotely well, the following would be superior:
[i in rows] t[i, ..] += x[i] * y;
But Chapel is not so good at vectorization, so this approach was dropped.
I suspect the issue with this construct is more to do with the
overhead of constructing and operating on array views -
combined with nested parallelism. It'd be interesting to try
to improve the performance of this case if it's the way
you'd really like to express your program.
Today, the expression `t[i, ..]` allocates memory in the process
of creating the array view. That can cause performance problems.
I don't know if that's causing problems in your case here.
Anyway it's also possible that overhead from array views is
preventing vectorization within the C compiler.
Note that, compared to the first approach and with matrices of size
1600x1600 to 4000x4000, my semi-rigorous experiments suggest that this has
a 10-20% longer run-time using 1.22.0, a GCC backend,
and various Intel Xeon CPUs. I think this second line is a far superior
description of what I am trying to do, and I await better vectorization
in Chapel. I was thinking of writing some C code to do
t[i,..] += x[i] * y
but then realised that I have not a clue how to write an interface like
proc tupv(n : int, m : int, t : T, ref u : [] T, const ref v : [] T)
where T is real(32) or real(64) and then use that interface or even if
that interface would be the best one.
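For concreteness, the C side of such an interface, i.e. the per-row body of
t[i,..] += x[i] * y, might look like the following sketch (with T fixed to
double; the function name is purely illustrative and not a proposed interface):

```c
#include <stddef.h>

/* u[j] += a * v[j] for j in 0..m-1: one row of the update
   t[i, ..] += x[i] * y.  Name and signature are illustrative
   only, not a proposed Chapel extern interface. */
static void row_axpy(size_t m, double a, double *u, const double *v)
{
    for (size_t j = 0; j < m; j++)
        u[j] += a * v[j];
}
```

The open question is how to declare this as `extern` in Chapel and hand it a
contiguous row of t.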
I don't think we currently have the ability to pass array views
(such as `t[i, ..]`) into `extern` functions. That might be something
we could extend `chpl_external_array` to do, though.
-michael
Thanks - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers