Hi Tomsy --
This is just a guess, but here are some notes:
* If you haven't run into this advice already somewhere else, be sure to use
the --fast flag when doing performance comparisons to turn off runtime checks,
turn on C-level optimizations, turn on vectorization, etc. See 'man chpl' for
a description of the --fast flag.
* In case 1, I think you really want to be iterating over {1..n, 1..n} (a 2D n
x n domain) rather than [1..n, 1..n] (an array of 2 ranges with the value
1..n). Perhaps you did this and mistranscribed it into the email, but if not,
you may want to re-run the experiment.
* In both cases, part of the overhead is going to be creating the domains (or
array, depending on the previous point), and it'd be interesting to see how
much of the execution time is the creation of the domain (which tends not to be
cheap) vs. the iteration over it. Creating a domain for each loop to iterate
over O(5) elements is going to be expensive. Better would be to create the
domain once, name it, and iterate over it again and again to amortize that
overhead.
* A third form to consider, which I would expect to be fastest of these three,
would be:
forall i in 1..5 do A[i, 6-i]=1;
The reason I expect this to be the fastest of the three is that it uses a range
rather than a domain, which is a lighter-weight object (that said, it can still
benefit from being named and reused as in the previous bullet).
* The final thing that could hurt performance for loops like these are that if
you've got five iterations and are running on a 4-core machine (say), there
won't be much work per iteration, so you're probably using more parallelism
than is warranted. I.e., the overhead for creating the parallelism is likely
to outweigh the benefit of running the loop in parallel. I'd be curious, for
example, whether using a 'for' loop for any of the three idioms above would
make it go faster. Alternatively, you can dial down the amount of parallelism
using the dataPar* knobs that are described in
https://github.com/chapel-lang/chapel/blob/master/doc/release/README.executing
* Briefly, leader-follower and standalone iterators are used to create
parallelism in forall loops like these. Leader-follower loops are used for
zippered parallel iteration (which you aren't using here) and in cases where
standalone iterators are not available (or, sometimes they aren't used due to
an immaturity in our implementation that needs to be resolved). In either
case, the leader or standalone iterator is responsible for creating the tasks
that will execute the forall loop and dealing out the iterations to those
tasks. If running on a 4-core machine, for example, our default parallel
iterators on ranges and domains will create 4 tasks and divide the iterations
amongst them. If there are 5 iterations, each task will get 1 iteration except
for one which will get 2 -- given that each iteration is very quick (one array
access and assignment), this is why I say that the small amount of work per
iteration may outweigh the overhead of creating the parallelism to begin with.
For a more thorough explanation of leader-follower and standalone iterators,
see
https://github.com/chapel-lang/chapel/blob/master/test/release/examples/primers/parIters.chpl
(or the PGAS paper that defined leader-follower iterators "User-Defined
Parallel Zippered Iterators in Chapel: http://chapel.cray.com/papers.html)
* Here's one more way to write the loop that would definitely use
leader-follower iterators, but which I would still expect to be slower than the
single-range case:
forall (i,j) in zip(1..5, 1..5 by -1) do A[i,j] = 1;
Hope this is helpful,
-Brad
________________________________
From: Tomsy Paul [[email protected]]
Sent: Sunday, May 31, 2015 12:48 AM
To: chapel-developers
Subject: [Chapel-developers] forall versions performance comparison.
Consider the two loops
1. forall (i,j) in [1..n,1..n] do if (i+j)==n+1 then A[i,j]=1;
and
2. forall (i,j) in {(1,5),(2,4),(3,3),(4,2),(5,1)} do A[i,j]=1;//n==5
I thought that the second would perform far better than the first, because of
the lesser number of iterations. But for n=10000, the timings was 0.9 seconds
(for 1) and 0.6 seconds(for 2) on a desktop.
A) Why is it so?
B) Please give a clear picture of leader-follower. Has it got any role in this
problem?
C) What is a stand alone forall loop? please give an example.
------------------------------------------------------------------------------
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers