Hi Tomsy --

This is just a guess, but here are some notes:

* If you haven't run into this advice already somewhere else, be sure to use 
the --fast flag when doing performance comparisons to turn off runtime checks, 
turn on C-level optimizations, turn on vectorization, etc.  See 'man chpl' for 
a description of the --fast flag.

* In case 1, I think you really want to be iterating over {1..n, 1..n} (a 2D n 
x n domain) rather than [1..n, 1..n] (an array of 2 ranges with the value 
1..n).  Perhaps you did this and mistranscribed it into the email, but if not, 
you may want to re-run the experiment.

* In both cases, part of the overhead is going to be creating the domains (or 
array, depending on the previous point), and it'd be interesting to see how 
much of the execution time is the creation of the domain (which tends not to be 
cheap) vs. the iteration over it.  Creating a domain for each loop to iterate 
over O(5) elements is going to be expensive.  Better would be to create the 
domain once, name it, and iterate over it again and again to amortize that 
overhead.

* A third form to consider, which I would expect to be fastest of these three, 
would be:

forall i in 1..5 do A[i, 6-i]=1;

The reason I expect this to be the fastest of the three is that it uses a range 
rather than a domain, which is a lighter-weight object (that said, it can still 
benefit from being named and reused as in the previous bullet).

* The final thing that could hurt performance for loops like these are that if 
you've got five iterations and are running on a 4-core machine (say), there 
won't be much work per iteration, so you're probably using more parallelism 
than is warranted.  I.e., the overhead for creating the parallelism is likely 
to outweigh the benefit of running the loop in parallel.  I'd be curious, for 
example, whether using a 'for' loop for any of the three idioms above would 
make it go faster.  Alternatively, you can dial down the amount of parallelism 
using the dataPar* knobs that are described in 
https://github.com/chapel-lang/chapel/blob/master/doc/release/README.executing

* Briefly, leader-follower and standalone iterators are used to create 
parallelism in forall loops like these.  Leader-follower loops are used for 
zippered parallel iteration (which you aren't using here) and in cases where 
standalone iterators are not available (or, sometimes they aren't used due to 
an immaturity in our implementation that needs to be resolved).  In either 
case, the leader or standalone iterator is responsible for creating the tasks 
that will execute the forall loop and dealing out the iterations to those 
tasks.  If running on a 4-core machine, for example, our default parallel 
iterators on ranges and domains will create 4 tasks and divide the iterations 
amongst them.  If there are 5 iterations, each task will get 1 iteration except 
for one which will get 2 -- given that each iteration is very quick (one array 
access and assignment), this is why I say that the small amount of work per 
iteration may outweigh the overhead of creating the parallelism to begin with.  
For a more thorough explanation of leader-follower and standalone iterators, 
see 
https://github.com/chapel-lang/chapel/blob/master/test/release/examples/primers/parIters.chpl
(or the PGAS paper that defined leader-follower iterators "User-Defined 
Parallel Zippered Iterators in Chapel:  http://chapel.cray.com/papers.html)

* Here's one more way to write the loop that would definitely use 
leader-follower iterators, but which I would still expect to be slower than the 
single-range case:

forall (i,j) in zip(1..5, 1..5 by -1) do A[i,j] = 1;

Hope this is helpful,

-Brad


________________________________
From: Tomsy Paul [[email protected]]
Sent: Sunday, May 31, 2015 12:48 AM
To: chapel-developers
Subject: [Chapel-developers] forall versions performance comparison.

Consider the two loops

1. forall (i,j) in [1..n,1..n] do if (i+j)==n+1 then A[i,j]=1;
and
2. forall (i,j) in {(1,5),(2,4),(3,3),(4,2),(5,1)} do A[i,j]=1;//n==5

I thought that the second would perform far better than the first, because of 
the lesser number of iterations. But for n=10000, the timings was 0.9 seconds 
(for 1) and 0.6 seconds(for 2) on a desktop.


A) Why is it so?
B) Please give a clear picture of leader-follower. Has it got any role in this 
problem?
C) What is a stand alone forall loop? please give an example.

------------------------------------------------------------------------------
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to