Hi fellow Chapel coders ---

We're looking at optimizing a 2-D loop nest over a triangular iteration
space that uses a 2-D array. My impression is that, if we use an ordinary
parallel loop (forall) for the outer loop and have, e.g., 4 cores,
each core will get a consecutive quarter of the outer loop iterations
(e.g., 0 .. N/4-1 for core 0). Since the inner loop's trip count varies
with the outer index, those quarters contain very different amounts of
work, and so we'll get load imbalance.

One option would be to improve balance with a cyclic distribution, but
that would likely create false sharing of cache lines; a block-cyclic
distribution could address this, and we're looking into that. In principle,
it would also be interesting to assign the first 1/4 of the total work to
core 0, the next 1/4 of the total to core 1, etc. We're planning to do
this by hand, but are curious whether there are any relevant Chapel forall
directives, or other options we've not considered.
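
To make the "quarters of the total work" idea concrete, here is a minimal
sketch of the boundary arithmetic. It is written in Python purely for
illustration and is not Chapel code. It assumes a lower-triangular space in
which outer row i runs i + 1 inner iterations (0 <= j <= i), and the names
equal_work_bounds and chunk_work are hypothetical helpers, not part of any
library.

```python
from bisect import bisect_left
from itertools import accumulate


def equal_work_bounds(n, cores):
    """Split outer rows 0..n-1 into `cores` consecutive chunks of similar work.

    Assumption: row i costs i + 1 inner iterations (lower triangle).
    Returns cores + 1 boundaries; chunk k covers rows bounds[k]..bounds[k+1]-1.
    """
    # prefix[i] = total inner iterations in rows 0..i
    prefix = list(accumulate(i + 1 for i in range(n)))
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for k in range(1, cores):
        target = total * k / cores
        # First row whose cumulative work reaches the target; cut after it.
        cut = bisect_left(prefix, target) + 1
        bounds.append(min(cut, n))
    bounds.append(n)
    return bounds


def chunk_work(bounds):
    """Inner-iteration count of each chunk, under the same cost assumption."""
    return [sum(i + 1 for i in range(lo, hi))
            for lo, hi in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    n, cores = 1000, 4
    block = [n * k // cores for k in range(cores + 1)]  # plain quarters
    print("block bounds     ", block, chunk_work(block))
    balanced = equal_work_bounds(n, cores)
    print("equal-work bounds", balanced, chunk_work(balanced))
```

Under that cost assumption, with N = 1000 and 4 cores, plain quarters give
the last core roughly seven times the inner iterations of the first, while
the equal-work cuts keep every chunk within one row's worth of work of the
others. A different triangle orientation or a non-uniform cost per inner
iteration would need a different prefix sum.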

Thanks for any thoughts about this,
   Dave Wonnacott & Sehyeok Park