Hi fellow Chapel coders,

We're looking at optimizing a 2-d loop nest over a triangular iteration space that accesses a 2-d array. My impression is that if we use an ordinary parallel loop (forall) for the outer loop and have, e.g., 4 cores, each core will get a consecutive quarter of the outer loop iterations (i.e., roughly 0 .. N/4 for core 0). Since the inner loop's trip count depends on the outer index, the quarters contain very different amounts of work, so we'll get load imbalance.
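To make the concern concrete, here is a minimal sketch of the loop shape, plus a count of inner iterations per contiguous quarter. The names N, numChunks, and A and the loop body are placeholders for illustration, not our real code, and the chunk arithmetic only mimics the contiguous split we believe forall uses by default.

```chapel
// Placeholder sizes, for illustration only.
config const N = 1000;
config const numChunks = 4;

var A: [0..#N, 0..#N] real;

// Triangular loop nest: row i touches i+1 elements.
forall i in 0..#N do
  for j in 0..i do
    A[i, j] = (i + j): real;

// Count the inner iterations in each contiguous quarter of the rows.
for c in 0..#numChunks {
  const lo = c * N / numChunks,
        hi = (c + 1) * N / numChunks - 1;
  var work = 0;
  for i in lo..hi do
    work += i + 1;
  writeln("chunk ", c, ": rows ", lo, "..", hi, ", work = ", work);
}
```

With four equal-sized row chunks the work grows in the ratio 1 : 3 : 5 : 7, so the last chunk carries about seven times the work of the first.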
One option would be to improve balance with a cyclic distribution of the outer iterations, but that would likely create issues with false sharing of cache lines; a block-cyclic distribution could address this, and we're looking into that. In principle, it would also be interesting to assign the first 1/4 of the total work to core 0, the next 1/4 of the total to core 1, and so on. We're planning to do this by hand, but are curious whether there are any relevant Chapel forall directives, or other options we've not considered.

Thanks for any thoughts about this,
Dave Wonnacott & Sehyeok Park
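
P.S. For concreteness, here is a rough sketch of what the by-hand equal-work split and the cyclic alternative could look like. It assumes row i costs i+1 inner iterations and uses one explicit task per share via coforall; N, numTasks, A, firstRow, and the loop body are hypothetical placeholders, not our actual code.

```chapel
// Placeholder sizes and loop body, for illustration only.
config const N = 1000;
config const numTasks = 4;

var A: [0..#N, 0..#N] real;

// Row i costs i+1 inner iterations, so rows 0..r-1 cost about r*r/2.
// Equal shares of the total therefore start near N * sqrt(t / numTasks).
proc firstRow(t: int): int {
  return (N * sqrt((t: real) / (numTasks: real))): int;
}

// Equal-work split: task t takes a contiguous band of rows.
coforall t in 0..#numTasks {
  for i in firstRow(t)..(firstRow(t + 1) - 1) do
    for j in 0..i do
      A[i, j] = (i + j): real;
}

// Cyclic alternative: task t takes rows t, t + numTasks, t + 2*numTasks, ...
coforall t in 0..#numTasks {
  for i in t..(N - 1) by numTasks do
    for j in 0..i do
      A[i, j] = (i + j): real;
}
```

With N = 1000 and 4 tasks, the equal-work bands start at rows 0, 500, 707, and 866, so the later bands are narrower but cover longer rows. The contiguous bands keep each task on its own rows of A, whereas the cyclic version interleaves rows across tasks, which is where the false-sharing concern comes from.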
_______________________________________________
Chapel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-users
