One other thing that just came to mind that you might try, assuming your
code uses a limited number of arrays whose lifetimes are similar to the
program's (as opposed to creating and destroying arrays throughout its
execution): compile with the -snoRefCount flag, which turns off reference
counting for domains and arrays. At present, our reference counting for
arrays is overly conservative in some cases and, as a result, can add a
lot of overhead to programs. The flag above is a short-term workaround
for this issue -- we have some long-term work to improve this situation.
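For instance, combined with --fast (the source file name here is just a
placeholder):

  chpl --fast -snoRefCount myStencil.chpl -o myStencil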
-Brad
On Thu, 15 May 2014, Brad Chamberlain wrote:
>
> Hi Herve --
>
>> * We've been working with Chapel 1.8.0 since the latest version was not
>> yet available when we started. Except when we tried to profile the C code,
>> we always used the --fast flag; initially we did not, and we saw the
>> difference in terms of performance.
>
> Thoroughly understandable. Switching to 1.9.0, I expect that you will see
> some performance improvements, though there is still much room for
> improvement. I'm glad you were always using the --fast flag -- the lack of
> it in the sample command lines in the email is what made me worry.
>
>> I'll send the complete code tomorrow so that you can have a look when you
>> have time. We are using two 2D arrays to store the cell data, one for the
>> n-1 iteration and one for the n iteration.
>
> You know, in retrospect, I think I misinterpreted something. I was thinking
> that dsiAccess3 implied a 3D array access, but given that you're using 2D
> arrays, I think it must mean that it's the third overload of a function of
> that name. Sorry for the confusion on my part -- ZPL embedded the rank of
> the arrays into such function names, and I think I flashed back to that when
> looking at that stack.
>
>
>> * After I sent the email the first time, I tried removing the domain
>> mapping, so for single-locale execution I now use:
>>   const physicalDomain: domain(2) = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
>> instead of
>>   const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x,
>>     1..pb.nb_cell_y}) = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
>> and the performance improved greatly.
>> However, I'm also doing multi-locale execution on an SGI Altix ICE 8200
>> server with up to 16 nodes of 8 processors each, which is why I initially
>> used the Block distribution by default in the code.
>
> Makes sense, and I agree that ultimately, you will want to use the Block
> distribution. That said, as you saw, the Block distribution incurs overheads
> that have not been optimized away yet, and due to the way we currently do
> unoptimized communication (very fine-grained, very demand-driven), stencil
> patterns are a particularly bad case in Chapel compared to a hand-coded MPI
> kernel. This is what the miniMD/stencil9 work that I mentioned last summer
> was aimed at improving -- how to coarsen the communication, use ghost
> cells, etc. We can provide more information on that work if you are
> interested. As mentioned previously, it's something we plan/hope to spend
> more time on this year.
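>
> To make the fine-grained case concrete, here is a rough sketch of the kind
> of stencil access I have in mind (array and domain names below are just
> placeholders, and the arrays are assumed to be declared over a domain that
> includes the boundary cells):
>
>    forall (i,j) in interiorDomain do
>      next[i,j] = 0.25 * (prev[i-1,j] + prev[i+1,j] +
>                          prev[i,j-1] + prev[i,j+1]);
>
> With today's demand-driven communication, each off-locale reference to prev
> turns into its own small transfer; ghost cells and coarsened communication
> aggregate those transfers into a few larger ones per iteration.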
>
>
>> * About the distribution used on x_domain and y_domain, I supposed that
>> when the domains are too small it could/would decrease performance,
>> but I did it initially because we planned to use large domains (32768x32768
>> cells) and I wanted the arrays containing the boundary data to be processed
>> in parallel.
>> I also tried not using the Block distribution for x_domain and y_domain
>> before, but on this type of grid I did not see any improvement.
>
> To be clear, I do think you want the boundaries distributed, but I was just
> hypothesizing that you would do better to distribute them relative to the
> physical space rather than independently. I.e., by creating the boundaries
> relative to the physical space, they will still be distributed, but to the
> same locales that own the corresponding cells in the physical space (so, on a
> p x p locale grid, they'd be distributed over the p locales on one edge
> rather than all p x p locales). Again, my thinking is that it's better to
> use a subset of the resources and align the data with the physical space that
> it correlates with to avoid communication than to use all the resources and
> require more communication (and more arbitrary communication) between the
> physical space and the boundaries. And again, this is predicated on the assumption
> that there's asymptotically less work going on at the boundaries and
> therefore you can tolerate using only a subset of the locales (esp. since
> other boundaries could be computed simultaneously, allowing you to use
> something like 4p locales in parallel rather than serializing the boundary
> computations).
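>
> A minimal sketch of the kind of declarations I have in mind (sizes and
> names are placeholders, and it assumes the boundary data lives in a
> one-cell halo around the physical space):
>
>    use BlockDist;
>
>    const nx = 32768, ny = 32768;
>
>    // one Block domain map shared by the physical space and its boundaries
>    const SpaceDist = new dmap(new Block(boundingBox={1..nx, 1..ny}));
>
>    const physicalDomain: domain(2) dmapped SpaceDist = {1..nx, 1..ny};
>
>    // boundary domains declared over the same domain map: each boundary
>    // row/column is owned by the locales that own the adjacent interior
>    // cells, i.e., the p locales along that edge of a p x p locale grid
>    const northBoundary: domain(2) dmapped SpaceDist = {0..0,       1..ny};
>    const southBoundary: domain(2) dmapped SpaceDist = {nx+1..nx+1, 1..ny};
>    const westBoundary:  domain(2) dmapped SpaceDist = {1..nx, 0..0};
>    const eastBoundary:  domain(2) dmapped SpaceDist = {1..nx, ny+1..ny+1};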
>
> The one other thing motivating this suggestion is that Chapel has been
> designed such that if a number of domains share the same domain map (as these
> would), it gives the compiler more semantic information about the relative
> alignment (and therefore, lack of need for communication) than if every
> domain has a different domain map. That said, this is a forward-looking
> characterization in that we haven't implemented it yet in Chapel (or, you can
> consider it backward-looking, as it is what we implemented in ZPL).
>
> -Brad
>
>