One other thing just came to mind that you might try, assuming your 
code uses a limited number of arrays with a lifetime similar to the 
program (as opposed to creating/destroying arrays throughout its 
execution):  Compile with the -snoRefCount flag, which turns off reference 
counting for domains and arrays.  At present, our reference counting for 
arrays is overly conservative in some cases and, as a result, can add a 
lot of overhead to programs.  The flag above is a short-term workaround 
for this issue -- we have some long-term work to improve this situation.
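
For example, assuming your main source file is named myProgram.chpl (a sketch 
only -- substitute your actual file name and whatever other flags you normally 
pass):

   chpl --fast -snoRefCount myProgram.chpl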

-Brad


On Thu, 15 May 2014, Brad Chamberlain wrote:

>
> Hi Herve --
>
>> * We've been working with Chapel 1.8.0 since the latest version was not yet 
>> available when we started. Except when we tried to profile the C code, we 
>> always used the --fast flag; initially we did not, and we saw the difference 
>> in terms of performance.
>
> Thoroughly understandable.  Switching to 1.9.0, I expect that you will see 
> some performance improvements, though there is still much room for 
> improvement.  I'm glad you were always using the --fast flag -- the lack of 
> it in the sample command lines in the email is what made me worry.
>
>> I'll send the complete code tomorrow so that you can have a look when you 
>> have time. We are using two 2D arrays to store the cell data, one for the 
>> n-1 iteration and one for the n iteration.
>
> You know, in retrospect, I think I misinterpreted something.  I was thinking 
> that dsiAccess3 implied a 3D array access, but given that you're using 2D 
> arrays, I think it must mean that it's the third overload of a function of 
> that name.  Sorry for the confusion on my part -- ZPL embedded the rank of 
> the arrays into such function names, and I think I flashed back to that when 
> looking at that stack.
>
>
>> * After I sent the first email, I tried removing the domain mapping, so for 
>> single-locale execution I now use:
>> const physicalDomain: domain(2) = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
>> instead of
>> const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x, 
>> 1..pb.nb_cell_y}) = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
>> and the performance improved greatly.
>> However, I'm also doing multi-locale execution on an SGI Altix ICE 8200 
>> server, with up to 16 nodes of 8 processors each; that's why I initially 
>> used the Block distribution by default in the code.
>
> Makes sense, and I agree that ultimately, you will want to use the Block 
> distribution.  That said, as you saw, the Block distribution incurs overheads 
> that have not been optimized away yet, and due to the way we do unoptimized 
> communication at present (very fine-grained, very demand-driven), stencil 
> patterns are a particularly bad case in Chapel compared to a hand-coded MPI 
> kernel.  This is what the miniMD/stencil9 work that I mentioned last summer 
> was working on improving -- how to coarsen the communications, use ghost 
> cells, etc.  We can provide more information on that work if you are 
> interested.  As mentioned previously, it's something we plan/hope to spend 
> more time on this year.
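>
> As a rough illustration (a sketch only -- innerDomain, cellsOld, and cellsNew 
> are made-up names standing in for your interior domain and your n-1/n 
> iteration arrays), a naive 5-point stencil over Block-distributed arrays 
> looks like:
>
>   forall (i,j) in innerDomain do
>     cellsNew[i,j] = 0.25 * (cellsOld[i-1,j] + cellsOld[i+1,j] +
>                             cellsOld[i,j-1] + cellsOld[i,j+1]);
>
> With today's unoptimized communication, each neighbor read of cellsOld that 
> falls on another locale becomes its own fine-grained remote get, whereas a 
> hand-coded MPI kernel would exchange whole rows/columns of ghost cells in 
> bulk -- that is the gap the miniMD/stencil9 work is aimed at closing.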
>
>
>> * About the distribution used on x_domain and y_domain, I supposed that 
>> when the domains are too small it could decrease performance, but I did it 
>> initially because we planned to use large domains (32768x32768 cells) and I 
>> wanted the arrays containing the boundary data to be processed in parallel.
>> I also tried not using the Block distribution for x_domain and y_domain 
>> before, but on this type of grid I did not see improvements.
>
> To be clear, I do think you want the boundaries distributed, but I was just 
> hypothesizing that you would do better to distribute them relative to the 
> physical space rather than independently.  I.e., by creating the boundaries 
> relative to the physical space, they will still be distributed, but to the 
> same locales that own the corresponding cells in the physical space (so, on a 
> p x p locale grid, they'd be distributed over the p locales on one edge 
> rather than all p x p locales).  Again, my thinking is that it's better to 
> use a subset of the resources and align the data with the physical space that 
> it correlates with to avoid communication than to use all the resources and 
> require more communication (and more arbitrary communication) between the 
> physical space and the boundaries.  And again, this is predicated on the assumption 
> that there's asymptotically less work going on at the boundaries and 
> therefore you can tolerate using only a subset of the locales (esp. since 
> other boundaries could be computed simultaneously, allowing you to use 
> something like 4p locales in parallel rather than serializing the boundary 
> computations).
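>
> Concretely (a sketch only -- physicalDomain, x_domain, y_domain, and 
> pb.nb_cell_x/pb.nb_cell_y are taken from your code, but the exact boundary 
> ranges shown are made up for illustration, and will depend on how your 
> boundaries are laid out), the boundaries could be declared as subdomains of 
> the physical space so that they inherit its Block distribution:
>
>   // assumes 'use BlockDist;' as in your existing code
>   const physicalDomain: domain(2) dmapped Block({1..pb.nb_cell_x, 1..pb.nb_cell_y})
>                       = {1..pb.nb_cell_x, 1..pb.nb_cell_y};
>   // each boundary lives on the locales that own the corresponding physical cells
>   const x_domain: subdomain(physicalDomain) = {1..pb.nb_cell_x, 1..1};
>   const y_domain: subdomain(physicalDomain) = {1..1, 1..pb.nb_cell_y};
>
> Because both boundary domains share physicalDomain's domain map, this also 
> sets up the shared-domain-map situation described in the next paragraph.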
>
> The one other thing motivating this suggestion is that Chapel has been 
> designed such that if a number of domains share the same domain map (as these 
> would), it gives the compiler more semantic information about the relative 
> alignment (and therefore, lack of need for communication) than if every 
> domain has a different domain map.  That said, this is a forward-looking 
> characterization in that we haven't implemented it yet in Chapel (or, you can 
> consider it backward-looking, as it is what we implemented in ZPL).
>
> -Brad
>
>
