Begin forwarded message:

> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:34:59 AM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at 
> mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges 
> at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier 
> <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
> 
> Hi Barry,
> Here are the STREAM benchmark results that Hongzhang Shan collected on Hopper 
> for Nick's COE studies.  The red curve shows performance when STREAM is run 
> with all of the data mapped to a single memory controller.  The blue curve 
> shows the case when the data is correctly mapped using first-touch, so that 
> each STREAM thread accesses data on its local memory controller (the correct 
> NUMA mapping).
> 
> The bottom line is that it is essential that data be touched first on the 
> memory controller nearest the OpenMP threads that will be accessing it 
> (otherwise memory bandwidth will tank).  This should occur naturally if you 
> configure the run as 4 NUMA nodes with 6 threads each, as per Nathan's 
> suggestion.  If we want to be more aggressive and use 24-way threaded 
> parallelism per node, then extra care must be taken to ensure the memory 
> affinity is not screwed up.
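> 
> (As a concrete illustration of what "touch first" means, here is a minimal 
> C/OpenMP sketch with hypothetical names n, x, y; the point is simply that the 
> initialization loop uses the same static schedule as the compute loop, so 
> each thread's pages land on its local memory controller:)
> 
>     #include <stddef.h>
> 
>     /* First-touch: initialize each array element from the thread that
>      * will later use it, so the OS places the page on that thread's
>      * local NUMA node.  malloc() alone does not commit pages, and a
>      * serial memset() would place everything on one controller. */
>     void first_touch_init(double *x, double *y, size_t n)
>     {
>         #pragma omp parallel for schedule(static)
>         for (size_t i = 0; i < n; i++) {
>             x[i] = 0.0;
>             y[i] = 0.0;
>         }
>     }
> 
>     /* Compute loop with the matching schedule: each thread now reads
>      * and writes NUMA-local memory instead of a remote controller. */
>     void compute_loop(double *x, const double *y, size_t n, double a)
>     {
>         #pragma omp parallel for schedule(static)
>         for (size_t i = 0; i < n; i++)
>             x[i] += a * y[i];
>     }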
> 
> -john
> 
> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>> Hi Robert, Barry and all,
>>> 
>>> Is it our assumption that the Poisson version of GTS will normally be run 
>>> with 1 MPI rank per die and 6 OpenMP threads (on AMD Magny-Cours)?
>> 
>>  Our new vector and matrix classes will allow the flexibility of any number 
>> of MPI processes and any number of threads under that. So 1 MPI rank and 6 
>> threads is supportable.
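>> 
>> (For illustration only, not PETSc code: a minimal hybrid MPI+OpenMP skeleton 
>> of the "1 rank per die, 6 threads under it" configuration might look like:)
>> 
>>     #include <mpi.h>
>>     #include <omp.h>
>>     #include <stdio.h>
>> 
>>     /* Hypothetical hybrid skeleton: one MPI rank per NUMA die, with
>>      * OMP_NUM_THREADS (e.g. 6 on Magny-Cours) OpenMP threads under it.
>>      * FUNNELED is sufficient when only the master thread calls MPI. */
>>     int main(int argc, char **argv)
>>     {
>>         int provided, rank;
>>         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> 
>>         #pragma omp parallel
>>         printf("rank %d: thread %d of %d\n", rank,
>>                omp_get_thread_num(), omp_get_num_threads());
>> 
>>         MPI_Finalize();
>>         return 0;
>>     }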
>> 
>>> In that case there should be sufficient bandwidth for decent scaling; I 
>>> would say something like Barry's Intel experience.  Barry is certainly 
>>> correct that as one uses more cores one becomes more bandwidth limited.
>> 
>>  I would be interested in seeing the OpenMP STREAM numbers for this system.
>>> 
>>> I also like John's comment: "we have little faith that the compiler will do 
>>> anything intelligent."  Which compiler are you using?  If you are using CCE 
>>> then you should get a .lst listing file to see what it is doing.  Probably 
>>> the only thing that can and should be done is to unroll the inner loop.
>> 
>> Do you folks provide threaded BLAS 1 operations, for example ddot, dscal, 
>> and daxpy? If so, we can piggy-back on those to get the best possible 
>> performance on the vector operations.
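>> 
>> (For concreteness, an OpenMP reduction gives the flavor of a threaded ddot; 
>> an illustrative sketch only, not PETSc code or a vendor routine:)
>> 
>>     #include <stddef.h>
>> 
>>     /* Threaded ddot: each thread accumulates a partial sum over its
>>      * (NUMA-local, if first-touched) slice of x and y. */
>>     double ddot_omp(size_t n, const double *x, const double *y)
>>     {
>>         double sum = 0.0;
>>         #pragma omp parallel for reduction(+:sum) schedule(static)
>>         for (size_t i = 0; i < n; i++)
>>             sum += x[i] * y[i];
>>         return sum;
>>     }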
>>> 
>>> Another consideration is the typical size of "n".  Normally, the denser the 
>>> matrix, the larger n is, no?  But still, it would be interesting to know.
>> 
>> In this application the matrix is extremely sparse, likely between 7 and 27 
>> nonzeros per row. Matrices, of course, can get as big as you like.
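>> 
>> (For a sense of why bandwidth dominates here: with only 7 to 27 nonzeros per 
>> row, a CSR matrix-vector product does just a handful of multiply-adds per 
>> row against all the bytes it streams in.  An illustrative sketch, not 
>> PETSc's MatMult:)
>> 
>>     #include <stddef.h>
>> 
>>     /* CSR sparse matrix-vector product y = A*x; with ~7-27 nonzeros
>>      * per row the arithmetic intensity is low, so the loop is limited
>>      * by memory bandwidth rather than by flops. */
>>     void csr_spmv(size_t nrows, const int *rowptr, const int *colind,
>>                   const double *val, const double *x, double *y)
>>     {
>>         #pragma omp parallel for schedule(static)
>>         for (size_t i = 0; i < nrows; i++) {
>>             double sum = 0.0;
>>             for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
>>                 sum += val[j] * x[colind[j]];
>>             y[i] = sum;
>>         }
>>     }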
>> 
>>  Barry
> 
[see attached file: PastedGraphic-1.pdf: STREAM bandwidth on Hopper with all data 
on a single memory controller (red) vs. first-touch NUMA-local placement (blue)]
Attachment: PastedGraphic-1.pdf (application/pdf, 30010 bytes):
<http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20110619/7e4caa75/attachment.pdf>
