Begin forwarded message:
> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:34:59 AM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at mcs.anl.gov>,
>     Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>,
>     Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>,
>     Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
>
> Hi Barry,
> Here are the STREAM benchmark results that Hongzhang Shan collected on Hopper
> for Nick's COE studies. The red curve shows performance when all of the data
> ends up mapped to a single memory controller. The blue curve shows the case
> when the data is correctly mapped using first-touch, so that the STREAM
> benchmark accesses data on its local memory controller (the correct NUMA
> mapping).
>
> The bottom line is that it is essential that data be touched first on the
> memory controller nearest the OpenMP threads that will be accessing it;
> otherwise memory bandwidth will tank. This should occur naturally if you
> configure the node as 4 NUMA nodes with 6 threads each, as per Nathan's
> suggestion. If we want to be more aggressive and use 24-way threaded
> parallelism per node, then extra care must be taken to ensure the memory
> affinity is not screwed up.
>
> -john
>
> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>
>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>
>>> Hi Robert, Barry and all,
>>>
>>> Is it our assumption that the Poisson version of GTS will normally be run
>>> with 1 MPI rank per die and 6 OpenMP threads (on AMD Magny-Cours)?
>>
>> Our new vector and matrix classes will allow the flexibility of any number
>> of MPI processes and any number of threads under that. So 1 MPI rank and 6
>> threads is supportable.
>>
>>> In that case there should be sufficient bandwidth for decent scaling; I
>>> would say something like Barry's Intel experience. Barry is certainly
>>> correct that as one uses more cores, one will be more bandwidth limited.
>>
>> I would be interested in seeing the OpenMP STREAM numbers for this system.
>>
>>> I also like John's comment: "we have little faith that the compiler will
>>> do anything intelligent." Which compiler are you using? If you are using
>>> CCE then you should get a .lst file to see what it is doing. Probably the
>>> only thing that can and should be done is to unroll the inner loop.
>>
>> Do you folks provide thread-based BLAS 1 operations, for example ddot,
>> dscal, daxpy? If so, we can piggyback on those to get the best possible
>> performance on the vector operations.
>>
>>> Another consideration is the typical size of "n". Normally the denser the
>>> matrix, the larger n is, no? But still, it would be interesting to know.
>>
>> In this application the matrix is extremely sparse, likely between 7 and 27
>> nonzeros per row. Matrices, of course, can get as big as you like.
>>
>> Barry
>
> [see attached file: PastedGraphic-1.pdf]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-1.pdf
Type: application/pdf
Size: 30010 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20110619/7e4caa75/attachment.pdf>
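A minimal C/OpenMP sketch of the first-touch pattern John describes (not taken
from the thread; the array name and length are hypothetical): initialize the
data with the same static loop schedule the compute loops will use, so each
page faults in on the memory controller local to the thread that will later
access it.

    /* Minimal first-touch sketch (hypothetical array and length, not code
       from GTS or PETSc).  Pages are physically allocated on the NUMA node
       of the thread that first writes them, so the initialization loop must
       use the same static schedule as the later compute loops. */
    #include <stdlib.h>

    int main(void)
    {
      const long n = 100000000;               /* hypothetical vector length */
      double *x = malloc(n * sizeof(double));

      /* first touch: each thread writes the pages it will later own */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) x[i] = 0.0;

      /* compute loops reuse the same schedule, so every thread stays on
         memory attached to its local controller */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) x[i] = 2.0 * x[i] + 1.0;

      free(x);
      return 0;
    }

With one MPI rank per die and 6 threads each, this placement happens naturally;
with 24 threads per node it only helps if the threads are also pinned so they do
not migrate away from the memory they touched.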

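On Barry's question about thread-based BLAS 1 operations, a rough sketch of what
OpenMP-threaded ddot and daxpy kernels could look like; these are illustrative
stand-ins (the names ddot_omp and daxpy_omp are hypothetical), not Cray LibSci
or PETSc routines.

    /* Illustrative OpenMP BLAS-1 kernels (hypothetical names; not LibSci,
       not PETSc).  The static schedule should match the schedule used when
       the vectors were first touched, so each thread reads and writes data
       on its local memory controller. */
    double ddot_omp(long n, const double *x, const double *y)
    {
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum) schedule(static)
      for (long i = 0; i < n; i++) sum += x[i] * y[i];
      return sum;
    }

    void daxpy_omp(long n, double alpha, const double *x, double *y)
    {
      /* y <- alpha*x + y */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) y[i] += alpha * x[i];
    }

The same first-touch discipline would apply to the CSR arrays used in the sparse
matrix-vector product Barry describes (7 to 27 nonzeros per row).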