On Sun, Jun 19, 2011 at 8:12 PM, Jed Brown <jed at 59a2.org> wrote:

> On Sun, Jun 19, 2011 at 21:44, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>> Here are the stream benchmark results that Hongzhang Shan collected on
>> Hopper for Nick's COE studies. The red curve shows performance when you
>> run stream and all of the data ends up mapped to a single memory
>> controller. The blue curve shows the case when you correctly map data
>> using first-touch so that the stream benchmark accesses data on its local
>> memory controller (the correct NUMA mapping).
>
> If I have it correct, each socket of this machine (2-socket 12-core) has 4
> DDR3-1333 memory buses, for a theoretical peak of 85 GB/s per node. That
> they get 50 GB/s is "good" by current standards.
>
>> The bottom line is that it is essential that data is touched first on
>> the memory controller that is nearest the OpenMP processes that will be
>> accessing it (otherwise memory bandwidth will tank). This should occur
>> naturally if you configure as 4 NUMA nodes with 6 threads each, as per
>> Nathan's suggestion. If we want to be more aggressive and use 24-way
>> threaded parallelism per node, then extra care must be taken to ensure
>> that the memory affinity is not screwed up.
>
> Note that this is when the memory is first touched, not when it is first
> allocated (allocation just sets a pointer, it doesn't find you memory or
> decide where it will come from). A memset() will not do this correctly at
> all.
>
> One tricky case is sparse matrix allocation if the number of nonzeros per
> row is not nearly constant (or random). Banded matrices are a worst-case
> scenario. In such cases, it is difficult to get both the matrix and the
> vector faulted more-or-less in the correct place.
>
> If you use MPI processes down to the die level (6 threads), then you don't
> have to put extra effort into memory affinity. This is not a bad solution
> right now, but my prediction for the future is that we'll still see a
> hierarchy with a similar granularity (a 2- to 8-way branching factor at
> each level). In that case, saying that all threads of a given process will
> have a flat (aside from cache, which is more local) memory model is about
> as progressive as ignoring threads entirely is today.
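Just to make the first-touch point concrete, here is a minimal sketch (plain C
with OpenMP; the file name, array size, and loops are illustrative only, and it
assumes the threads are actually pinned to cores by the launcher or the usual
OMP environment settings). Each thread initializes the block it will later
compute on, so the pages fault onto that thread's local memory controller; a
single-threaded memset() would instead pull every page onto one controller.

    /* first_touch.c: sketch of NUMA-aware first-touch initialization with
       OpenMP.  Build with something like "cc -fopenmp first_touch.c";
       names and sizes are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
      const size_t n = 1u << 26;                  /* ~0.5 GB of doubles */
      double *x = malloc(n * sizeof(double));
      if (!x) return 1;

      /* malloc() only reserves address space; each page is bound to a NUMA
         node the first time it is written.  Let every thread fault the block
         it will later work on, instead of memset()ing from one thread. */
    #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++) x[i] = 0.0;  /* first touch */

      /* Compute loops should use the same (static) schedule so each thread
         keeps operating on the pages it faulted locally. */
    #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++) x[i] = 2.0 * x[i];

      printf("x[0] = %g\n", x[0]);
      free(x);
      return 0;
    }

The detail that matters is that the initialization loop and the compute loops
partition the array the same way, so each thread stays on the pages it faulted.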
Isn't there an API for prescribing affinity, or is this too hard to do for
most things? It seems like we know what we want in matvec. (One existing
option is sketched below the signature.)

   Matt

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
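For what it's worth, on Linux there is an explicit placement API in libnuma
(and hwloc provides a portable layer over similar machinery). A rough sketch,
assuming libnuma is installed; the file name, node number, and buffer size are
made up for illustration:

    /* numa_place.c: sketch of explicit NUMA placement with libnuma.
       Link with -lnuma; the node number and size are illustrative. */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
      if (numa_available() < 0) {
        fprintf(stderr, "libnuma: no NUMA support on this system\n");
        return 1;
      }

      size_t nbytes = (size_t)1 << 20;
      /* Ask for memory physically located on NUMA node 0, independent of
         which thread touches it first. */
      double *x = numa_alloc_onnode(nbytes, 0);
      if (!x) return 1;

      /* ... use x from threads bound to node 0 ... */

      numa_free(x, nbytes);
      return 0;
    }

Even with an API like this, consistent first-touch as sketched earlier is often
enough for matvec, since the kernel's default policy already places a page on
the node of the thread that faults it.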
