On Fri, Jan 9, 2015 at 8:21 PM, Jed Brown <[email protected]> wrote:

> Barry Smith <[email protected]> writes:
>
> > Just say: "I support the pure MPI model on multicore systems
> > including KNL" if that is the case or say what you do support;
>
> I support that (with neighborhood collectives and perhaps
> MPI_Win_allocate_shared) if they provide a decent MPI implementation. I
> have yet to see a performance model showing why this can't perform at
> least as well as any MPI+thread combination.
>

I think what I liked about MPI is that there seemed to be a modicum of
competition. Isn't there someone in the MPI universe with a good
neighborhood collective implementation? If we can show good performance
with some MPI version, people will switch, or vendors will be shamed into
submission.

   Matt

> The threads might be easier for some existing applications to use. That
> could be important enough to justify work on threading, but it doesn't
> mean we should *advocate* threading.
>
> > Now what about "hardware threads" and pure MPI? Since Intel HSW
> > seems to have 2 (or more?) hardware threads per core should there
> > be 2 MPI process per core to utilize them both? Should the "extra"
> > hardware threads be ignored by us? (Maybe MPI implementation can
> > utilize them)? Or should we use two threads per MPI process (and
> > one MPI process per core) to utilize them? Or something else?
>
> Hard to say. Even for embarrassingly parallel operations, using
> multiple threads per core is not a slam dunk because you slice up all
> your caches. The main benefit of hardware threads is that you get more
> registers and can cover more latency from poor prefetch. Sharing cache
> between coordinated hardware threads is exotic and special-purpose, but
> a good last-step optimization. Can it be done nearly as well with
> MPI_Win_allocate_shared? Maybe; that has not been tested.
>
> > Back when we were actively developing the PETSc thread stuff you
> > supported using threads because with large domains
>
> Doesn't matter with large domains unless you are coordinating threads to
> share L1 cache.
>
> > due to fewer MPI processes there are (potentially) a lot less ghost
> > points needed.
>
> Surface-to-volume ratio is big for small subdomains. If you already
> share caches with another process/thread, it's lower overhead to access
> it directly instead of copying out into separate blocks with ghosts.
> This is the argument for using threads or MPI_Win_allocate_shared
> between hardware threads sharing L1. But if you don't stay coordinated,
> you're actually worse off because your working set is non-contiguous and
> doesn't line up with cache lines. This will lead to erratic performance
> as problem size/configuration is changed.
>
> To my knowledge, the vendors have not provided super low-overhead
> primitives for synchronizing between hardware threads that share a core.
> So for example, you still need memory fences to prevent reordering
> stores to occur after loads. But memory fences are expensive as the
> number of cores on the system goes up. John Gunnels coordinates threads
> in BQG-HPL using cooperative prefetch. That is basically a side-channel
> technique that is non-portable and if everything doesn't match up
> perfectly, you silently get bad performance.
>
> Once again, shared-nothing looks like a good default.
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
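
To put rough numbers on the surface-to-volume point quoted above, here is a
minimal sketch. It is not taken from PETSc or from anything in this thread.
It assumes a cubic subdomain of n*n*n owned points with a full halo of
width 1 (faces, edges, and corners, as a box stencil would need). The
function names ghost_points and ghost_ratio are made up for illustration.

```python
def ghost_points(n: int, width: int = 1) -> int:
    """Ghost points around an n*n*n block with a full halo of the given width.

    Assumption: the halo includes faces, edges, and corners, so the padded
    block is (n + 2*width)^3 and the ghosts are everything outside the
    owned n^3 interior.
    """
    return (n + 2 * width) ** 3 - n ** 3


def ghost_ratio(n: int, width: int = 1) -> float:
    """Ghost points per owned point for an n*n*n subdomain."""
    return ghost_points(n, width) / n ** 3


if __name__ == "__main__":
    # Small subdomains carry nearly as many ghosts as owned points;
    # the overhead shrinks roughly like 6/n as the subdomain grows.
    for n in (8, 16, 32, 64, 128):
        print(f"n={n:4d}  owned={n**3:9d}  "
              f"ghosts={ghost_points(n):7d}  ratio={ghost_ratio(n):.3f}")
```

Under these assumptions the ghost-to-owned ratio falls from about 0.95 at
n = 8 to about 0.20 at n = 32, which is the sense in which the ghost
overhead matters for small subdomains and mostly washes out for large ones.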
