Dave Nystrom <[email protected]> writes:

> When you say getting good performance with threads is hard, do you mean for
> complicated preconditioners like multigrid and incomplete factorization
> methods?  Or do you mean that it is hard to write a good cg solver with
> simple preconditioning methods like jacobi and block jacobi?

Define good.  If you mean "runs faster than well-designed MPI-only across
a range of parameters", then it's a tough challenge even for a bogus
algorithm like CG/Jacobi.

> You say the reason is because "MPI+OpenMP is a crappy programming model".
> What about MPI+Pthreads?

Same issues, but at least it's idiomatic to use lower level primitives
instead of the crappy ones OpenMP provides.  The message
packing/unpacking problems remain -- either poor bandwidth, poor
latency, or both -- unless you give up the idea that the parallelism is
contained within the public API (as opposed to the library being called
thread-collectively, a programming model that no other libraries use).

> > I cite HPGMG-FV as an example because Sam understands hardware well and
> > conceived that code from the ground up for threads, yet it executes
> > faster with MPI on most machines at all problem sizes.
> >
> > I posit that most examples of threads making a PDE solver faster are due
> > to poor use of MPI, poor choice of algorithm, or contrived
> > configuration.  I want to make the science and engineering that matters
> > faster, not check a box saying that we "do threads".
>
> Well, so do I - to your last sentence.  But is it really possible to run 300+
> MPI ranks on a single node as efficiently as running a single rank on a node
> plus 300+ threads - where the threads are pthreads or perhaps a special light
> weight thread?  That is an honest question, not a rhetorical one because I
> don't really know how light weight a vendor could make MPI ranks on a node
> versus threads on a node.  And it seems that the number of threads or MPI
> processes per node is going to continue to get larger and larger as the march
> to exascale continues.  A lot of people seem to think we will need MPI+X with
> MPI just being used between nodes.

A lot of people say whatever makes their product sell (be it a research
program, hardware, or software).  That doesn't make it correct or even
what they predict will happen.

Current NICs have hardware support for a number of software contexts.
Usually those contexts are message queues of some sort that the hardware
polls, so software does not have to lock or otherwise serialize to those
contexts.  Current implementations typically have one NIC context per
MPI process.  I think it would be difficult to have a semantically
correct MPI implementation that uses multiple NIC contexts per process.
Anyway, if you only use one context per node, you have to serialize all
"threads" in software.  That's slow as hell, especially on the new
throughput architectures that do everything badly, but are even worse at
synchronization latency.

So what else can be done to cut latency further or improve bandwidth?
Over-decomposition (one subdomain per hardware thread) results in more
messages or the need to coalesce.  By using fewer processes, we could
have threads work together to pack coalesced deduplicated buffers.  That
would be fewer messages and less bandwidth, so maybe it's a good idea.
But packing represents a sizable fraction of messaging cost, so doing it
in serial is non-scalable and if you use omp parallel to pack, you've
just incurred a latency cost much larger than MPI messaging latency.

  http://mid.mail-archive.com/[email protected]

If all you want is message coalescing, it's possible to do scalably in
software with neighborhood collectives.  The implementations don't do
this now, but they could at similar cost to the best generic threaded
implementations.  Deduplication would also be possible with "w" versions
and persistent neighborhood collectives (not currently part of the
standard), but the analysis is nontrivial.

So you want threads to pack deduplicated coalesced buffers in parallel,
but you can't afford to use omp parallel or omp barrier, so you need
thread-collective interfaces with fine-grained coordination that does
not use the crude OpenMP primitives.
But nobody writes libraries that way and it's not really what you're asking for. Private is a better default.
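
To make the thread-collective packing idea above concrete, here is a minimal sketch in C with pthreads and C11 atomics. It is not code from PETSc, HPGMG, or any MPI implementation. All names (PackCtx, PackArg, pack_worker, post_send, pack_coalesced) are hypothetical, and post_send is only a stand-in for whatever would actually post the message. Each thread packs its slice of one coalesced, deduplicated buffer, and the last thread to finish posts the send, so no barrier or parallel-region fork/join sits on the critical path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state for ONE coalesced message to ONE neighbor. */
typedef struct {
    const double *field;    /* node-shared source data                      */
    const int    *send_idx; /* deduplicated indices into field, length n    */
    double       *buf;      /* coalesced send buffer, length n              */
    size_t        n;
    int           nthreads;
    atomic_int    arrived;  /* number of threads that have finished packing */
    atomic_int    sent;     /* set to 1 by the single thread that "sends"   */
} PackCtx;

typedef struct { PackCtx *ctx; int tid; } PackArg;

/* Stand-in for posting the message (assumption: a real code would hand
 * ctx->buf to the messaging layer here). */
static void post_send(PackCtx *ctx) { atomic_store(&ctx->sent, 1); }

/* Thread-collective part: every thread calls this with its own tid. */
static void *pack_worker(void *p)
{
    PackArg *a = p;
    PackCtx *c = a->ctx;
    /* Contiguous slice of the buffer owned by this thread. */
    size_t lo = c->n * (size_t)a->tid / (size_t)c->nthreads;
    size_t hi = c->n * (size_t)(a->tid + 1) / (size_t)c->nthreads;
    for (size_t k = lo; k < hi; k++)
        c->buf[k] = c->field[c->send_idx[k]];
    /* Fine-grained coordination: one atomic increment per thread.  With
     * acq_rel ordering, the last arrival observes every other thread's
     * writes to buf before it posts the send. */
    if (atomic_fetch_add_explicit(&c->arrived, 1, memory_order_acq_rel)
            == c->nthreads - 1)
        post_send(c);
    return NULL;
}

/* Demo driver only: spawns the threads and joins them so the caller can
 * inspect the result.  In a thread-collective library call the threads
 * would already exist and would return to their own work instead.
 * Returns 0 on success, 1 if a thread could not be created. */
int pack_coalesced(PackCtx *c)
{
    enum { MAX_THREADS = 64 };
    pthread_t th[MAX_THREADS];
    PackArg   args[MAX_THREADS];
    int started = 0, rc = 0;

    assert(c->nthreads > 0 && c->nthreads <= MAX_THREADS);
    atomic_store(&c->arrived, 0);
    atomic_store(&c->sent, 0);
    for (; started < c->nthreads; started++) {
        args[started] = (PackArg){c, started};
        if (pthread_create(&th[started], NULL, pack_worker, &args[started]) != 0) {
            rc = 1;
            break;
        }
    }
    for (int t = 0; t < started; t++)
        pthread_join(th[t], NULL);
    return rc;
}
```

The point of the sketch is the coordination cost: a single atomic increment per thread, rather than an omp parallel region or omp barrier. Whether that wins in practice depends on the hardware's atomic and cache-coherence latency, which this sketch does not measure.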
