"Nystrom, William David" <[email protected]> writes:

> Well, I would really like to be able to do the experiment with PETSc - and I
> tried to do so back in the summer of 2013. But I encountered problems which
> I documented with the current PETSc threadcomm package trying a really
> simple problem with cg and jacobi preconditioning. And I don't believe those
> problems have been fixed. And I don't believe there is any intention of
> fixing them with the current threadcomm package. So I can't do any
> meaningful experiments with PETSc related to MPI+threads.

Dave, getting objectively good performance with threads is hard. A lot
of people try and fail, including Intel engineers trying to optimize
just one code. The reason is that MPI+OpenMP is a crappy programming
model, especially the way it is usually used (which puts absurdly
expensive things like "omp parallel" in the critical path). I don't
want to ship finicky crap that runs slower for most users, but the fact
is that the community does not know how to make MPI+OpenMP fast for
interesting problems. I cite HPGMG-FV as an example because Sam
understands hardware well and conceived that code from the ground up for
threads, yet it executes faster with MPI on most machines at all problem
sizes. I posit that most examples of threads making a PDE solver faster
are due to poor use of MPI, poor choice of algorithm, or contrived
configuration. I want to make the science and engineering that matters
faster, not check a box saying that we "do threads".

> Regarding HPGMG-FV, I never heard of it

It is the finite volume version of our multigrid benchmark.

https://hpgmg.org

> and have no idea whether it could be used in an ASC code to do the
> linear solves.

It's a benchmark, not a library. But it is representative of multigrid
solvers. If you can't make it run faster using threads, there's no
point trying to use threads in PETSc if you're most concerned about
solving real problems as fast as possible.

> I have also had some recent experience running a plasma simulation code
> called VPIC on Blue Gene Q with flat MPI and MPI+pthreads. When I run with
> MPI+pthreads on Blue Gene Q, VPIC is noticeably faster even though I can
> only run with 3 threads per rank but can run in flat MPI mode with 4 ranks
> per core.

This is an anecdote. If you can explain why, we can have a productive
conversation. Otherwise it's Just Run Shit® and not useful to inform a
versatile library. As I mentioned before, some apps may have made
decisions that make threads more _usable_ to them.
If that's the issue, let's have a conversation about usability, not
about solver performance. If solver performance is the first priority,
we need to understand the fundamental limitations of each choice.

> BTW, if you have references that document experiments comparing
> performance of flat MPI with MPI+threads, I would be happy to read them.

https://hpgmg.org/lists/archives/hpgmg-forum/2014-August/000091.html

Exploring Shared-memory Optimizations for an Unstructured Mesh CFD
Application on Modern Parallel Systems, Dheevatsa Mudigere, Srinivas
Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail
Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David
Keyes, IEEE International Parallel & Distributed Processing Symposium
(IPDPS), 2015, accepted for publication

https://www2.cisl.ucar.edu/sites/default/files/maynard_5a.pdf

https://www2.cisl.ucar.edu/sites/default/files/durachta_4.pdf
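
For readers unfamiliar with the remark above about "omp parallel" in
the critical path, here is a minimal sketch of the two placements. The
function names and the axpy-style kernel are hypothetical and chosen
only for illustration; nothing here is taken from PETSc, threadcomm, or
HPGMG. The first variant opens a parallel region on every pass of the
outer loop, while the second opens one region and only work-shares
inside it. How much the difference costs depends on the OpenMP runtime,
the hardware, and the problem size, so treat this as a picture of the
pattern rather than a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: apply y += a*x a total of `iters` times.
 * Variant A: a parallel region is opened (and closed) on every
 * iteration of the outer loop, so region entry/exit sits in the
 * critical path of each iteration. */
void axpy_region_per_iter(double *y, const double *x, double a,
                          size_t n, int iters)
{
    assert(y != NULL && x != NULL);
    for (int it = 0; it < iters; it++) {
        #pragma omp parallel for
        for (long i = 0; i < (long)n; i++)
            y[i] += a * x[i];
    }
}

/* Variant B: one parallel region encloses the whole iteration loop.
 * Every thread runs the outer loop; the inner loop is work-shared and
 * ends with the implicit barrier of "omp for". */
void axpy_single_region(double *y, const double *x, double a,
                        size_t n, int iters)
{
    assert(y != NULL && x != NULL);
    #pragma omp parallel
    {
        for (int it = 0; it < iters; it++) {
            #pragma omp for
            for (long i = 0; i < (long)n; i++)
                y[i] += a * x[i];
        }
    }
}
```

Both functions compute the same result and also compile as plain serial
C when OpenMP is not enabled, since the pragmas are then ignored. Timing
the two variants at small n with an OpenMP-enabled build is one way to
see whether region entry/exit matters on a given machine.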
