> On Jul 9, 2018, at 8:33 AM, Jeff Hammond <jeff.scie...@gmail.com> wrote:
>
> On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
> Richard,
>
> The problem is that OpenMP is too large and has too many different programming models embedded in it (and it will get worse) to "support OpenMP" from PETSc.
>
> This is also true of MPI. You can write CSP, BSP, PGAS, fork-join, agent-based, etc. in MPI. Just like MPI, you don't have to use all the features. PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or MPI_Neighborhood_alltoallv, does it?
>
> One way to use #pragma-based optimization tools (which is one way to treat OpenMP) is to run the application code on a realistic-size problem, using the number of threads per MPI process they prefer, with profiling, and begin adding #pragmas to the most time-consuming code fragments/routines, measuring the (small) improvement in performance as they are added. This is the way I would proceed. The branch generated will not have very many pragmas in it, so it would likely be acceptable for inclusion into PETSc. It would also give a quantitative measure of the possible performance of the #pragma approach.
>
> This is the textbook Wrong Way to write OpenMP and the reason that the thread-scalability of DOE applications using MPI+OpenMP sucks. It leads to codes that do fork-join far too often and suffer from death by Amdahl, unless you do a second pass where you fuse all the OpenMP regions and replace the serial regions between them with critical sections or similar.
>
> This isn't how you'd write MPI, is it? No, you'd figure out how to decompose your data properly to exploit locality and then implement an algorithm that minimizes communication and synchronization. Do that with OpenMP.
>
> Note that for BLAS 1 operations the correct thing to do is likely to turn on MKL BLAS threading (being careful to make sure the number of threads MKL uses matches that used by other parts of the code). This way we don't need to OpenMP-optimize many parts of PETSc's vector operations (norm, dot, scale, axpy). In fact, this is the first thing Mark should do: how much does it speed up the vector operations?
>
> BLAS1 operations are all memory-bound unless running out of cache (in which case one shouldn't use threads), and compilers do a great job with them. Just put the pragmas on and let the compiler do its job.
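As a concrete illustration of the "just put the pragmas on" point above, here is a minimal sketch of a BLAS1-style axpy with a loop-level OpenMP pragma (the function name and signature are made up for illustration and are not PETSc's actual kernel); being memory-bound, it stops scaling once the node's memory bandwidth is saturated:

    #include <stddef.h>

    /* Sketch only: a BLAS1-style axpy with a loop-level OpenMP pragma.
       Memory-bound, so thread speedup is limited by memory bandwidth. */
    void axpy(size_t n, double alpha, const double *restrict x,
              double *restrict y)
    {
    #pragma omp parallel for simd schedule(static)
      for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
    }

Compile with -fopenmp (or the vendor equivalent); without it the pragma is ignored and the loop runs serially. For the threaded-MKL route Barry mentions, matching thread counts in practice means keeping OMP_NUM_THREADS/MKL_NUM_THREADS (or a call such as mkl_set_num_threads()) consistent with whatever the rest of the code uses.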
   PETSc currently calls BLAS for these operations, so using threaded BLAS is a natural approach rather than needing to provide new handwritten kernels to add the #pragmas to.

   Barry

> The problem is how many ECP applications actually use OpenMP just as a #pragma optimization tool, or whether they use other features of OpenMP. For example, I remember Brian wanted to/did use OpenMP threads directly in BoxLib and didn't just stick to the #pragma model. If they did this then we would need a custom PETSc to match their model.
>
> If this implies that BoxLib will use omp-parallel and then use explicit threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write OpenMP.
>
> Unfortunately, the Right Way to use OpenMP makes it hard to use MPI unless you use MPI_THREAD_MULTIPLE and endpoints. ECP projects should be pushing the MPI folks harder to ratify and implement endpoints. I don't know if the proposal is even active right now, but that doesn't prevent DOE from compelling Open-MPI and MPICH to support it.
>
> To end on a positive note, OpenMP tasking is a relatively composable model and supports DAG-based parallelism. I suspect the initial results in a code like PETSc will be worse than with traditional implicit OpenMP (omp-for-simd on all the loops), but it eventually wins out because it doesn't require any unnecessary barriers and makes it much easier to fuse parallel regions.
>
> Jeff
>
> Barry
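For readers who have not seen it, a minimal sketch of the SPMD-style OpenMP that Jeff calls the Right Way above: one long-lived parallel region in which each thread computes its own rank and data range, much as an MPI rank would (the problem size and block decomposition here are invented for illustration):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
    #pragma omp parallel
      {
        int size = omp_get_num_threads();  /* plays the role of MPI_Comm_size */
        int rank = omp_get_thread_num();   /* plays the role of MPI_Comm_rank */

        /* Each thread owns a contiguous block, decided once up front,
           rather than re-partitioning at every loop via fork-join. */
        long n  = 1000000;
        long lo = n * rank / size;
        long hi = n * (rank + 1) / size;

        printf("thread %d of %d owns [%ld, %ld)\n", rank, size, lo, hi);

        /* ... thread-local work on [lo, hi), with barriers only where the
           algorithm genuinely needs synchronization ... */
    #pragma omp barrier
      }
      return 0;
    }

The point of the style is that decomposition and synchronization are explicit, as in MPI, instead of being implied by many short fork-join regions.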
> > On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <rtmi...@anl.gov> wrote:
> >
> > True, Barry. But, unfortunately, I think Jed's argument has something to it, because the hybrid MPI + OpenMP model has become so popular. I know of a few codes where adopting this model makes some sense, though I believe that, more often, the model has been adopted simply because it is the fashionable thing to do. Regardless of good or bad reasons for its adoption, I do have some real concern that codes that use this model have a difficult time using PETSc effectively because of the lack of thread support. Like many of us, I had hoped that endpoints would make it into the MPI standard and this would provide a reasonable mechanism for integrating PETSc with codes using MPI+threads, but progress on this seems to have stagnated. I hope that the MPI endpoints effort eventually goes somewhere, but what can we do in the meantime? Within the DOE ECP program, the MPI+threads approach is being pushed really hard, and many of the ECP subprojects have adopted it. I think it's mostly idiotic, but I think it's too late to turn the tide and convince most people that pure MPI is the way to go. Meanwhile, my understanding is that we need to be able to support more of the ECP application projects to justify the substantial funding we are getting from the program. Many of these projects are dead-set on using OpenMP. (I note that I believe that the folks Mark is trying to help with PETSc and OpenMP are people affiliated with Carl Steefel's ECP subsurface project.)
> >
> > Since it looks like MPI endpoints are going to be a long time (or possibly forever) in coming, I think we need (a) stopgap plan(s) to support this crappy MPI + OpenMP model in the meantime. One possible approach is to do what Mark is trying to do with MKL: use a third-party library that provides optimized OpenMP implementations of computationally expensive kernels. It might make sense to also consider using Karl's ViennaCL library in this manner, which we already use to support GPUs, but which I believe (Karl, please let me know if I am off-base here) we could also use to provide OpenMP-ized linear algebra operations on CPUs as well. Such approaches won't use threads for lots of the things that a PETSc code will do, but might be able to provide decent resource utilization for the most expensive parts of some codes.
> >
> > Clever ideas from anyone on this list about how to use an adequate number of MPI ranks for PETSc while using only a subset of these ranks for the MPI+OpenMP application code will be appreciated, though I don't know if there are any good solutions.
> >
> > --Richard
> >
> > On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> > Jed,
> >
> > You could use your same argument to argue PETSc should do "something" to help people who have (rightly or wrongly) chosen to code their application in High Performance Fortran or any other similarly inane parallel programming model.
> >
> > Barry
> >
> > > On Jul 4, 2018, at 11:51 PM, Jed Brown <j...@jedbrown.org> wrote:
> > >
> > > Matthew Knepley <knep...@gmail.com> writes:
> > >
> > >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <jeff.scie...@gmail.com> wrote:
> > >>
> > >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <knep...@gmail.com> wrote:
> > >>>
> > >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <jeff.scie...@gmail.com> wrote:
> > >>>>
> > >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <mfad...@lbl.gov> wrote:
> > >>>>>
> > >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <rtmi...@anl.gov> wrote:
> > >>>>>>
> > >>>>>>> Hi Mark,
> > >>>>>>>
> > >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you are the first person trying to actually use it, so we are probably going to expose some bugs and also some performance issues. My somewhat limited testing has shown that the MKL sparse routines often perform worse than our own implementations in PETSc.
> > >>>>>>
> > >>>>>> My users just want OpenMP.
> > >>>>>
> > >>>>> Why not just add OpenMP to PETSc? I know certain developers hate it, but it is silly to let a principled objection stand in the way of enabling users
> > >>>>
> > >>>> "if that would deliver the best performance for NERSC users."
> > >>>>
> > >>>> You have answered your own question.
> > >>>
> > >>> Please share the results of your experiments that prove OpenMP does not improve performance for Mark’s users.
> > >>
> > >> Oh God. I am supremely uninterested in minutely proving yet again that OpenMP is not better than MPI. There are already countless experiments. One more will not add anything of merit.
> > >
> > > Jeff assumes an absurd null hypothesis, Matt selfishly believes that users should modify their code/execution environment to subscribe to a more robust and equally performant approach, and the MPI forum abdicates by stalling on endpoints. How do we resolve this?
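Returning for a moment to Richard's question above about using only a subset of ranks for the MPI+OpenMP application code: one conceivable (and untested) arrangement is sketched below, purely to make the question concrete. It runs flat MPI across MPI_COMM_WORLD but splits off an application sub-communicator with one rank per node, which then fans out into OpenMP threads; ranks_per_node is an invented parameter here, not a real PETSc or MPI option.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
      int provided, world_rank;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      int ranks_per_node = 16;  /* assumption: ranks per node == OMP_NUM_THREADS */
      int is_app_rank    = (world_rank % ranks_per_node == 0);

      /* Application sub-communicator: one rank per node; everyone else
         gets MPI_COMM_NULL and only takes part in the flat-MPI solves. */
      MPI_Comm app_comm;
      MPI_Comm_split(MPI_COMM_WORLD, is_app_rank ? 0 : MPI_UNDEFINED,
                     world_rank, &app_comm);

      if (is_app_rank) {
        #pragma omp parallel
        {
          /* MPI+OpenMP application work on app_comm, threaded within the node */
        }
      }
      /* All ranks of MPI_COMM_WORLD would still be available for PETSc calls. */

      MPI_Finalize();
      return 0;
    }

Whether the remaining ranks on each node can actually be kept busy during the application phase is exactly the part for which, as Richard says, there may be no good solution.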
> > >>>> Also we are not in the habit of fucking up our codebase in order to follow some fad.
> > >>>
> > >>> If you can’t use OpenMP without messing up your code base, you probably don’t know how to design software.
> > >>
> > >> That is an interesting, if wrong, opinion. It might be your contention that sticking any random paradigm in a library should be alright if it's "well designed"? I have never encountered such a well-designed library.
> > >>
> > >>> I guess if you refuse to use _Pragma because C99 is still a fad for you, it is harder, but clearly _Complex is tolerated.
> > >>
> > >> Yes, littering your code with preprocessor directives improves almost everything. Doing proper resource management using pragmas, in an environment with several layers of libraries, is a dream.
> > >>
> > >>> More seriously, you’ve adopted OpenMP hidden behind MKL,
> > >>
> > >> Nope. We can use MKL with that crap shut off.
> > >>
> > >>> so I see no reason why you can’t wrap OpenMP implementations of the PETSc sparse kernels in a similar manner.
> > >>
> > >> We could; it's just a colossal waste of time and effort, as well as counterproductive for the codebase :)
> > >
> > > Endpoints either need to become a thing we can depend on or we need a solution for users that insist on using threads (even if their decision to use threads is objectively bad). The problem Matt harps on is legitimate: OpenMP parallel regions cannot reliably cross module boundaries except for embarrassingly parallel operations. This means loop-level omp parallel, which significantly increases overhead for small problem sizes (e.g., slowing coarse-grid solves and strong-scaling limits). It can be done and isn't that hard, but the Imperial group discarded their branch after observing that it also provided no performance benefit. However, I'm coming around to the idea that PETSc should do it so that there is _a_ solution for users that insist on using threads in a particular way. Unless Endpoints become available and reliable, in which case we could do it right.
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
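A closing footnote on the fork-join point Jeff makes near the top and Jed returns to just above: the overhead complaint is about opening a fresh parallel region around every loop. A minimal sketch of the alternative, fusing several worksharing loops into one region and dropping the barrier where consecutive loops are independent (function and array names invented for illustration):

    /* One parallel region, three worksharing loops: threads are created once,
       and a barrier is removed where the adjacent loops touch disjoint data. */
    void fused_kernels(int n, double *x, double *y, double *z, double a)
    {
    #pragma omp parallel
      {
    #pragma omp for nowait  /* loops 1 and 2 touch disjoint arrays: no barrier */
        for (int i = 0; i < n; i++) x[i] = a * x[i];

    #pragma omp for         /* implicit barrier: the loop below reads x and y */
        for (int i = 0; i < n; i++) y[i] = y[i] + a;

    #pragma omp for
        for (int i = 0; i < n; i++) z[i] = x[i] + y[i];
      }
    }

Writing three separate "#pragma omp parallel for" loops instead gives the same answers but pays the region start-up and barrier cost three times, which is what hurts at coarse-grid and strong-scaling problem sizes.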