> On Jul 9, 2018, at 8:33 AM, Jeff Hammond <[email protected]> wrote:
>
>
>
> On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <[email protected]> wrote:
>
> Richard,
>
> The problem is that OpenMP is too large and has too many different
> programming models embedded in it (and it will get worse) to "support OpenMP"
> from PETSc.
>
> This is also true of MPI. You can write CSP, BSP, PGAS, fork-join,
> agent-based, etc. in MPI. Just like MPI, you don't have to use all the
> features. PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or
> MPI_Neighbor_alltoallv, does it?
>
> One way to use #pragma based optimization tools (which is one way to
> treat OpenMP) is to run the application code on a realistically sized problem,
> using the number of threads per MPI process the users prefer, with profiling, and
> begin adding #pragmas to the most time-consuming code fragments/routines, measuring
> the (small) improvement in performance as they are added. This is the way I
> would proceed. The branch generated will not have very many pragmas in it so
> would likely be acceptable to be included into PETSc. It would also give a
> quantitative measure of the possible performance with the #pragma approach.
>
> This is the textbook Wrong Way to write OpenMP and the reason that the
> thread-scalability of DOE applications using MPI+OpenMP sucks. It leads to
> codes that do fork-join far too often and suffer from death by Amdahl, unless
> you do a second pass where you fuse all the OpenMP regions and replace the
> serial regions between them with critical sections or similar.
>
> This isn't how you'd write MPI, is it? No, you'd figure out how to decompose
> your data properly to exploit locality and then implement an algorithm that
> minimizes communication and synchronization. Do that with OpenMP.
>
> Note that for BLAS 1 operations, likely the correct thing to do is turn on
> MKL BLAS threading (being careful to make sure the number of threads MKL uses
> matches that used by other parts of the code). This way we don't need to
> OpenMP optimize many parts of PETSc's vector operations (norm, dot, scale,
> axpy). In fact, this is the first thing Mark should do: how much does it
> speed up the vector operations?
>
> BLAS1 operations are all memory-bound unless running out of cache (in which
> case one shouldn't use threads) and compilers do a great job with them. Just
> put the pragmas on and let the compiler do its job.
PETSc currently calls BLAS for these operations, so using threaded BLAS is a
natural approach rather than needing to provide new handwritten kernels to add
the #pragmas to.
Barry
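
For reference, a minimal sketch of what the quoted "just put the pragmas on"
approach would look like for two BLAS 1 kernels. The function names axpy and
dot are illustrative only and are not PETSc or BLAS entry points; the code is
an assumption about how such handwritten kernels might be written, not
something taken from PETSc. Without OpenMP enabled the pragmas are ignored and
the loops run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative kernel (not a PETSc routine): y <- a*x + y. */
void axpy(long n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    /* One fork-join per call; the compiler vectorizes the loop body. */
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

/* Illustrative kernel (not a PETSc routine): returns sum_i x[i]*y[i]. */
double dot(long n, const double *x, const double *y)
{
    double sum = 0.0;
    assert(n == 0 || (x != NULL && y != NULL));
    /* Each thread accumulates a private partial sum, combined at the end. */
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Each call opens and closes its own parallel region, which is exactly the
fork-join cost discussed elsewhere in this thread.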
>
> The problem is how many ECP applications actually use OpenMP just as a
> #pragma optimization tool, and how many use other features of OpenMP. For
> example I remember Brian wanted to/did use OpenMP threads directly in BoxLib
> and didn't just stick to the #pragma model. If they did this then we would
> need custom PETSc to match their model.
>
> If this implies that BoxLib will use omp-parallel and then use explicit
> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and
> omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write OpenMP.
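
A minimal sketch of the explicit-threading style described above, assuming a
simple contiguous block decomposition. The helper block_range and the function
axpy_then_norm2sq are hypothetical names chosen for illustration; nothing here
is taken from BoxLib or PETSc. A single parallel region is opened, each thread
derives its own index range from omp_get_thread_num and omp_get_num_threads
(playing the role of rank and size), and two kernels run back to back on the
thread's own block without a barrier between them.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: contiguous block [*lo, *hi) owned by `rank` of `size`. */
static void block_range(long n, int rank, int size, long *lo, long *hi)
{
    long base = n / size;
    long rem = n % size;
    assert(size > 0 && rank >= 0 && rank < size);
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

/* y <- a*x + y, then return y.y, all inside one parallel region. */
double axpy_then_norm2sq(long n, double a, const double *x, double *y)
{
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        int rank = 0, size = 1; /* serial fallback when OpenMP is disabled */
        long lo, hi;
#ifdef _OPENMP
        rank = omp_get_thread_num();  /* analogous to MPI_Comm_rank */
        size = omp_get_num_threads(); /* analogous to MPI_Comm_size */
#endif
        block_range(n, rank, size, &lo, &hi);
        /* Both loops touch only this thread's block, so no barrier is needed. */
        for (long i = lo; i < hi; i++) y[i] += a * x[i];
        for (long i = lo; i < hi; i++) sum += y[i] * y[i];
    }
    return sum;
}
```

The only synchronization is the reduction at the end of the region, in
contrast to one fork-join per loop.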
>
> Unfortunately, the Right Way to use OpenMP makes it hard to use MPI unless
> you use MPI_THREAD_MULTIPLE and endpoints. ECP projects should be pushing
> the MPI folks harder to ratify and implement endpoints. I don't know if the
> proposal is even active right now, but that doesn't prevent DOE from
> compelling Open-MPI and MPICH to support it.
>
> To end on a positive note, OpenMP tasking is a relatively composable model
> and supports DAG-based parallelism. I suspect the initial results in a code
> like PETSc will be worse than with traditional implicit OpenMP (omp-for-simd
> on all the loops) but it eventually wins out because it doesn't require any
> unnecessary barriers and makes it much easier to fuse parallel regions.
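
A minimal sketch of the DAG-style tasking mentioned above, assuming a toy
dependency graph: two independent tasks each compute a sum of squares, and a
third task that depends on both combines them. The function name
dag_sum_of_norms is hypothetical and the example is not drawn from PETSc. The
depend clauses express the ordering, so no explicit barrier separates the
tasks; without OpenMP enabled the pragmas are ignored and the code runs
serially with the same result.

```c
#include <assert.h>

/* Hypothetical example: returns x.x + y.y using three dependent tasks. */
double dag_sum_of_norms(long n, const double *x, const double *y)
{
    double xx = 0.0, yy = 0.0, result = 0.0;
    assert(n >= 0);
    #pragma omp parallel
    #pragma omp single
    {
        /* Task 1: produces xx. */
        #pragma omp task shared(xx) depend(out: xx)
        for (long i = 0; i < n; i++) xx += x[i] * x[i];

        /* Task 2: produces yy, independent of task 1. */
        #pragma omp task shared(yy) depend(out: yy)
        for (long i = 0; i < n; i++) yy += y[i] * y[i];

        /* Task 3: runs only after both producers have finished. */
        #pragma omp task shared(xx, yy, result) depend(in: xx, yy) depend(out: result)
        result = xx + yy;
    }
    /* The implicit barrier closing the region guarantees all tasks are done. */
    return result;
}
```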
>
> Jeff
>
>
> Barry
>
>
> > On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <[email protected]> wrote:
> >
> > True, Barry. But, unfortunately, I think Jed's argument has something to it
> > because the hybrid MPI + OpenMP model has become so popular. I know of a
> > few codes where adopting this model makes some sense, though I believe
> > that, more often, the model has been adopted simply because it is the
> > fashionable thing to do. Regardless of good or bad reasons for its
> > adoption, I do have some real concern that codes that use this model have a
> > difficult time using PETSc effectively because of the lack of thread
> > support. Like many of us, I had hoped that endpoints would make it into the
> > MPI standard and this would provide a reasonable mechanism for integrating
> > PETSc with codes using MPI+threads, but progress on this seems to have
> > stagnated. I hope that the MPI endpoints effort eventually goes somewhere,
> > but what can we do in the meantime? Within the DOE ECP program, the
> > MPI+threads approach is being pushed really hard, and many of the ECP
> > subprojects have adopted it. I think it's mostly idiotic, but I think it's
> > too late to turn the tide and convince most people that pure MPI is the way
> > to go. Meanwhile, my understanding is that we need to be able to support
> > more of the ECP application projects to justify the substantial funding we
> > are getting from the program. Many of these projects are dead-set on using
> > OpenMP. (I note that I believe that the folks Mark is trying to help with
> > PETSc and OpenMP are people affiliated with Carl Steefel's ECP subsurface
> > project.)
> >
> > Since it looks like MPI endpoints are going to be a long time (or possibly
> > forever) in coming, I think we need (a) stopgap plan(s) to support this
> > crappy MPI + OpenMP model in the meantime. One possible approach is to do
> > what Mark is trying to do with MKL: Use a third party library that
> > provides optimized OpenMP implementations of computationally expensive
> > kernels. It might make sense to also consider using Karl's ViennaCL library
> > in this manner, which we already use to support GPUs, but which I believe
> > (Karl, please let me know if I am off-base here) we could also use to
> > provide OpenMP-ized linear algebra operations on CPUs as well. Such
> > approaches won't use threads for lots of the things that a PETSc code will
> > do, but might be able to provide decent resource utilization for the most
> > expensive parts for some codes.
> >
> > Clever ideas from anyone on this list about how to use an adequate number
> > of MPI ranks for PETSc while using only a subset of these ranks for the
> > MPI+OpenMP application code will be appreciated, though I don't know if
> > there are any good solutions.
> >
> > --Richard
> >
> > On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <[email protected]> wrote:
> >
> > Jed,
> >
> > You could use your same argument to argue PETSc should do "something"
> > to help people who have (rightly or wrongly) chosen to code their
> > application in High Performance Fortran or any other similar inane parallel
> > programming model.
> >
> > Barry
> >
> >
> >
> > > On Jul 4, 2018, at 11:51 PM, Jed Brown <[email protected]> wrote:
> > >
> > > Matthew Knepley <[email protected]> writes:
> > >
> > >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <[email protected]>
> > >> wrote:
> > >>
> > >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <[email protected]>
> > >>> wrote:
> > >>>
> > >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <[email protected]> wrote:
> > >>>>>
> > >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <[email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Mark,
> > >>>>>>>
> > >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you are the
> > >>>>>>> first person trying to actually use it, so we are probably going to
> > >>>>>>> expose
> > >>>>>>> some bugs and also some performance issues. My somewhat limited
> > >>>>>>> testing has
> > >>>>>>> shown that the MKL sparse routines often perform worse than our own
> > >>>>>>> implementations in PETSc.
> > >>>>>>>
> > >>>>>>
> > >>>>>> My users just want OpenMP.
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> Why not just add OpenMP to PETSc? I know certain developers hate it,
> > >>>>> but
> > >>>>> it is silly to let a principled objection stand in the way of
> > >>>>> enabling users
> > >>>>>
> > >>>>
> > >>>> "if that would deliver the best performance for NERSC users."
> > >>>>
> > >>>> You have answered your own question.
> > >>>>
> > >>>
> > >>> Please share the results of your experiments that prove OpenMP does not
> > >>> improve performance for Mark’s users.
> > >>>
> > >>
> > >> Oh God. I am supremely uninterested in minutely proving yet again that
> > >> OpenMP is not better than MPI.
> > >> There are already countless experiments. One more will not add anything
> > >> of
> > >> merit.
> > >
> > > Jeff assumes an absurd null hypothesis, Matt selfishly believes that
> > > users should modify their code/execution environment to subscribe to a
> > > more robust and equally performant approach, and the MPI forum abdicates
> > > by stalling on endpoints. How do we resolve this?
> > >
> > >>> Also we are not in the habit of fucking up our codebase in order to
> > >>> follow
> > >>>> some fad.
> > >>>>
> > >>>
> > >>> If you can’t use OpenMP without messing up your code base, you probably
> > >>> don’t know how to design software.
> > >>>
> > >>
> > >> That is an interesting, if wrong, opinion. It might be your contention
> > >> that
> > >> sticking any random paradigm in a library should
> > >> be alright if it's "well designed"? I have never encountered such a
> > >> well-designed library.
> > >>
> > >>
> > >>> I guess if you refuse to use _Pragma because C99 is still a fad for you,
> > >>> it is harder, but clearly _Complex is tolerated.
> > >>>
> > >>
> > >> Yes, littering your code with preprocessor directives improves almost
> > >> everything. Doing proper resource management
> > >> using Pragmas, in an environment with several layers of libraries, is a
> > >> dream.
> > >>
> > >>
> > >>> More seriously, you’ve adopted OpenMP hidden behind MKL
> > >>>
> > >>
> > >> Nope. We can use MKL with that crap shut off.
> > >>
> > >>
> > >>> so I see no reason why you can’t wrap OpenMP implementations of the
> > >>> PETSc
> > >>> sparse kernels in a similar manner.
> > >>>
> > >>
> > >> We could, it's just a colossal waste of time and effort, as well as
> > >> counterproductive for the codebase :)
> > >
> > > Endpoints either need to become a thing we can depend on or we need a
> > > solution for users that insist on using threads (even if their decision
> > > to use threads is objectively bad). The problem Matt harps on is
> > > legitimate: OpenMP parallel regions cannot reliably cross module
> > > boundaries except for embarrassingly parallel operations. This means
> > > loop-level omp parallel, which significantly increases overhead for small
> > > problem sizes (e.g., slowing coarse grid solves and strong scaling
> > > limits). It can be done and isn't that hard, but the Imperial group
> > > discarded their branch after observing that it also provided no
> > > performance benefit. However, I'm coming around to the idea that PETSc
> > > should do it so that there is _a_ solution for users that insist on
> > > using threads in a particular way. Unless Endpoints become available
> > > and reliable, in which case we could do it right.
> >
> >
>
>
>
>
> --
> Jeff Hammond
> [email protected]
> http://jeffhammond.github.io/