> On Jul 9, 2018, at 8:33 AM, Jeff Hammond <jeff.scie...@gmail.com> wrote:
> 
> 
> 
> On Fri, Jul 6, 2018 at 4:28 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> 
>   Richard,
> 
>     The problem is that OpenMP is too large, and has too many different 
> programming models embedded in it (and it will get worse), for PETSc to 
> "support OpenMP" wholesale.
> 
> This is also true of MPI.  You can write CSP, BSP, PGAS, fork-join, 
> agent-based, etc. in MPI.  Just like MPI, you don't have to use all the 
> features.  PETSc doesn't use MPI_Comm_spawn, MPI_Rget_accumulate, or 
> MPI_Neighbor_alltoallv, does it?
>  
>     One way to use #pragma-based optimization tools (which is one way to 
> treat OpenMP) is to run the application code on a realistic problem size, 
> using the number of threads per MPI process the users prefer, profile it, and 
> begin adding #pragmas to the most time-consuming code fragments/routines, 
> measuring the (small) improvement in performance as each is added. This is 
> the way I would proceed. The resulting branch will not have very many pragmas 
> in it, so it would likely be acceptable for inclusion in PETSc. It would also 
> give a quantitative measure of the performance attainable with the #pragma 
> approach.
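> 
>     As a minimal sketch of that incremental approach (the waxpy-style kernel 
> below is hypothetical, not actual PETSc source), the only change to a hot 
> routine identified by the profile is the added pragma:
> 
>     /* Hypothetical hot loop found by profiling; the patch is just the pragma,
>        so it stays small and easy to review. */
>     void waxpy(int n, double alpha, const double *x, const double *y, double *w)
>     {
>     #pragma omp parallel for
>       for (int i = 0; i < n; i++) w[i] = alpha*x[i] + y[i];
>     }
> 
>     Rerunning the same profiled case with, e.g., OMP_NUM_THREADS=1,2,4 then 
> gives the quantitative before/after numbers.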
> 
> This is the textbook Wrong Way to write OpenMP and the reason that the 
> thread-scalability of DOE applications using MPI+OpenMP sucks.  It leads to 
> codes that do fork-join far too often and suffer from death by Amdahl, unless 
> you do a second pass where you fuse all the OpenMP regions and replace the 
> serial regions between them with critical sections or similar.
> 
> This isn't how you'd write MPI, is it?  No, you'd figure out how to decompose 
> your data properly to exploit locality and then implement an algorithm that 
> minimizes communication and synchronization.  Do that with OpenMP.
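> 
> A rough sketch of the contrast (hypothetical kernels, nobody's actual code): 
> rather than forking and joining around every loop, keep one parallel region 
> alive across a whole phase and handle the serial step inside it:
> 
>     /* Hypothetical AXPY followed by a squared norm, written as one fused
>        parallel region instead of two fork-joins with serial code in between. */
>     double axpy_then_sqnorm(int n, double a, const double *x, double *y)
>     {
>       double sum;
>     #pragma omp parallel
>       {
>     #pragma omp for
>         for (int i = 0; i < n; i++) y[i] = a*x[i] + y[i];
>     #pragma omp single
>         sum = 0.0;                /* serial step; implicit barrier follows */
>     #pragma omp for reduction(+:sum)
>         for (int i = 0; i < n; i++) sum += y[i]*y[i];
>       }
>       return sum;
>     }
> 
> The point is that threads are created once per phase, not once per loop.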
>  
>    Note that for BLAS 1 operations the correct thing to do is likely to turn 
> on MKL BLAS threading (being careful to make sure the number of threads MKL 
> uses matches the number used by other parts of the code). This way we don't 
> need to OpenMP-optimize many of PETSc's vector operations (norm, dot, scale, 
> axpy). In fact, this is the first thing Mark should do: how much does it 
> speed up the vector operations?
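> 
>    A minimal sketch of keeping those thread counts in sync (assuming MKL is 
> linked with its threaded layer; mkl_set_num_threads() and 
> omp_get_max_threads() are the standard MKL/OpenMP calls for this):
> 
>     #include <mkl.h>
>     #include <omp.h>
> 
>     /* Make MKL's threaded BLAS use the same number of threads as the rest
>        of the application's OpenMP regions. */
>     static void sync_mkl_threads(void)
>     {
>       mkl_set_num_threads(omp_get_max_threads());
>     }
> 
>    Equivalently, one can set MKL_NUM_THREADS to match OMP_NUM_THREADS in the 
> environment; linking the sequential MKL layer instead gives no threading here 
> at all.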
> 
> BLAS1 operations are all memory-bound unless the data fits in cache (in which 
> case one shouldn't use threads anyway), and compilers do a great job with 
> them.  Just put the pragmas on and let the compiler do its job.

    PETSc currently calls BLAS for these operations, so using a threaded BLAS is 
a natural approach; it avoids having to provide new handwritten kernels to add 
the #pragmas to.

   Barry

>  
>   The question is how many ECP applications actually use OpenMP just as a 
> #pragma optimization tool, and how many use other features of OpenMP. For 
> example, I remember Brian wanted to (and did) use OpenMP threads directly in 
> BoxLib and didn't just stick to the #pragma model. If they did this, then we 
> would need a custom PETSc to match their model.
> 
> If this implies that BoxLib will use omp-parallel and then use explicit 
> threading in a manner similar to MPI (omp_get_num_threads=MPI_Comm_size and 
> omp_get_thread_num=MPI_Comm_rank), then this is the Right Way to write OpenMP.
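> 
> A minimal sketch of that SPMD style (hypothetical code, not BoxLib's): one 
> long-lived parallel region in which each thread owns a contiguous block of 
> the data, much as an MPI rank owns a subdomain:
> 
>     #include <omp.h>
> 
>     void spmd_scale(int n, double *x)
>     {
>     #pragma omp parallel
>       {
>         int nt    = omp_get_num_threads();   /* analogue of MPI_Comm_size */
>         int tid   = omp_get_thread_num();    /* analogue of MPI_Comm_rank */
>         int chunk = (n + nt - 1) / nt;       /* size of my contiguous block */
>         int lo    = tid * chunk;
>         int hi    = (lo + chunk < n) ? lo + chunk : n;
>         for (int i = lo; i < hi; i++) x[i] *= 2.0;   /* touch only my block */
>       }
>     }
> 
> Making MPI calls from inside such a region is exactly what runs into the 
> MPI_THREAD_MULTIPLE/endpoints issue below.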
> 
> Unfortunately, the Right Way to use OpenMP makes it hard to use MPI unless 
> you use MPI_THREAD_MULTIPLE and endpoints.  ECP projects should be pushing 
> the MPI folks harder to ratify and implement endpoints.  I don't know if the 
> proposal is even active right now, but that doesn't prevent DOE from 
> compelling Open MPI and MPICH to support it.
> 
> To end on a positive note, OpenMP tasking is a relatively composable model 
> and supports DAG-based parallelism.  I suspect the initial results in a code 
> like PETSc will be worse than with traditional implicit OpenMP (omp-for-simd 
> on all the loops), but it eventually wins out because it doesn't require any 
> unnecessary barriers and makes it much easier to fuse parallel regions.
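> 
> A small sketch of what that looks like (hypothetical phases, not PETSc code): 
> tasks declare their data dependences and the runtime builds the DAG, so b and 
> c can run concurrently once a is ready, and nothing waits at a global barrier:
> 
>     #include <stdio.h>
> 
>     int main(void)
>     {
>       double a, b, c, d;
>     #pragma omp parallel
>     #pragma omp single                 /* one thread creates the tasks */
>       {
>     #pragma omp task depend(out: a)
>         a = 1.0;                       /* "setup" phase */
>     #pragma omp task depend(in: a) depend(out: b)
>         b = 2.0*a;                     /* needs only a */
>     #pragma omp task depend(in: a) depend(out: c)
>         c = 3.0*a;                     /* needs only a; may run alongside b */
>     #pragma omp task depend(in: b) depend(in: c) depend(out: d)
>         d = b + c;                     /* waits on b and c, not on a barrier */
>       }                                /* tasks complete at the region's end */
>       printf("d = %g\n", d);
>       return 0;
>     }
> 
> The depend clauses are what let the runtime drop the barriers that the 
> loop-by-loop omp-for version needs.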
> 
> Jeff
>  
> 
>   Barry
> 
> 
> > On Jul 6, 2018, at 3:07 PM, Mills, Richard Tran <rtmi...@anl.gov> wrote:
> > 
> > True, Barry. But, unfortunately, I think Jed's argument has something to it 
> > because the hybrid MPI + OpenMP model has become so popular. I know of a 
> > few codes where adopting this model makes some sense, though I believe 
> > that, more often, the model has been adopted simply because it is the 
> > fashionable thing to do. Regardless of good or bad reasons for its 
> > adoption, I do have some real concern that codes that use this model have a 
> > difficult time using PETSc effectively because of the lack of thread 
> > support. Like many of us, I had hoped that endpoints would make it into the 
> > MPI standard and this would provide a reasonable mechanism for integrating 
> > PETSc with codes using MPI+threads, but progress on this seems to have 
> > stagnated. I hope that the MPI endpoints effort eventually goes somewhere, 
> > but what can we do in the meantime? Within the DOE ECP program, the 
> > MPI+threads approach is being pushed really hard, and many of the ECP 
> > subprojects have adopted it. I think it's mostly idiotic, but I think it's 
> > too late to turn the tide and convince most people that pure MPI is the way 
> > to go. Meanwhile, my understanding is that we need to be able to support 
> > more of the ECP application projects to justify the substantial funding we 
> > are getting from the program. Many of these projects are dead-set on using 
> > OpenMP. (I believe the folks Mark is trying to help with PETSc and OpenMP 
> > are affiliated with Carl Steefel's ECP subsurface project.)
> > 
> > Since it looks like MPI endpoints are going to be a long time (or possibly 
> > forever) in coming, I think we need (a) stopgap plan(s) to support this 
> > crappy MPI + OpenMP model in the meantime. One possible approach is to do 
> > what Mark is trying to do with MKL: use a third-party library that 
> > provides optimized OpenMP implementations of computationally expensive 
> > kernels. It might also make sense to use Karl's ViennaCL library in this 
> > manner; we already use it to support GPUs, and I believe (Karl, please let 
> > me know if I am off base here) we could also use it to provide OpenMP-ized 
> > linear algebra operations on CPUs. Such 
> > approaches won't use threads for lots of the things that a PETSc code will 
> > do, but might be able to provide decent resource utilization for the most 
> > expensive parts for some codes.
> > 
> > Clever ideas from anyone on this list about how to use an adequate number 
> > of MPI ranks for PETSc while using only a subset of these ranks for the 
> > MPI+OpenMP application code will be appreciated, though I don't know if 
> > there are any good solutions.
> > 
> > --Richard
> > 
> > On Wed, Jul 4, 2018 at 11:38 PM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > 
> >    Jed,
> > 
> >      You could use the same argument to argue that PETSc should do 
> > "something" to help people who have (rightly or wrongly) chosen to code 
> > their application in High Performance Fortran or any other similarly inane 
> > parallel programming model.
> > 
> >    Barry
> > 
> > 
> > 
> > > On Jul 4, 2018, at 11:51 PM, Jed Brown <j...@jedbrown.org> wrote:
> > > 
> > > Matthew Knepley <knep...@gmail.com> writes:
> > > 
> > >> On Wed, Jul 4, 2018 at 4:51 PM Jeff Hammond <jeff.scie...@gmail.com> 
> > >> wrote:
> > >> 
> > >>> On Wed, Jul 4, 2018 at 6:31 AM Matthew Knepley <knep...@gmail.com> 
> > >>> wrote:
> > >>> 
> > >>>> On Tue, Jul 3, 2018 at 10:32 PM Jeff Hammond <jeff.scie...@gmail.com>
> > >>>> wrote:
> > >>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> On Tue, Jul 3, 2018 at 4:35 PM Mark Adams <mfad...@lbl.gov> wrote:
> > >>>>> 
> > >>>>>> On Tue, Jul 3, 2018 at 1:00 PM Richard Tran Mills <rtmi...@anl.gov>
> > >>>>>> wrote:
> > >>>>>> 
> > >>>>>>> Hi Mark,
> > >>>>>>> 
> > >>>>>>> I'm glad to see you trying out the AIJMKL stuff. I think you are the
> > >>>>>>> first person trying to actually use it, so we are probably going to 
> > >>>>>>> expose
> > >>>>>>> some bugs and also some performance issues. My somewhat limited 
> > >>>>>>> testing has
> > >>>>>>> shown that the MKL sparse routines often perform worse than our own
> > >>>>>>> implementations in PETSc.
> > >>>>>>> 
> > >>>>>> 
> > >>>>>> My users just want OpenMP.
> > >>>>>> 
> > >>>>>> 
> > >>>>> 
> > >>>>> Why not just add OpenMP to PETSc? I know certain developers hate it, 
> > >>>>> but
> > >>>>> it is silly to let a principled objection stand in the way of 
> > >>>>> enabling users
> > >>>>> 
> > >>>> 
> > >>>> "if that would deliver the best performance for NERSC users."
> > >>>> 
> > >>>> You have answered your own question.
> > >>>> 
> > >>> 
> > >>> Please share the results of your experiments that prove OpenMP does not
> > >>> improve performance for Mark’s users.
> > >>> 
> > >> 
> > >> Oh God. I am supremely uninterested in minutely proving yet again that
> > >> OpenMP is not better than MPI.
> > >> There are already countless experiments. One more will not add anything 
> > >> of
> > >> merit.
> > > 
> > > Jeff assumes an absurd null hypothesis, Matt selfishly believes that
> > > users should modify their code/execution environment to subscribe to a
> > > more robust and equally performant approach, and the MPI forum abdicates
> > > by stalling on endpoints.  How do we resolve this?
> > > 
> > >>>> Also we are not in the habit of fucking up our codebase in order to 
> > >>>> follow some fad.
> > >>>> 
> > >>> 
> > >>> If you can’t use OpenMP without messing up your code base, you probably
> > >>> don’t know how to design software.
> > >>> 
> > >> 
> > >> That is an interesting, if wrong, opinion. It might be your contention 
> > >> that
> > >> sticking any random paradigm in a library should
> > >> be alright if it's "well designed"? I have never encountered such a
> > >> well-designed library.
> > >> 
> > >> 
> > >>> I guess if you refuse to use _Pragma because C99 is still a fad for you,
> > >>> it is harder, but clearly _Complex is tolerated.
> > >>> 
> > >> 
> > >> Yes, littering your code with preprocessor directives improves almost
> > >> everything. Doing proper resource management
> > >> using Pragmas, in an environment with several layers of libraries, is a
> > >> dream.
> > >> 
> > >> 
> > >>> More seriously, you’ve adopted OpenMP hidden behind MKL
> > >>> 
> > >> 
> > >> Nope. We can use MKL with that crap shut off.
> > >> 
> > >> 
> > >>> so I see no reason why you can’t wrap OpenMP implementations of the 
> > >>> PETSc
> > >>> sparse kernels in a similar manner.
> > >>> 
> > >> 
> > >> We could, it's just a colossal waste of time and effort, as well as
> > >> counterproductive for the codebase :)
> > > 
> > > Endpoints either need to become a thing we can depend on or we need a
> > > solution for users that insist on using threads (even if their decision
> > > to use threads is objectively bad).  The problem Matt harps on is
> > > legitimate: OpenMP parallel regions cannot reliably cross module
> > > boundaries except for embarrassingly parallel operations.  This means
> > > loop-level omp parallel, which significantly increases overhead for small
> > > problem sizes (e.g., slowing coarse grid solves and strong scaling
> > > limits).  It can be done and isn't that hard, but the Imperial group
> > > discarded their branch after observing that it also provided no
> > > performance benefit.  However, I'm coming around to the idea that PETSc
> > > should do it so that there is _a_ solution for users that insist on
> > > using threads in a particular way.  Unless Endpoints become available
> > > and reliable, in which case we could do it right.
> > 
> > 
> 
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
