MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI implementation have you been using? Is the behavior different with a different implementation?
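If you want to check whether probe cost on your installed MPI depends on the size of the communicator, a micro-benchmark along the lines of the sketch below might help. It is only illustrative and not something I have run on your machine; the 4-rank split mirrors your coarse-space setup, and the iteration count and output are made up:

/* iprobe_bench.c: time MPI_Iprobe on MPI_COMM_WORLD vs. a 4-rank
 * sub-communicator.  Illustrative sketch only, not code from MUMPS. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm   subcomm;
  MPI_Status status;
  int        rank, flag, i;
  const int  niter = 100000;
  double     t0, t_world, t_sub;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Group ranks in blocks of 4, like the local coarse-space solvers. */
  MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &subcomm);

  /* Probe on the global communicator (where MUMPS probes). */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < niter; i++)
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
  t_world = MPI_Wtime() - t0;

  /* Probe on the small sub-communicator for comparison. */
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < niter; i++)
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, subcomm, &flag, &status);
  t_sub = MPI_Wtime() - t0;

  if (rank == 0)
    printf("iprobe: world %.3e s, subcomm %.3e s (%d iterations)\n",
           t_world, t_sub, niter);

  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}

If the MPI_COMM_WORLD timing grows with the total number of ranks while the sub-communicator timing stays flat, that would point at the probe behavior rather than at the factorization itself.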
On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <thomas.witkowski at tu-dresden.de> wrote:

> Okay, I did a similar benchmark now with PETSc's event logging:
>
> UMFPACK
> 16p:  Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63 0 0 0 52  63 0 0 0 51   0
> 64p:  Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60 0 0 0 52  60 0 0 0 51   0
> 256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49 0 0 0 52  49 0 0 0 51   1
>
> MUMPS
> 16p:  Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75 0 0 0 52  75 0 0 0 51   0
> 64p:  Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78 0 0 0 52  78 0 0 0 51   0
> 256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82 0 0 0 52  82 0 0 0 51   0
>
> As you can see, the local solves with UMFPACK take nearly constant time as the number of subdomains increases, which is what I expect. When I replace UMFPACK with MUMPS, the time for the local solves grows. In the %T columns, UMFPACK's value decreases from 63 to 49, while MUMPS's increases from 75 to 82. What does this mean?
>
> Thomas
>
> On 21.12.2012 02:19, Matthew Knepley wrote:
>
>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski <Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I cannot use the information from log_summary, as I have three different LU factorizations and solves (local matrices and two hierarchies of coarse grids). Therefore, I use the following workaround to get the timing of the solve I'm interested in:
>>
>> You misunderstand how to use logging. You just put these things in separate stages. Stages represent parts of the code over which events are aggregated.
>>
>>    Matt
>>
>>> MPI::COMM_WORLD.Barrier();
>>> wtime = MPI::Wtime();
>>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>
>>> The factorization is done explicitly beforehand with KSPSetUp(), so I can measure the time for the LU factorization. It also does not scale! For 64 cores it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all calculations, the local coarse space matrices defined on four cores have exactly the same number of rows and exactly the same number of nonzero entries. So, from my point of view, the time should be absolutely constant.
>>>
>>> Thomas
>>>
>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>
>>>> Are you timing ONLY the time to factor and solve the subproblems? Or also the time to get the data to the collection of 4 cores at a time?
>>>>
>>>> If you are only using LU for these problems and not elsewhere in the code, you can get the factorization and solve time from MatLUFactor() and MatSolve(), or you can use stages to put this calculation in its own stage and use the MatLUFactor() and MatSolve() time from that stage. Also look at the load-balancing column for the factorization and solve stage: is it well balanced?
>>>>
>>>>    Barry
>>>>
>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which are defined on only a subset of all MPI tasks, typically between 4 and 64 tasks.
>>>>> The MatAIJ and the KSP objects are both defined on an MPI communicator which is a subset of MPI::COMM_WORLD. The LU factorization of the matrices is computed with either MUMPS or superlu_dist, but both show a scaling behavior I really wonder about: when the overall problem size is increased, the solves with the LU factorizations of the local matrices do not scale! But why not? I just increase the number of local matrices, and all of them are independent of each other. An example: I use 64 cores, each coarse matrix is spanned by 4 cores, so there are 16 MPI communicators with 16 coarse space matrices. The problem requires 192 solves with the coarse space systems, and together these take 0.09 seconds. Now I increase the number of cores to 256, but let the local coarse space again be defined on only 4 cores. Again, 192 solves with these coarse spaces are required, but now they take 0.24 seconds. The same for 1024 cores, and we are at 1.7 seconds for the local coarse space solves!
>>>>>
>>>>> For me, this is a total mystery! Any idea how to explain, debug, and eventually resolve this problem?
>>>>>
>>>>> Thomas
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
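P.S. In case it is useful, the stage-based logging that Matt and Barry describe above would look roughly like the sketch below. The function name, stage name, and arguments are placeholders for whatever your FETI-DP code actually uses; only PetscLogStageRegister/Push/Pop and KSPSolve are real PETSc calls.

/* Sketch of per-stage logging for the local coarse solves; the names
 * here are placeholders, not code from this thread. */
#include <petscksp.h>

PetscErrorCode local_coarse_solve(KSP ksp_coarse, Vec rhs, Vec sol)
{
  static PetscLogStage stage = -1;
  PetscErrorCode       ierr;

  PetscFunctionBegin;
  if (stage < 0) {
    ierr = PetscLogStageRegister("LocalCoarseSolve", &stage);CHKERRQ(ierr);
  }
  /* Everything between Push and Pop is aggregated into this stage. */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp_coarse, rhs, sol);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Run with -log_summary and the MatLUFactor and MatSolve events for the coarse problems then appear under their own stage, including the load-balance ratios Barry mentions.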
