Has anyone of you tried to reproduce this problem?

Thomas
On 21.12.2012 22:05, Thomas Witkowski wrote:
> So, here it is. Just compile and run with
>
>   mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist -log_summary
>
> 64 cores: 0.09 seconds for solving
> 1024 cores: 2.6 seconds for solving
>
> Thomas
>
> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>
>> Can you reproduce this in a simpler environment so that we can report it? As I understand your statement, it sounds like you could reproduce it by changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm of size 4 and then using that everywhere, then comparing log_summary running on 4 cores to running on more (despite everything really being independent).
>>
>> It would also be worth using an MPI profiler to see if it's really spending a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use MPI_Iprobe, it may be something else.
>>
>> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski <Thomas.Witkowski at tu-dresden.de> wrote:
>>
>>> I use a modified MPICH version. On the system I use for these benchmarks I cannot use another MPI library.
>>>
>>> I am not fixed on MUMPS. SuperLU_DIST, for example, also works perfectly for this. But there is still the following problem I cannot solve: when I increase the number of coarse space matrices, there seems to be no direct solver that scales for this. Just to summarize:
>>> - each coarse space matrix is always created by one "cluster" consisting of four subdomains/MPI tasks
>>> - the four tasks are always local to one node, so no inter-node network communication is required for the factorization and the solves
>>> - independent of the number of clusters, the coarse space matrices are the same: same number of rows, same nnz structure, possibly different values
>>> - there is NO load imbalance
>>> - the matrices must be factorized and there are a lot of solves (> 100) with them
>>>
>>> It should be pretty clear that computing the LU factorization and solving with it should scale perfectly. But at the moment, none of the direct solvers I tried (MUMPS, SuperLU_DIST, PaStiX) is able to scale. The loss of scaling is really bad, as you can see from the numbers I sent before.
>>>
>>> Any ideas? Suggestions? Without a scalable solver for this kind of system, my multilevel FETI-DP code is more or less a joke, only some orders of magnitude slower than the standard FETI-DP method :)
>>>
>>> Thomas
>>>
>>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>>
>>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI implementation have you been using? Is the behavior different with a different implementation?
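On Jed's profiler suggestion: if no MPI profiling tool is available on that system, a small PMPI interposition layer is already enough to see how much time ends up in MPI_Iprobe. The following is only a sketch; the file name, the output format, and the assumption that it can simply be compiled and linked into the executable ahead of the MPI library are not part of the original setup.

  /* iprobe_count.c -- minimal PMPI wrapper that counts and times MPI_Iprobe.
     Compile it and link it into the application ahead of the MPI library. */
  #include <mpi.h>
  #include <stdio.h>

  static double iprobe_time  = 0.0;
  static long   iprobe_calls = 0;

  /* Intercept MPI_Iprobe and forward to the real implementation. */
  int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
  {
    double t0  = MPI_Wtime();
    int    err = PMPI_Iprobe(source, tag, comm, flag, status);
    iprobe_time  += MPI_Wtime() - t0;
    iprobe_calls += 1;
    return err;
  }

  /* Print per-rank statistics just before MPI shuts down. */
  int MPI_Finalize(void)
  {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] MPI_Iprobe: %ld calls, %.3f s total\n",
           rank, iprobe_calls, iprobe_time);
    return PMPI_Finalize();
  }

Running both the MUMPS and the SuperLU_DIST case with such a wrapper would show directly whether the hard-coded MPI_Iprobe on MPI_COMM_WORLD is where the extra time goes.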
>>>>
>>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski <thomas.witkowski at tu-dresden.de> wrote:
>>>>
>>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>>
>>>>> UMFPACK
>>>>>  16p: Local solve  350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51  0
>>>>>  64p: Local solve  350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51  0
>>>>> 256p: Local solve  350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51  1
>>>>>
>>>>> MUMPS
>>>>>  16p: Local solve  350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51  0
>>>>>  64p: Local solve  350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51  0
>>>>> 256p: Local solve  350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51  0
>>>>>
>>>>> As you can see, the local solves with UMFPACK take nearly constant time with an increasing number of subdomains. This is what I expect. Then I replace UMFPACK by MUMPS and I see increasing time for the local solves. In the last columns, UMFPACK's value decreases from 63 to 49, while MUMPS's increases from 75 to 82. What does this mean?
>>>>>
>>>>> Thomas
>>>>>
>>>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>>>
>>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>>
>>>>>>> I cannot use the information from log_summary, as I have three different LU factorizations and solves (local matrices and two hierarchies of coarse grids). Therefore, I use the following workaround to get the timing of the solve I am interested in:
>>>>>>
>>>>>> You misunderstand how to use logging. You just put these things in separate stages. Stages represent parts of the code over which events are aggregated.
>>>>>>
>>>>>>    Matt
>>>>>>
>>>>>>>   MPI::COMM_WORLD.Barrier();
>>>>>>>   wtime = MPI::Wtime();
>>>>>>>   KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>>>   FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>>
>>>>>>> The factorization is done explicitly before with "KSPSetUp", so I can measure the time for the LU factorization. It also does not scale! For 64 cores it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all calculations, the local coarse space matrices defined on four cores have exactly the same number of rows and exactly the same number of nonzero entries. So, from my point of view, the time should be absolutely constant.
>>>>>>>
>>>>>>> Thomas
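A minimal sketch of what the separate logging stages suggested above could look like for this coarse solve: the function name, the stage names, and the nsolves parameter are placeholders, and ksp stands for something like data->ksp_schur_primal_local from the snippet above.

  /* Sketch: put the coarse factorization and the repeated coarse solves into
     their own logging stages, so -log_summary reports them separately. */
  #include <petscksp.h>

  PetscErrorCode CoarseSolveWithStages(KSP ksp, Vec rhs, Vec sol, PetscInt nsolves)
  {
    PetscLogStage  stage_fact, stage_solve;   /* register once in a real code */
    PetscErrorCode ierr;
    PetscInt       i;

    PetscFunctionBegin;
    ierr = PetscLogStageRegister("CoarseFactor", &stage_fact);CHKERRQ(ierr);
    ierr = PetscLogStageRegister("CoarseSolve",  &stage_solve);CHKERRQ(ierr);

    /* The factorization (done inside KSPSetUp for PCLU) lands in its own
       stage, so its MatLUFactor* entries can be read directly. */
    ierr = PetscLogStagePush(stage_fact);CHKERRQ(ierr);
    ierr = KSPSetUp(ksp);CHKERRQ(ierr);
    ierr = PetscLogStagePop();CHKERRQ(ierr);

    /* All coarse solves are aggregated in a second stage. */
    ierr = PetscLogStagePush(stage_solve);CHKERRQ(ierr);
    for (i = 0; i < nsolves; i++) {
      ierr = KSPSolve(ksp, rhs, sol);CHKERRQ(ierr);
    }
    ierr = PetscLogStagePop();CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

With that in place, the MatLUFactorNum and MatSolve rows of the two stages replace the hand-rolled MPI::Wtime() timing and also show the load-balance ratio Barry asks about below.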
>>>>>>>
>>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>>
>>>>>>>> Are you timing ONLY the time to factor and solve the subproblems? Or also the time to get the data to the collection of 4 cores at a time?
>>>>>>>>
>>>>>>>> If you are only using LU for these problems and not elsewhere in the code, you can get the factorization and solve time from MatLUFactor() and MatSolve(), or you can use stages to put this calculation in its own stage and use the MatLUFactor() and MatSolve() time from that stage. Also look at the load balancing column for the factorization and solve stage; is it well balanced?
>>>>>>>>
>>>>>>>>   Barry
>>>>>>>>
>>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>>
>>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which are defined on only a subset of all MPI tasks, typically between 4 and 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI communicator that is a subset of MPI::COMM_WORLD. The LU factorization of the matrices is computed with either MUMPS or SuperLU_DIST, but both show a scaling behavior I really wonder about: when the overall problem size is increased, the solves with the LU factorizations of the local matrices do not scale! But why not? I just increase the number of local matrices, and all of them are independent of each other. An example: I use 64 cores, and each coarse matrix is spanned by 4 cores, so there are 16 MPI communicators with 16 coarse space matrices. The problem requires 192 solves with the coarse space systems, and together these take 0.09 seconds. Now I increase the number of cores to 256, but the local coarse spaces are again defined on only 4 cores. Again, 192 solves with these coarse spaces are required, but now they take 0.24 seconds. The same for 1024 cores, and we are at 1.7 seconds for the local coarse space solves!
>>>>>>>>>
>>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug, and eventually resolve this problem?
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>    -- Norbert Wiener
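For completeness, the reproduction Jed suggested near the top of the thread (ex10-style test with everything built on a subcommunicator of size 4) could look roughly like the sketch below. It uses the PETSc calling sequence of that time (four-argument KSPSetOperators, PCFactorSetMatSolverPackage); the test matrix (a simple 1D Laplacian), its size, and all names are placeholders rather than the actual ex10 modification.

  /* Sketch: split MPI_COMM_WORLD into clusters of 4 ranks and put an exact-LU
     KSP (SuperLU_DIST) on each subcommunicator, then factor once and solve
     many times, mimicking the coarse space solves described above. */
  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    MPI_Comm       subcomm;
    Mat            A;
    Vec            x, b;
    KSP            ksp;
    PC             pc;
    PetscMPIInt    rank;
    PetscInt       i, n = 1000, Istart, Iend;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
    ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

    /* Every 4 consecutive ranks form one independent "cluster". */
    ierr = MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, &subcomm);CHKERRQ(ierr);

    /* A small tridiagonal test matrix per cluster, identical on all clusters. */
    ierr = MatCreateAIJ(subcomm, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 3, NULL, &A);CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
    for (i = Istart; i < Iend; i++) {
      if (i > 0)   { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
      if (i < n-1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
      ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatGetVecs(A, &x, &b);CHKERRQ(ierr);
    ierr = VecSet(b, 1.0);CHKERRQ(ierr);

    /* Direct solve on the subcommunicator only. */
    ierr = KSPCreate(subcomm, &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
    ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
    ierr = KSPSetUp(ksp);CHKERRQ(ierr);      /* factorization */
    for (i = 0; i < 192; i++) {              /* many repeated solves */
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    }

    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&b);CHKERRQ(ierr);
    ierr = MPI_Comm_free(&subcomm);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return 0;
  }

Run on 64 and on 1024 cores with -log_summary, the KSPSolve and MatSolve times of this test should stay constant if the subcommunicator solves are truly independent; any growth would point at the solver library or the MPI implementation rather than at the FETI-DP code itself.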
