On Fri, Dec 21, 2012 at 10:51 AM, Thomas Witkowski <Thomas.Witkowski at tu-dresden.de> wrote:

> I use a modified MPICH version. On the system I use for these benchmarks I
> cannot use another MPI library.
>
> I'm not fixed on MUMPS. Superlu_dist, for example, also works perfectly for
> this. But there is still the following problem I cannot solve: when I
> increase the number of coarse space matrices, no direct solver seems to
> scale for this. Just to summarize:
> - one coarse space matrix is always created by one "cluster" consisting of
>   four subdomains/MPI tasks
> - the four tasks are always local to one node, thus inter-node network
>   communication is not required for computing the factorization and the solves
> - independent of the number of clusters, the coarse space matrices are the
>   same: they have the same number of rows and the same nnz structure, but
>   possibly different values
> - there is NO load imbalance
> - the matrices must be factorized and there are a lot of solves (> 100)
>   with them

So the numbers you have below for UMFPACK are using one matrix per MPI rank
instead of one matrix per 4 ranks? There seem to be two obvious sources of
bugs:

1) Your parallel solver is not using just the comm with 4 ranks

2) These ranks are not clustered together on one node for that comm

   Matt
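A minimal sketch of how those two points could be checked, assuming the clusters are meant to be four consecutive world ranks (color = rank/4); the code is illustrative and not taken from Thomas's application:

/* Build one communicator per cluster of four consecutive ranks and verify
 * that all members of a cluster report the same processor (node) name.
 * Assumption: clusters are formed as worldRank / 4. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
  char     name[MPI_MAX_PROCESSOR_NAME], names[4][MPI_MAX_PROCESSOR_NAME];
  int      worldRank, subRank, subSize, len, i, sameNode = 1;
  MPI_Comm subComm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

  /* one sub-communicator per cluster of 4 consecutive ranks */
  MPI_Comm_split(MPI_COMM_WORLD, worldRank / 4, worldRank, &subComm);
  MPI_Comm_rank(subComm, &subRank);
  MPI_Comm_size(subComm, &subSize);

  /* gather the node names inside the cluster and compare them */
  MPI_Get_processor_name(name, &len);
  MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, subComm);
  for (i = 0; i < subSize; i++)
    if (strcmp(names[0], names[i])) sameNode = 0;

  if (subRank == 0 && (subSize != 4 || !sameNode))
    printf("cluster of world rank %d is not a node-local group of 4 (size %d)\n",
           worldRank, subSize);

  MPI_Comm_free(&subComm);
  MPI_Finalize();
  return 0;
}

The same sub-communicator would then be the one passed to MatCreate()/KSPCreate() for the coarse problem, so the factorization and solves never involve ranks outside the cluster.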
> It should be pretty clear that computing an LU factorization and solving
> with it should scale perfectly. But at the moment, all direct solvers I
> tried (MUMPS, superlu_dist, pastix) are not able to scale. The loss of
> scaling is really bad, as you can see from the numbers I sent before.
>
> Any ideas? Suggestions? Without a scaling solver for this kind of system,
> my multilevel FETI-DP code is more or less a joke, only some orders of
> magnitude slower than the standard FETI-DP method :)
>
> Thomas
>
> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>
>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>> implementation have you been using? Is the behavior different with a
>> different implementation?
>>
>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski
>> <thomas.witkowski at tu-dresden.de> wrote:
>>
>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>
>>> UMFPACK
>>> 16p:  Local solve 350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>>> 64p:  Local solve 350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
>>> 256p: Local solve 350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>>>
>>> MUMPS
>>> 16p:  Local solve 350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>>> 64p:  Local solve 350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
>>> 256p: Local solve 350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>>>
>>> As you see, the local solves with UMFPACK take nearly constant time with
>>> an increasing number of subdomains. This is what I expect. Then I replace
>>> UMFPACK by MUMPS and I see increasing time for the local solves. In the
>>> last columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's
>>> column increases here from 75 to 82. What does this mean?
>>>
>>> Thomas
>>>
>>> On 21.12.2012 02:19, Matthew Knepley wrote:
>>>
>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>
>>>>> I cannot use the information from log_summary, as I have three
>>>>> different LU factorizations and solves (local matrices and two
>>>>> hierarchies of coarse grids). Therefore, I use the following workaround
>>>>> to get the timing of the solve I'm interested in:
>>>>
>>>> You misunderstand how to use logging. You just put these things in
>>>> separate stages. Stages represent parts of the code over which events
>>>> are aggregated.
>>>>
>>>>    Matt
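A minimal sketch of the staged logging being suggested here; the function name, stage names, KSP object, and loop are illustrative, not taken from the thread, and error checking (CHKERRQ) is omitted for brevity:

/* Wrap the coarse factorization and the repeated coarse solves in their own
 * logging stages so -log_summary reports them separately from the rest of
 * the code. */
#include <petscksp.h>

void coarseFactorAndSolve(KSP kspCoarse, PetscInt nCoarseSolves, Vec *rhs, Vec *sol)
{
  PetscLogStage stageFactor, stageSolve;
  PetscInt      i;

  PetscLogStageRegister("CoarseFactor", &stageFactor);
  PetscLogStageRegister("CoarseSolve",  &stageSolve);

  /* the LU factorization is triggered by KSPSetUp() */
  PetscLogStagePush(stageFactor);
  KSPSetUp(kspCoarse);
  PetscLogStagePop();

  /* the repeated coarse space solves */
  PetscLogStagePush(stageSolve);
  for (i = 0; i < nCoarseSolves; i++) KSPSolve(kspCoarse, rhs[i], sol[i]);
  PetscLogStagePop();
}

Each stage then appears as its own section in the -log_summary output, including the load-balance columns Barry refers to below.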
>>>>> MPI::COMM_WORLD.Barrier();
>>>>> wtime = MPI::Wtime();
>>>>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>
>>>>> The factorization is done explicitly before with "KSPSetUp", so I can
>>>>> measure the time for the LU factorization. It also does not scale! For
>>>>> 64 cores, it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>>>>> calculations, the local coarse space matrices defined on four cores have
>>>>> exactly the same number of rows and exactly the same number of non-zero
>>>>> entries. So, from my point of view, the time should be absolutely
>>>>> constant.
>>>>>
>>>>> Thomas
>>>>>
>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>
>>>>>> Are you timing ONLY the time to factor and solve the subproblems? Or
>>>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>>>
>>>>>> If you are only using LU for these problems and not elsewhere in the
>>>>>> code, you can get the factorization and solve time from MatLUFactor()
>>>>>> and MatSolve(), or you can use stages to put this calculation in its
>>>>>> own stage and use the MatLUFactor() and MatSolve() time from that
>>>>>> stage. Also look at the load balancing column for the factorization
>>>>>> and solve stage: is it well balanced?
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>
>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>>>>> communicator which is a subset of MPI::COMM_WORLD. The LU factorization
>>>>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>>>>> show a scaling behavior I really wonder about: when the overall problem
>>>>>>> size is increased, the solve with the LU factorization of the local
>>>>>>> matrices does not scale! But why not? I just increase the number of
>>>>>>> local matrices, but all of them are independent of each other. An
>>>>>>> example: I use 64 cores, each coarse matrix is spanned by 4 cores, so
>>>>>>> there are 16 MPI communicators with 16 coarse space matrices. The coarse
>>>>>>> space systems need to be solved 192 times, and together this takes 0.09
>>>>>>> seconds. Now I increase the number of cores to 256, but let the local
>>>>>>> coarse space again be defined on only 4 cores. Again, 192 solves with
>>>>>>> these coarse spaces are required, but now this takes 0.24 seconds. The
>>>>>>> same for 1024 cores, and we are at 1.7 seconds for the local coarse
>>>>>>> space solves!
>>>>>>>
>>>>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>>>>> eventually resolve this problem?
>>>>>>>
>>>>>>> Thomas
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which
>>>> their experiments lead.
>>>> -- Norbert Wiener

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
