So, here it is. Just compile and run with

  mpiexec -np 64 ./ex10 -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package superlu_dist -log_summary

64 cores:   0.09 seconds for solving
1024 cores: 2.6 seconds for solving

Thomas
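(The ex10.c attachment itself is scrubbed by the list archive; see the note at the end of the thread. A minimal sketch of the kind of setup discussed in the quoted messages below, splitting PETSC_COMM_WORLD into groups of four ranks with MPI_Comm_split and building the matrix and KSP entirely on the subcommunicator, might look like the following. The toy 1-D Laplacian, the variable names, and the use of current PETSc calling sequences are assumptions for illustration, not taken from the actual attachment.)

  /* Sketch (not the attached ex10.c): split PETSC_COMM_WORLD into groups of
   * four ranks and solve an independent small system on each subcommunicator,
   * so the direct solve involves no communication between the groups. */
  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat         A;
    Vec         b, x;
    KSP         ksp;
    MPI_Comm    subcomm;
    PetscMPIInt rank;
    PetscInt    i, istart, iend, n = 1000;

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    /* Every block of four consecutive ranks forms its own communicator. */
    MPI_Comm_split(PETSC_COMM_WORLD, rank / 4, rank, &subcomm);

    /* A small 1-D Laplacian, assembled on the subcommunicator only. */
    MatCreate(subcomm, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &istart, &iend);
    for (i = istart; i < iend; i++) {
      if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
      if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
      MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* The KSP lives on the subcommunicator as well. */
    KSPCreate(subcomm, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);   /* picks up -ksp_type preonly -pc_type lu ... */
    KSPSetUp(ksp);            /* LU factorization happens here */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp);
    VecDestroy(&x);
    VecDestroy(&b);
    MatDestroy(&A);
    MPI_Comm_free(&subcomm);
    PetscFinalize();
    return 0;
  }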
Quoting Jed Brown <jedbrown at mcs.anl.gov>:

> Can you reproduce this in a simpler environment so that we can report it?
> As I understand your statement, it sounds like you could reproduce it by
> changing src/ksp/ksp/examples/tutorials/ex10.c to create a subcomm of size
> 4 and then using that everywhere, then compare log_summary running on 4
> cores to running on more (despite everything really being independent).
>
> It would also be worth using an MPI profiler to see if it is really spending
> a lot of time in MPI_Iprobe. Since SuperLU_DIST does not use MPI_Iprobe, it
> may be something else.
>
> On Fri, Dec 21, 2012 at 8:51 AM, Thomas Witkowski
> <Thomas.Witkowski at tu-dresden.de> wrote:
>
>> I use a modified MPICH version. On the system I use for these benchmarks I
>> cannot use another MPI library.
>>
>> I am not fixed on MUMPS. Superlu_dist, for example, also works perfectly
>> for this. But there is still the following problem I cannot solve: when I
>> increase the number of coarse space matrices, there seems to be no direct
>> solver that scales for this. Just to summarize:
>> - one coarse space matrix is always created by one "cluster" consisting of
>>   four subdomains/MPI tasks
>> - the four tasks are always local to one node, so inter-node network
>>   communication is not required for computing the factorization and solves
>> - independent of the number of clusters, the coarse space matrices are the
>>   same: they have the same number of rows and the same nnz structure, but
>>   possibly different values
>> - there is NO load imbalance
>> - the matrices must be factorized and there are a lot of solves (> 100)
>>   with them
>>
>> It should be pretty clear that computing the LU factorization and solving
>> with it should scale perfectly. But at the moment, none of the direct
>> solvers I tried (MUMPS, superlu_dist, PaStiX) is able to scale. The loss of
>> scalability is really bad, as you can see from the numbers I sent before.
>>
>> Any ideas? Suggestions? Without a scaling solver for this kind of system,
>> my multilevel FETI-DP code is more or less a joke, only some orders of
>> magnitude slower than the standard FETI-DP method :)
>>
>> Thomas
>>
>> Quoting Jed Brown <jedbrown at mcs.anl.gov>:
>>
>>> MUMPS uses MPI_Iprobe on MPI_COMM_WORLD (hard-coded). What MPI
>>> implementation have you been using? Is the behavior different with a
>>> different implementation?
>>>
>>> On Fri, Dec 21, 2012 at 2:36 AM, Thomas Witkowski
>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> Okay, I did a similar benchmark now with PETSc's event logging:
>>>>
>>>> UMFPACK
>>>> 16p:  Local solve  350 1.0 2.3025e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 63  0  0  0 52  63  0  0  0 51     0
>>>> 64p:  Local solve  350 1.0 2.3208e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 60  0  0  0 52  60  0  0  0 51     0
>>>> 256p: Local solve  350 1.0 2.3373e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 49  0  0  0 52  49  0  0  0 51     1
>>>>
>>>> MUMPS
>>>> 16p:  Local solve  350 1.0 4.7183e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 75  0  0  0 52  75  0  0  0 51     0
>>>> 64p:  Local solve  350 1.0 7.1409e+01 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 78  0  0  0 52  78  0  0  0 51     0
>>>> 256p: Local solve  350 1.0 2.6079e+02 1.1 5.00e+04 1.0 0.0e+00 0.0e+00 7.0e+02 82  0  0  0 52  82  0  0  0 51     0
>>>>
>>>> As you see, the local solves with UMFPACK take nearly constant time with
>>>> an increasing number of subdomains. This is what I expect. Then I replace
>>>> UMFPACK by MUMPS and I see increasing time for the local solves. In the
>>>> last columns, UMFPACK has a decreasing value from 63 to 49, while MUMPS's
>>>> column increases here from 75 to 82. What does this mean?
>>>>
>>>> Thomas
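(For reference, an event row such as "Local solve" above comes from PETSc's user event logging. A minimal sketch of how such an event could be registered and used is given here; the function names and the surrounding solve routine are assumptions for illustration, not the poster's actual code.)

  #include <petscksp.h>

  /* Sketch: a custom "Local solve" event like the one in the tables above.
   * Register it once after PetscInitialize(), then bracket every coarse-space
   * solve with Begin/End so -log_summary aggregates count and time. */

  static PetscLogEvent LocalSolveEvent;

  PetscErrorCode RegisterLocalSolveEvent(void)
  {
    return PetscLogEventRegister("Local solve", KSP_CLASSID, &LocalSolveEvent);
  }

  PetscErrorCode LocalCoarseSolve(KSP ksp, Vec b, Vec x)
  {
    PetscLogEventBegin(LocalSolveEvent, 0, 0, 0, 0);  /* start timing */
    KSPSolve(ksp, b, x);                              /* the timed operation */
    PetscLogEventEnd(LocalSolveEvent, 0, 0, 0, 0);    /* stop timing */
    return 0;
  }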
>>>>
>>>> On 21.12.2012, at 02:19, Matthew Knepley wrote:
>>>>
>>>>> On Thu, Dec 20, 2012 at 3:39 PM, Thomas Witkowski
>>>>> <Thomas.Witkowski at tu-dresden.de> wrote:
>>>>>
>>>>>> I cannot use the information from log_summary, as I have three
>>>>>> different LU factorizations and solves (local matrices and two
>>>>>> hierarchies of coarse grids). Therefore, I use the following workaround
>>>>>> to get the timing of the solve I am interested in:
>>>>>
>>>>> You misunderstand how to use logging. You just put these things in
>>>>> separate stages. Stages represent parts of the code over which events
>>>>> are aggregated.
>>>>>
>>>>>    Matt
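(A minimal sketch of the separate-stages approach Matt describes here, and Barry suggests below, assuming three stages for the three factorization/solve phases mentioned above; the stage names and placement are illustrative, not the poster's actual code.)

  #include <petscksp.h>

  /* Sketch: one log stage per solver phase, so -log_summary reports
   * the MatLUFactor/MatSolve/KSPSolve times separately for each phase. */
  int main(int argc, char **argv)
  {
    PetscLogStage stageLocal, stageCoarse1, stageCoarse2;

    PetscInitialize(&argc, &argv, NULL, NULL);

    PetscLogStageRegister("Local solves",   &stageLocal);
    PetscLogStageRegister("Coarse level 1", &stageCoarse1);
    PetscLogStageRegister("Coarse level 2", &stageCoarse2);

    PetscLogStagePush(stageLocal);
    /* ... factorize and solve the local matrices ... */
    PetscLogStagePop();

    PetscLogStagePush(stageCoarse1);
    /* ... factorize and solve the first coarse-grid hierarchy ... */
    PetscLogStagePop();

    PetscLogStagePush(stageCoarse2);
    /* ... factorize and solve the second coarse-grid hierarchy ... */
    PetscLogStagePop();

    PetscFinalize();
    return 0;
  }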
>>>>>
>>>>>> MPI::COMM_WORLD.Barrier();
>>>>>> wtime = MPI::Wtime();
>>>>>> KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>>>>> FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>>>>>
>>>>>> The factorization is done explicitly before with "KSPSetUp", so I can
>>>>>> measure the time for the LU factorization. It also does not scale! For
>>>>>> 64 cores it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>>>>>> calculations, the local coarse space matrices defined on four cores
>>>>>> have exactly the same number of rows and exactly the same number of
>>>>>> non-zero entries. So, from my point of view, the time should be
>>>>>> absolutely constant.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>>>>>
>>>>>>> Are you timing ONLY the time to factor and solve the subproblems? Or
>>>>>>> also the time to get the data to the collection of 4 cores at a time?
>>>>>>>
>>>>>>> If you are only using LU for these problems and not elsewhere in the
>>>>>>> code, you can get the factorization and solve time from MatLUFactor()
>>>>>>> and MatSolve(), or you can use stages to put this calculation in its
>>>>>>> own stage and use the MatLUFactor() and MatSolve() time from that
>>>>>>> stage. Also look at the load balancing column for the factorization
>>>>>>> and solve stage; is it well balanced?
>>>>>>>
>>>>>>>    Barry
>>>>>>>
>>>>>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski
>>>>>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>>>>>
>>>>>>>> In my multilevel FETI-DP code, I have localized coarse matrices,
>>>>>>>> which are defined on only a subset of all MPI tasks, typically
>>>>>>>> between 4 and 64 tasks. The MatAIJ and the KSP objects are both
>>>>>>>> defined on an MPI communicator which is a subset of MPI::COMM_WORLD.
>>>>>>>> The LU factorization of the matrices is computed with either MUMPS or
>>>>>>>> superlu_dist, but both show scaling behavior I really wonder about:
>>>>>>>> when the overall problem size is increased, the solve with the LU
>>>>>>>> factorization of the local matrices does not scale! But why not? I
>>>>>>>> just increase the number of local matrices, but all of them are
>>>>>>>> independent of each other.
>>>>>>>>
>>>>>>>> Some example: I use 64 cores, each coarse matrix is spanned by 4
>>>>>>>> cores, so there are 16 MPI communicators with 16 coarse space
>>>>>>>> matrices. The problem needs to be solved 192 times with the coarse
>>>>>>>> space systems, and together this takes 0.09 seconds. Now I increase
>>>>>>>> the number of cores to 256, but let the local coarse space again be
>>>>>>>> defined on only 4 cores. Again, 192 solutions with these coarse
>>>>>>>> spaces are required, but now this takes 0.24 seconds. The same for
>>>>>>>> 1024 cores, and we are at 1.7 seconds for the local coarse space
>>>>>>>> solves!
>>>>>>>>
>>>>>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>>>>>> eventually resolve this problem?
>>>>>>>>
>>>>>>>> Thomas
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ex10.c
Type: text/x-c++src
Size: 3496 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20121221/77093f71/attachment.c>
