Jack, I also considered this problem. The 4 MPI tasks of each coarse space matrix should all run on one node (each node contains 4 dual-core CPUs). I'm not 100% sure, but I discussed this with the administrators of the system: the scheduler should always place the first 8 ranks on the first node, and so on. And the coarse space matrices are built on ranks 0-3, 4-7, ...
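Just to make the layout concrete, here is a minimal sketch of splitting MPI_COMM_WORLD into sub-communicators of four consecutive ranks with MPI_Comm_split. It is only an illustration; the names GROUP_SIZE and coarse_comm are placeholders and not taken from my code:

#include <mpi.h>

#define GROUP_SIZE 4  /* illustrative: four ranks per coarse space */

int main(int argc, char **argv)
{
  int      world_rank;
  MPI_Comm coarse_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Ranks sharing a color end up in the same sub-communicator (0-3, 4-7, ...);
     using world_rank as the key keeps the original rank order inside it. */
  MPI_Comm_split(MPI_COMM_WORLD, world_rank / GROUP_SIZE, world_rank, &coarse_comm);

  /* ... build the coarse space MatAIJ and KSP on coarse_comm ... */

  MPI_Comm_free(&coarse_comm);
  MPI_Finalize();
  return 0;
}

With contiguous groups like this, each group of four ranks should indeed stay within one node, as long as the batch system really fills the nodes with consecutive ranks.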
I'm running some benchmarks at the moment in which I replaced UMFPACK with MUMPS for the local LU factorization. Each matrix and the corresponding KSP object are defined on PETSC_COMM_SELF, and the problem is perfectly balanced (the grid is a unit square, uniformly refined). Let's see...

(A minimal sketch of the stage-based timing Barry suggested is appended below, after the quoted thread.)

Thomas

Quoting Jack Poulson <jack.poulson at gmail.com>:

> Hi Thomas,
>
> Network topology is important. Since most machines are not fully connected,
> random subsets of four processes will become more scattered about the
> cluster as you increase your total number of processes.
>
> Jack
>
> On Dec 20, 2012 12:39 PM, "Thomas Witkowski" <Thomas.Witkowski at tu-dresden.de> wrote:
>
>> I cannot use the information from log_summary, as I have three different
>> LU factorizations and solves (local matrices and two hierarchies of coarse
>> grids). Therefore, I use the following workaround to get the timing of the
>> solve I'm interested in:
>>
>>   MPI::COMM_WORLD.Barrier();
>>   wtime = MPI::Wtime();
>>   KSPSolve(*(data->ksp_schur_primal_local), tmp_primal, tmp_primal);
>>   FetiTimings::fetiSolve03 += (MPI::Wtime() - wtime);
>>
>> The factorization is done explicitly beforehand with KSPSetUp, so I can
>> also measure the time for the LU factorization. It does not scale either!
>> For 64 cores it takes 0.05 seconds, for 1024 cores 1.2 seconds. In all
>> calculations, the local coarse space matrices defined on four cores have
>> exactly the same number of rows and exactly the same number of non-zero
>> entries. So, from my point of view, the time should be absolutely constant.
>>
>> Thomas
>>
>> Quoting Barry Smith <bsmith at mcs.anl.gov>:
>>
>>> Are you timing ONLY the time to factor and solve the subproblems? Or
>>> also the time to get the data to the collection of 4 cores at a time?
>>>
>>> If you are only using LU for these problems and not elsewhere in the
>>> code, you can get the factorization and solve time from MatLUFactor() and
>>> MatSolve(), or you can use stages to put this calculation in its own stage
>>> and use the MatLUFactor() and MatSolve() time from that stage.
>>> Also look at the load balancing column for the factorization and solve
>>> stage; is it well balanced?
>>>
>>> Barry
>>>
>>> On Dec 20, 2012, at 2:16 PM, Thomas Witkowski <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>>> In my multilevel FETI-DP code, I have localized coarse matrices, which
>>>> are defined on only a subset of all MPI tasks, typically between 4 and
>>>> 64 tasks. The MatAIJ and the KSP objects are both defined on an MPI
>>>> communicator which is a subset of MPI::COMM_WORLD. The LU factorization
>>>> of the matrices is computed with either MUMPS or superlu_dist, but both
>>>> show a scaling behavior I really wonder about: when the overall problem
>>>> size is increased, the solve with the LU factorization of the local
>>>> matrices does not scale! But why not? I just increase the number of
>>>> local matrices, and all of them are independent of each other. An
>>>> example: I use 64 cores, each coarse matrix is spanned by 4 cores, so
>>>> there are 16 MPI communicators with 16 coarse space matrices. The
>>>> problem requires 192 solves with the coarse space systems, which
>>>> together take 0.09 seconds. Now I increase the number of cores to 256,
>>>> but let the local coarse spaces again be defined on only 4 cores. Again,
>>>> 192 solves with these coarse spaces are required, but now this takes
>>>> 0.24 seconds.
>>>> The same happens for 1024 cores, and we are at 1.7 seconds for the local
>>>> coarse space solves!
>>>>
>>>> For me, this is a total mystery! Any idea how to explain, debug and
>>>> eventually resolve this problem?
>>>>
>>>> Thomas
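Regarding the stage-based timing Barry suggests above, here is a minimal, self-contained sketch of how I understand it. It assumes a recent PETSc; the tridiagonal test matrix, sizes, and stage names are placeholders and not taken from the FETI-DP code. With the stages in place, -log_summary (or -log_view in newer releases) reports time and load balance for the factorization and the solves separately:

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, b;
  KSP           ksp;
  PC            pc;
  PetscLogStage stage_fact, stage_solve;
  PetscInt      i, n = 100;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Placeholder coarse space matrix: a small sequential tridiagonal system. */
  MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A);
  for (i = 0; i < n; i++) {
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  /* Direct solve on PETSC_COMM_SELF: KSPPREONLY + PCLU. To use MUMPS
     instead of PETSc's built-in LU (assuming PETSc was configured with
     MUMPS), one could add PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);
     in older releases this routine was PCFactorSetMatSolverPackage(). */
  KSPCreate(PETSC_COMM_SELF, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPPREONLY);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCLU);
  KSPSetFromOptions(ksp);

  /* Separate logging stages for factorization and solve, so the log output
     shows their timings and load balance in their own sections. */
  PetscLogStageRegister("CoarseFactor", &stage_fact);
  PetscLogStageRegister("CoarseSolve", &stage_solve);

  PetscLogStagePush(stage_fact);
  KSPSetUp(ksp);                 /* triggers the LU factorization */
  PetscLogStagePop();

  PetscLogStagePush(stage_solve);
  KSPSolve(ksp, b, x);           /* forward/backward substitution only */
  PetscLogStagePop();

  KSPDestroy(&ksp);
  MatDestroy(&A);
  VecDestroy(&x);
  VecDestroy(&b);
  PetscFinalize();
  return 0;
}

If each local KSP in the real code is wrapped this way, the stage rows in the log output should show whether the extra time at 1024 cores comes from the factorization, from the solves, or from imbalance between the ranks.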
