MatMatSolve() is supported by the petsc-dev SuperLU_DIST interface.

Hong
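For reference, the calling sequence looks roughly like this (a sketch only; the
factor-type and ordering constants follow the current PETSc naming, e.g.
MATSOLVERSUPERLU_DIST, and may differ slightly across petsc-dev versions):

/* Sketch only: factor A with SuperLU_DIST through the PETSc Mat interface and
 * solve against all columns of a dense B in one MatMatSolve() call.
 * Type/constant names follow current PETSc and may differ in older petsc-dev. */
#include <petscmat.h>

PetscErrorCode FactorAndSolveMany(Mat A, Mat B, Mat X) /* B, X are dense */
{
  Mat            F;
  IS             rperm, cperm;
  MatFactorInfo  info;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatGetFactor(A, MATSOLVERSUPERLU_DIST, MAT_FACTOR_LU, &F);CHKERRQ(ierr);
  ierr = MatGetOrdering(A, MATORDERINGNATURAL, &rperm, &cperm);CHKERRQ(ierr);
  ierr = MatFactorInfoInitialize(&info);CHKERRQ(ierr);
  ierr = MatLUFactorSymbolic(F, A, rperm, cperm, &info);CHKERRQ(ierr);
  ierr = MatLUFactorNumeric(F, A, &info);CHKERRQ(ierr);
  /* All right-hand sides (columns of B) are handled in a single call */
  ierr = MatMatSolve(F, B, X);CHKERRQ(ierr);
  ierr = ISDestroy(&rperm);CHKERRQ(ierr);
  ierr = ISDestroy(&cperm);CHKERRQ(ierr);
  ierr = MatDestroy(&F);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Here B and X are dense matrices whose rows are laid out the same way as A's.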
On Thu, Aug 18, 2011 at 5:32 PM, Jack Poulson <jack.poulson at gmail.com> wrote:
> Hello,
>
> On Thu, Aug 18, 2011 at 5:11 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
>>
>> I can clarify a couple of questions re. SuperLU_DIST.
>>
>> 1) SuperLU does support multiple right-hand sides. That is, the B matrix
>> on the right can be a dense matrix of size n-by-nrhs. Also, the B matrix
>> is distributed among all the processors; each processor takes one block
>> row of B. There is no need to have the entire B on every processor.
>>
>
> I certainly agree that this is an improvement over MUMPS.
>
>>
>> 2) We are preparing to upgrade to a newer version. The parallel
>> factorization is improved with a better scheduling algorithm. This is
>> particularly effective at larger core counts, say in the 100s.
>>
>
> I will be looking forward to the new version. I took a several-month hiatus
> from trying to use sparse-direct solvers for my subproblems since I wasn't
> seeing substantial speedups over the serial algorithm; it was taking a few
> hundred cores to solve 3d Helmholtz over 256^3 domains in an hour.
>
>>
>> 3) Regarding memory usage, the factorization algorithm used in SuperLU
>> mostly performs "update in-place" and requires just a little extra
>> working storage. So, if the load balance is not too bad, the memory per
>> core should go down consistently. It will be helpful if you can give some
>> concrete numbers showing how much memory usage increases with increasing
>> core count.
>>
>
> As for specifics, factorizations of 256 x 256 x 10 grids with 7-point
> finite-difference stencils required more memory per process when I
> increased past ~200 processes. I was storing roughly 50 of these
> factorizations before running out of memory, no matter how many more
> processes I threw at SuperLU_DIST and MUMPS. I would think that the load
> balance would be decent since it is such a regular grid.
>
> Also, if I'm recalling correctly, on 256 cores I was only seeing ~5 GFlops
> in each triangle solve against a subdomain, which isn't much of an
> improvement over my laptop. Does this sound pathological to you, or is it
> to be expected from SuperLU_DIST and MUMPS? The linked WSMP paper showed
> some impressive triangle-solve scalability. Will the new scheduling
> algorithm affect the solves as well?
>
> Best Regards,
> Jack
>
>>
>> On Tue, Aug 16, 2011 at 8:34 PM, Rebecca Yuan <rebeccayxf at gmail.com> wrote:
>>>
>>> Begin forwarded message:
>>>
>>> From: Jack Poulson
>>> Date: August 16, 2011 10:18:16 PM CDT
>>> To: For users of the development version of PETSc <petsc-dev at mcs.anl.gov>
>>> Subject: Re: [petsc-dev] Wrapper for WSMP
>>> Reply-To: For users of the development version of PETSc <petsc-dev at mcs.anl.gov>
>>>
>>> On Tue, Aug 16, 2011 at 9:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> On Aug 16, 2011, at 5:14 PM, Jack Poulson wrote:
>>>>
>>>> > Hello all,
>>>> >
>>>> > I am working on a project that requires very fast sparse direct solves,
>>>> > and MUMPS and SuperLU_DIST haven't been cutting it. From what I've
>>>> > read, when properly tuned, WSMP is significantly faster, particularly
>>>> > with multiple right-hand sides on large machines. The obvious drawback
>>>> > is that it's not open source, but the binaries seem to be readily
>>>> > available for most platforms.
>>>> >
>>>> > Before I reinvent the wheel, I would like to check if anyone has
>>>> > already done some work on adding it into PETSc. If not, its interface
>>>> > is quite similar to MUMPS and I should be able to mirror most of that
>>>> > code. On the other hand, there are a large number of platform-specific
>>>> > details that need to be handled, so keeping things both portable and
>>>> > fast might be a challenge. It seems that the CSC storage format should
>>>> > be used since it is required for Hermitian matrices.
>>>> >
>>>> > Thanks,
>>>> > Jack
>>>>
>>>> Jack,
>>>>
>>>> By all means do it. That would be a nice thing to have. But be aware
>>>> that the WSMP folks have a reputation for exaggerating how much better
>>>> their software is, so don't be surprised if after all that work it is
>>>> not much better.
>>>>
>>>
>>> Good to know. I was somewhat worried about that, but perhaps it is a
>>> matter of getting all of the tuning parameters right. The manual does
>>> mention that performance is significantly degraded without tuning. I
>>> would sincerely hope no one would outright lie in their publications,
>>> e.g., this one: http://portal.acm.org/citation.cfm?id=1654061
>>>
>>>> BTW: are you solving with many right-hand sides? Maybe before you muck
>>>> with WSMP we should figure out how to get you access to the multiple
>>>> right-hand-side support of MUMPS (I don't know if SuperLU_DIST has it)
>>>> so you can speed up your current computations a good amount? Currently
>>>> PETSc's MatMatSolve() calls a separate solve for each right-hand side
>>>> with MUMPS.
>>>>
>>>> Barry
>>>>
>>>
>>> I will eventually need to solve against many right-hand sides, but for
>>> now I am solving against one and it is still taking too long; in fact,
>>> not only does it take too long, but memory per core increases for fixed
>>> problem sizes as I increase the number of MPI processes (for both
>>> SuperLU_DIST and MUMPS). This was occurring for quasi-2d Helmholtz
>>> problems over a couple hundred cores. My only logical explanation for
>>> this behavior is that the communication buffers on each process grow in
>>> proportion to the number of processes, but I stress that this is just a
>>> guess. I tried reading through the MUMPS code and quickly gave up.
>>>
>>> Another problem with MUMPS is that it requires the entire set of
>>> right-hand sides to reside on the root process... that will clearly not
>>> work for a billion degrees of freedom with several hundred RHSs. WSMP
>>> gets this part right and actually distributes those vectors.
>>>
>>> Jack
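On the distributed right-hand-side point above: with a parallel dense B in
PETSc, each rank only ever holds its own block row, along the lines of the
sketch below (MatCreateDense and MatDenseGetLDA are the current names and are
assumptions for the petsc-dev of the time; the fill values are placeholders):

/* Sketch only: build a dense right-hand-side matrix B whose rows are
 * distributed across processes (each rank owns one block row), instead of
 * gathering every RHS on the root process. */
#include <petscmat.h>

PetscErrorCode BuildDistributedRHS(MPI_Comm comm, PetscInt mlocal, PetscInt M,
                                   PetscInt nrhs, Mat *B)
{
  PetscScalar   *b;
  PetscInt       rstart, rend, lda, i, j;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatCreateDense(comm, mlocal, PETSC_DECIDE, M, nrhs, NULL, B);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(*B, &rstart, &rend);CHKERRQ(ierr);
  ierr = MatDenseGetLDA(*B, &lda);CHKERRQ(ierr);
  ierr = MatDenseGetArray(*B, &b);CHKERRQ(ierr);
  /* Fill only the locally owned rows; the local array is column-major.
   * Placeholder entries: in practice fill with the actual RHS data. */
  for (j = 0; j < nrhs; j++)
    for (i = 0; i < rend - rstart; i++) b[i + j*lda] = (PetscScalar)(rstart + i);
  ierr = MatDenseRestoreArray(*B, &b);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(*B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

A B built this way can be passed, together with a matching dense X, straight
to MatMatSolve() as in the earlier sketch.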
