Hello,

On Thu, Aug 18, 2011 at 5:11 PM, Xiaoye S. Li <xsli at lbl.gov> wrote:
> I can clarify a couple of questions re. SuperLU_DIST.
>
> 1) SuperLU does support multiple right-hand sides. That is, the B matrix
> on the right can be a dense matrix of size n-by-nrhs. Also, the B matrix
> is distributed among all the processors; each processor takes one block
> row of B. There is no need to have the entire B on every processor.
>

I certainly agree that this is an improvement over MUMPS.

> 2) We are preparing to upgrade to a newer version. The parallel
> factorization is improved with a better scheduling algorithm. This is
> particularly effective at larger core counts, say in the 100s.
>

I will be looking forward to the new version. I took a several-month hiatus
from trying to use sparse direct solvers for my subproblems since I wasn't
seeing substantial speedups over the serial algorithm; it was taking a few
hundred cores to solve 3D Helmholtz over 256^3 domains in an hour.

> 3) Regarding memory usage, the factorization algorithm used in SuperLU
> mostly performs "update in-place" and requires just a little bit of extra
> working storage. So, if the load balance is not too bad, the memory per
> core should go down consistently. It will be helpful if you can give some
> concrete numbers showing how much memory usage increases with increasing
> core count.
>

As for specifics, factorizations of 256 x 256 x 10 grids with 7-point
finite-difference stencils required more memory per process when I increased
past ~200 processes. I was storing roughly 50 of these factorizations before
running out of memory, no matter how many more processes I threw at
SuperLU_DIST and MUMPS. I would think that the load balance would be decent
since it is such a regular grid. Also, if I'm recalling correctly, on 256
cores I was only seeing ~5 GFlop/s in each triangular solve against a
subdomain, which isn't much of an improvement over my laptop. Does this sound
pathological to you, or is it to be expected from SuperLU_DIST and MUMPS? The
linked WSMP paper showed some impressive triangular-solve scalability. Will
the new scheduling algorithm affect the solves as well?

Best Regards,
Jack

> On Tue, Aug 16, 2011 at 8:34 PM, Rebecca Yuan <rebeccayxf at gmail.com> wrote:
>>
>> Begin forwarded message:
>>
>> From: Jack Poulson <jack.poulson at gmail.com>
>> Date: August 16, 2011 10:18:16 PM CDT
>> To: For users of the development version of PETSc <petsc-dev at mcs.anl.gov>
>> Subject: Re: [petsc-dev] Wrapper for WSMP
>> Reply-To: For users of the development version of PETSc <petsc-dev at mcs.anl.gov>
>>
>> On Tue, Aug 16, 2011 at 9:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>> On Aug 16, 2011, at 5:14 PM, Jack Poulson wrote:
>>>
>>> > Hello all,
>>> >
>>> > I am working on a project that requires very fast sparse direct
>>> > solves, and MUMPS and SuperLU_DIST haven't been cutting it. From what
>>> > I've read, when properly tuned, WSMP is significantly faster,
>>> > particularly with multiple right-hand sides on large machines. The
>>> > obvious drawback is that it's not open source, but the binaries seem
>>> > to be readily available for most platforms.
>>> >
>>> > Before I reinvent the wheel, I would like to check whether anyone has
>>> > already done some work on adding it to PETSc. If not, its interface is
>>> > quite similar to MUMPS and I should be able to mirror most of that code.
>>> > On the other hand, there are a large number of platform-specific
>>> > details that need to be handled, so keeping things both portable and
>>> > fast might be a challenge. It seems that the CSC storage format should
>>> > be used, since it is required for Hermitian matrices.
>>> >
>>> > Thanks,
>>> > Jack
>>>
>>> Jack,
>>>
>>> By all means do it. That would be a nice thing to have. But be aware
>>> that the WSMP folks have a reputation for exaggerating how much better
>>> their software is, so don't be surprised if, after all that work, it is
>>> not much better.
>>>
>>
>> Good to know. I was somewhat worried about that, but perhaps it is a
>> matter of getting all of the tuning parameters right. The manual does
>> mention that performance is significantly degraded without tuning. I
>> would sincerely hope no one would outright lie in their publications,
>> e.g., this one: http://portal.acm.org/citation.cfm?id=1654061
>>
>>> BTW: are you solving with many right-hand sides? Maybe before you muck
>>> with WSMP we should figure out how to get you access to the multiple
>>> right-hand-side support of MUMPS (I don't know if SuperLU_DIST has it)
>>> so you can speed up your current computations a good amount? Currently
>>> PETSc's MatMatSolve() calls a separate solve for each right-hand side
>>> with MUMPS.
>>>
>>> Barry
>>>
>>
>> I will eventually need to solve against many right-hand sides, but for
>> now I am solving against one, and it is still taking too long; in fact,
>> not only does it take too long, but memory per core increased for fixed
>> problem sizes as I increased the number of MPI processes (for both
>> SuperLU_DIST and MUMPS). This was occurring for quasi-2D Helmholtz
>> problems over a couple hundred cores. My only logical explanation for
>> this behavior is that the communication buffers on each process grow in
>> proportion to the number of processes, but I stress that this is just a
>> guess. I tried reading through the MUMPS code and quickly gave up.
>>
>> Another problem with MUMPS is that it requires the entire set of
>> right-hand sides to reside on the root process... that will clearly not
>> work for a billion degrees of freedom with several hundred RHSs. WSMP
>> gets this part right and actually distributes those vectors.
>>
>> Jack
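For readers following Barry's point above about MatMatSolve(), here is a
minimal sketch of the multiple-right-hand-side workflow through PETSc's
factored-matrix interface. It is not code from this thread: it is written
against a recent PETSc rather than the 2011-era API, the names solve_block,
n, and nrhs are purely illustrative, and it assumes A is an already-assembled
parallel sparse matrix. As Barry notes, with the MUMPS backend of that era
the single MatMatSolve() call still performed one solve per column of B
internally.

#include <petscmat.h>

/* Hypothetical helper: factor A with MUMPS via PETSc and solve against an
   n-by-nrhs block of right-hand sides. MATSOLVERSUPERLU_DIST can be
   substituted for MATSOLVERMUMPS with no other changes. */
static PetscErrorCode solve_block(Mat A, PetscInt n, PetscInt nrhs)
{
  Mat           F, B, X;
  IS            rowperm, colperm;
  MatFactorInfo info;

  PetscFunctionBeginUser;
  /* Obtain a MUMPS LU factor object and run the symbolic/numeric phases.
     External packages choose their own ordering; the natural ordering ISes
     are just placeholders required by the interface. */
  PetscCall(MatGetFactor(A, MATSOLVERMUMPS, MAT_FACTOR_LU, &F));
  PetscCall(MatGetOrdering(A, MATORDERINGNATURAL, &rowperm, &colperm));
  PetscCall(MatFactorInfoInitialize(&info));
  PetscCall(MatLUFactorSymbolic(F, A, rowperm, colperm, &info));
  PetscCall(MatLUFactorNumeric(F, A, &info));

  /* Dense n-by-nrhs right-hand-side and solution blocks, distributed by
     block rows across the communicator. */
  PetscCall(MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, nrhs, NULL, &B));
  PetscCall(MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, nrhs, NULL, &X));
  /* ... set the entries of B and assemble it here ... */

  /* Solve A X = B for all nrhs columns in one call. */
  PetscCall(MatMatSolve(F, B, X));

  PetscCall(ISDestroy(&rowperm));
  PetscCall(ISDestroy(&colperm));
  PetscCall(MatDestroy(&B));
  PetscCall(MatDestroy(&X));
  PetscCall(MatDestroy(&F));
  PetscFunctionReturn(PETSC_SUCCESS);
}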
