I didn't follow the discussion thread closely ... How large are your matrix dimension and number of nonzeros? How much memory do you have per core (or per node)?
The default setting in superlu_dist is to use serial symbolic factorization.
You can turn on parallel symbolic factorization by:

  options.ParSymbFact = YES;
  options.ColPerm = PARMETIS;

Is your matrix symmetric? If so, you need to give both the upper and lower
halves of matrix A to superlu, which doesn't exploit symmetry.

Do you know whether you need numerical pivoting? If not, you can turn off
pivoting by:

  options.RowPerm = NATURAL;

This avoids another serial bottleneck.

All these options can be turned on in the PETSc interface; please check the
syntax there. (A sketch of these settings is appended below, after the quoted
thread.)

Sherry

On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <[email protected]> wrote:

> Hi Barry,
>
> You're probably right. I note that the error occurs almost instantly, and
> I've tried increasing the number of CPUs (as many as ~1000 on Yellowstone)
> to no avail. I know this is a big problem, but I didn't think it was that
> big!
>
> Sherry: Is there any way to write out more diagnostic info? E.g., how much
> memory superlu thinks it needs/is attempting to allocate.
>
> Thanks,
>
> Samar
>
> On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:
>
> >> I tried superlu_dist again and it crashes even more quickly than MUMPS,
> >> with just the following error:
> >>
> >>   ERROR: 0031-250 task 128: Killed
> >
> > This is usually a symptom of running out of memory.
> >
> >> Absolutely nothing else is written out to either stderr or stdout. This
> >> is with -mat_superlu_dist_statprint. The program works fine on a smaller
> >> matrix.
> >>
> >> This is the sequence of calls:
> >>
> >>   KSPSetType(ksp,KSPPREONLY);
> >>   PCSetType(pc,PCLU);
> >>   PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
> >>   KSPSetFromOptions(ksp);
> >>   PCSetFromOptions(pc);
> >>   KSPSolve(ksp,b,x);
> >>
> >> All of these return successfully *except* the very last one, KSPSolve.
> >>
> >> Any help would be appreciated. Thanks!
> >>
> >> Samar
> >>
> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
> >>
> >>> Samar:
> >>> If you include the error message from the superlu_dist crash, I can
> >>> probably tell the reason. (Better yet, include the printout before the
> >>> crash.)
> >>>
> >>> Sherry
> >>>
> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
> >>> Samar:
> >>> There are limitations for direct solvers.
> >>> Do not expect any solver to work on arbitrarily large problems.
> >>> Since superlu_dist also crashes, direct solvers may not be able to work
> >>> on your application.
> >>> This is why I suggest increasing the size incrementally.
> >>> You may have to experiment with other types of solvers.
> >>>
> >>> Hong
> >>>
> >>> Hi Hong and Jed,
> >>>
> >>> Many thanks for replying. It would indeed be nice if the error messages
> >>> from MUMPS were less cryptic!
> >>>
> >>> 1) I have tried smaller matrices, although given how my problem is set
> >>> up, a jump is difficult to avoid. But a good idea that I will try.
> >>>
> >>> 2) I did try various orderings, but not the one you suggested.
> >>>
> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
> >>> termination of the program (there should be more error messages if, for
> >>> example, memory were the problem). I therefore thought it might be an
> >>> interface problem rather than one with MUMPS and turned to the
> >>> petsc-users group first.
> >>>
> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to why),
> >>> at which point I decided to try MUMPS. The fact that both crash would
> >>> again indicate a common (memory?) problem.
> >>>
> >>> I'll try a few more things before asking the MUMPS developers.
> >>>
> >>> Thanks again for your help!
> >>>
> >>> Samar
> >>>
> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
> >>>
> >>>> Samar:
> >>>> The crash occurs in
> >>>> ...
> >>>> [161]PETSC ERROR: Error in external library!
> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFO(1)=-1, INFO(2)=48
> >>>>
> >>>> For a very large matrix, this is likely a memory problem, as you
> >>>> suspected. I would suggest:
> >>>> 1. Run problems with increasing sizes (do not jump from a small one to
> >>>> a very large one) and observe memory usage using '-ksp_view'.
> >>>> I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage increase
> >>>> of the estimated workspace. Is it too large? Anyway, this input should
> >>>> not cause the crash, I guess.
> >>>> 2. Experiment with different matrix orderings, -mat_mumps_icntl_7 <>
> >>>> (I usually use sequential ordering 2). I see you use parallel ordering,
> >>>> -mat_mumps_icntl_29 2.
> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
> >>>> 4. Try other direct solvers, e.g., superlu_dist.
> >>>>
> >>>> ...
> >>>>
> >>>> etc. etc. The above error, I can tell, has something to do with
> >>>> processor 48 (INFO(2)) and so forth, but not the previous one.
> >>>>
> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
> >>>> attached file. Any hints as to what could be giving this error would be
> >>>> very much appreciated.
> >>>>
> >>>> I do not know how to interpret this output file. The MUMPS developers
> >>>> would give you better suggestions on it. I would appreciate learning as
> >>>> well :-)
> >>>>
> >>>> Hong
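[Editor's note] For concreteness, here is a minimal sketch of the settings
suggested at the top of this message when driving SuperLU_DIST directly
through its C options struct. The field names (ParSymbFact, ColPerm, RowPerm)
come from the message itself; set_default_options_dist() and the
superlu_ddefs.h header are the standard SuperLU_DIST entry points, and
NOROWPERM is how the SuperLU_DIST headers spell the "natural"/no-pivoting row
permutation referred to above. The PETSc runtime flags in the comments are
assumptions inferred from the -mat_superlu_dist_statprint flag used earlier in
the thread; check your PETSc version for the exact spellings.

  #include <superlu_ddefs.h>   /* SuperLU_DIST double-precision distributed driver */

  /* Recent SuperLU_DIST releases rename this struct superlu_dist_options_t;
     adjust to match your installation. */
  void configure_superlu_dist(superlu_options_t *options)
  {
      /* Start from the library defaults, then override as suggested above. */
      set_default_options_dist(options);

      /* Parallel (instead of serial) symbolic factorization; this requires
         a ParMETIS column ordering.
         Assumed PETSc flags: -mat_superlu_dist_parsymbfact
                              -mat_superlu_dist_colperm PARMETIS */
      options->ParSymbFact = YES;
      options->ColPerm     = PARMETIS;

      /* If numerical pivoting is not needed, skip the row permutation
         (the "NATURAL" / no-pivoting choice mentioned above), removing
         another serial bottleneck.
         Assumed PETSc flag: -mat_superlu_dist_rowperm NATURAL */
      options->RowPerm = NOROWPERM;
  }

When going through PETSc, as in Samar's KSP/PC snippet, one would not call
this directly; the same switches should be reachable as runtime options, with
the struct form shown here only to make the three settings concrete.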
