Hi Sherry,

Thanks! I tried your suggestions and it worked!

For the record, I added these flags:

-mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1

Also, for completeness and since you asked:

size: 2346346 x 2346346
nnz: 60856894
unsymmetric

The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware) specs
are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node. I've been
running on 8 nodes (so 8 x 27 ~ 216 GB).

Thanks again for your help!

Samar
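
P.S. For anyone who digs this out of the archives later: the full solve path,
with the two flags above set programmatically instead of on the command line,
looks roughly like the sketch below. Matrix/vector assembly and error checking
are elided, the wrapper name is made up, and both PetscOptionsSetValue() and
KSPSetOperators() have version-dependent signatures, so check your PETSc.

  #include <petscksp.h>

  /* Illustrative wrapper; assumes A, b, x are already assembled. */
  PetscErrorCode SolveDirectSuperLU(Mat A, Vec b, Vec x)
  {
    KSP ksp;
    PC  pc;

    /* Same effect as -mat_superlu_dist_colperm PARMETIS and
       -mat_superlu_dist_parsymbfact 1 on the command line
       (two-argument form of PetscOptionsSetValue, as in PETSc of that time). */
    PetscOptionsSetValue("-mat_superlu_dist_colperm","PARMETIS");
    PetscOptionsSetValue("-mat_superlu_dist_parsymbfact","1");

    KSPCreate(PETSC_COMM_WORLD,&ksp);
    KSPSetOperators(ksp,A,A);   /* older PETSc also takes a MatStructure flag */
    KSPSetType(ksp,KSPPREONLY); /* no Krylov iterations: factor and solve once */
    KSPGetPC(ksp,&pc);
    PCSetType(pc,PCLU);
    PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
    KSPSetFromOptions(ksp);     /* picks up any -mat_superlu_dist_* flags */
    KSPSolve(ksp,b,x);
    KSPDestroy(&ksp);
    return 0;
  }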

On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <[email protected]> wrote:

> I didn't follow the discussion thread closely ... How large is your matrix
> dimension, and number of nonzeros? How large is the memory per core (or per
> node)?
>
> The default setting in superlu_dist is to use serial symbolic factorization.
> You can turn on parallel symbolic factorization by:
>
> options.ParSymbFact = YES;
> options.ColPerm = PARMETIS;
>
> Is your matrix symmetric? If so, you still need to give both the upper and
> lower halves of matrix A to superlu, which doesn't exploit symmetry.
>
> Do you know whether you need numerical pivoting? If not, you can turn off
> pivoting by:
>
> options.RowPerm = NATURAL;
>
> This avoids another serial bottleneck.
>
> All these options can be turned on in the PETSc interface. Please check out
> the syntax there.
>
> Sherry
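
As an aside, for anyone driving SuperLU_DIST directly rather than through
PETSc: the three settings Sherry names live in the solver's native options
struct. A minimal sketch follows; the struct and enum names are from the
releases of that era (the type was later renamed superlu_dist_options_t), so
check them against your headers:

  #include "superlu_ddefs.h"

  /* Configure SuperLU_DIST's native options the way Sherry describes. */
  void configure_superlu_options(superlu_options_t *options)
  {
    set_default_options_dist(options); /* sensible defaults for everything else */
    options->ParSymbFact = YES;        /* parallel symbolic factorization */
    options->ColPerm     = PARMETIS;   /* ParSymbFact requires the ParMETIS ordering */
    options->RowPerm     = NOROWPERM;  /* no numerical row pivoting (Sherry's "NATURAL") */
  }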

> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <[email protected]> wrote:
>
> Hi Barry,
>
> You're probably right. I note that the error occurs almost instantly, and
> I've tried increasing the number of CPUs (as many as ~1000 on Yellowstone)
> to no avail. I know this is a big problem, but I didn't think it was that
> big!
>
> Sherry: Is there any way to write out more diagnostic info? E.g., how much
> memory superlu thinks it needs/is attempting to allocate.
>
> Thanks,
>
> Samar
>
> On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:
>
> >> I tried superlu_dist again and it crashes even more quickly than MUMPS,
> >> with just the following error:
> >>
> >> ERROR: 0031-250 task 128: Killed
> >
> > This is usually a symptom of running out of memory.
> >
> >> Absolutely nothing else is written out to either stderr or stdout. This
> >> is with -mat_superlu_dist_statprint. The program works fine on a smaller
> >> matrix.
> >>
> >> This is the sequence of calls:
> >>
> >> KSPSetType(ksp,KSPPREONLY);
> >> PCSetType(pc,PCLU);
> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
> >> KSPSetFromOptions(ksp);
> >> PCSetFromOptions(pc);
> >> KSPSolve(ksp,b,x);
> >>
> >> All of these return successfully *except* the very last one, KSPSolve.
> >>
> >> Any help would be appreciated. Thanks!
> >>
> >> Samar
> >>
> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
> >>
> >>> Samar:
> >>> If you include the error message printed when superlu_dist crashes, I
> >>> can probably identify the reason. (Better yet, include the printout
> >>> before the crash.)
> >>>
> >>> Sherry
> >>>
> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
> >>>
> >>> Samar:
> >>> There are limitations for direct solvers. Do not expect that any solver
> >>> can be used on arbitrarily large problems. Since superlu_dist also
> >>> crashes, direct solvers may not be able to work on your application.
> >>> This is why I suggested increasing the size incrementally. You may have
> >>> to experiment with other types of solvers.
> >>>
> >>> Hong
> >>>
> >>> Hi Hong and Jed,
> >>>
> >>> Many thanks for replying. It would indeed be nice if the error messages
> >>> from MUMPS were less cryptic!
> >>>
> >>> 1) I have tried smaller matrices, although given how my problem is set
> >>> up a jump is difficult to avoid. But it is a good idea that I will try.
> >>>
> >>> 2) I did try various orderings, but not the one you suggested.
> >>>
> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
> >>> termination of the program (there should be more error messages if, for
> >>> example, memory were the problem). I therefore thought it might be an
> >>> interface problem rather than one with MUMPS, and turned to the
> >>> petsc-users group first.
> >>>
> >>> 4) I've tried superlu_dist, but it also crashes (also unclear as to
> >>> why), at which point I decided to try MUMPS. The fact that both crash
> >>> would again indicate a common (memory?) problem.
> >>>
> >>> I'll try a few more things before asking the MUMPS developers.
> >>>
> >>> Thanks again for your help!
> >>>
> >>> Samar
> >>>
> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
> >>>
> >>>> Samar:
> >>>> The crash occurs in
> >>>> ...
> >>>> [161]PETSC ERROR: Error in external library!
> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization
> >>>> phase: INFO(1)=-1, INFO(2)=48
> >>>>
> >>>> For a very large matrix, this is likely a memory problem, as you
> >>>> suspected. I would suggest:
> >>>> 1. Run problems of increasing size (do not jump from a small one to a
> >>>> very large one) and observe memory usage using '-ksp_view'. I see you
> >>>> use '-mat_mumps_icntl_14 1000', i.e., the percentage increase in the
> >>>> estimated workspace. Is it too large? Anyway, this input should not
> >>>> cause the crash, I guess.
> >>>> 2. Experiment with different matrix orderings, -mat_mumps_icntl_7 <>
> >>>> (I usually use sequential ordering 2). I see you use the parallel
> >>>> ordering -mat_mumps_icntl_29 2.
> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
> >>>> 4. Try other direct solvers, e.g., superlu_dist.
> >>>>
> >>>> …
> >>>>
> >>>> etc etc. The above error, I can tell, has something to do with
> >>>> processor 48 (INFO(2)) and so forth, but not the previous one.
> >>>>
> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
> >>>> attached file. Any hints as to what could be giving this error would
> >>>> be very much appreciated.
> >>>>
> >>>> I do not know how to interpret this output file. The MUMPS developers
> >>>> would give you better suggestions on it. I would appreciate learning
> >>>> as well :-)
> >>>>
> >>>> Hong
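
One more note for anyone reading this in the archives: Hong's MUMPS settings
quoted just above map onto PETSc's options database the same way as the
SuperLU_DIST flags. A minimal sketch, using only values that appear in this
thread, not recommendations (Samar had been using the parallel ordering
-mat_mumps_icntl_29 2 rather than a sequential one); again, this assumes the
two-argument PetscOptionsSetValue() of the PETSc of that time:

  #include <petscsys.h>

  /* MUMPS knobs discussed in the thread; call before KSPSetFromOptions(). */
  void set_mumps_options(void)
  {
    PetscOptionsSetValue("-mat_mumps_icntl_4","3");     /* verbose MUMPS output */
    PetscOptionsSetValue("-mat_mumps_icntl_14","1000"); /* % increase of estimated workspace */
    PetscOptionsSetValue("-mat_mumps_icntl_7","2");     /* sequential ordering Hong mentions */
  }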
