Very good! Thanks for the update. I guess you are using all 16 cores per node? Since superlu_dist is currently MPI-only, if you generate 16 MPI tasks per node, the serial symbolic factorization has less than 2 GB of memory to work with.

Sherry
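One way to act on this, sketched with a generic MPI launcher (the placement flag is the Open MPI spelling and ./myapp is a placeholder; Yellowstone's own launcher spells this differently), is to undersubscribe the nodes so that the task doing the serial symbolic factorization sees more memory:

    # 8 nodes x 4 tasks/node = 32 tasks instead of 128, so the task doing the
    # serial symbolic factorization sees roughly 27/4 ~ 7 GB of the usable
    # node memory rather than 27/16 ~ 1.7 GB.
    mpiexec -n 32 --map-by ppr:4:node ./myapp \
        -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_package superlu_dist

The alternative, taken in the thread below, is to keep all 16 tasks per node and switch the symbolic factorization itself to parallel.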
On Tue, Feb 25, 2014 at 12:22 PM, Samar Khatiwala <[email protected]> wrote:

> Hi Sherry,
>
> Thanks! I tried your suggestions and it worked!
>
> For the record I added these flags: -mat_superlu_dist_colperm PARMETIS
> -mat_superlu_dist_parsymbfact 1
>
> Also, for completeness and since you asked:
>
> size: 2346346 x 2346346
> nnz: 60856894
> unsymmetric
>
> The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware)
> specs are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
> I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
>
> Thanks again for your help!
>
> Samar
>
> On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <[email protected]> wrote:
>
> I didn't follow the discussion thread closely ... How large is your matrix
> dimension, and number of nonzeros?
> How large is the memory per core (or per node)?
>
> The default setting in superlu_dist is to use serial symbolic
> factorization. You can turn on parallel symbolic factorization by:
>
> options.ParSymbFact = YES;
> options.ColPerm = PARMETIS;
>
> Is your matrix symmetric? If so, you need to give both the upper and lower
> halves of matrix A to superlu, which doesn't exploit symmetry.
>
> Do you know whether you need numerical pivoting? If not, you can turn off
> pivoting by:
>
> options.RowPerm = NATURAL;
>
> This avoids another serial bottleneck.
>
> All these options can be turned on in the PETSc interface. Please check
> out the syntax there.
>
> Sherry
>
> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <[email protected]> wrote:
>
>> Hi Barry,
>>
>> You're probably right. I note that the error occurs almost instantly and
>> I've tried increasing the number of CPUs
>> (as many as ~1000 on Yellowstone) to no avail. I know this is a big
>> problem but I didn't think it was that big!
>>
>> Sherry: Is there any way to write out more diagnostic info? E.g., how much
>> memory superlu thinks it needs/is attempting
>> to allocate.
>>
>> Thanks,
>>
>> Samar
>>
>> On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:
>> >
>> >> I tried superlu_dist again and it crashes even more quickly than MUMPS
>> >> with just the following error:
>> >>
>> >> ERROR: 0031-250 task 128: Killed
>> >
>> > This is usually a symptom of running out of memory.
>> >
>> >> Absolutely nothing else is written out to either stderr or stdout.
>> >> This is with -mat_superlu_dist_statprint.
>> >> The program works fine on a smaller matrix.
>> >>
>> >> This is the sequence of calls:
>> >>
>> >> KSPSetType(ksp,KSPPREONLY);
>> >> PCSetType(pc,PCLU);
>> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>> >> KSPSetFromOptions(ksp);
>> >> PCSetFromOptions(pc);
>> >> KSPSolve(ksp,b,x);
>> >>
>> >> All of these return successfully *except* the very last one,
>> >> KSPSolve.
>> >>
>> >> Any help would be appreciated. Thanks!
>> >>
>> >> Samar
>> >>
>> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
>> >>
>> >>> Samar:
>> >>> If you include the error message while crashing using superlu_dist, I
>> >>> probably know the reason. (Better yet, include the printout before the
>> >>> crash.)
>> >>>
>> >>> Sherry
>> >>>
>> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
>> >>> Samar:
>> >>> There are limitations for direct solvers.
>> >>> Do not expect any solver to work on arbitrarily large problems.
>> >>> Since superlu_dist also crashes, direct solvers may not be able to
>> >>> work on your application.
>> >>> This is why I suggest increasing the size incrementally.
>> >>> You may have to experiment with other types of solvers.
>> >>>
>> >>> Hong
>> >>>
>> >>> Hi Hong and Jed,
>> >>>
>> >>> Many thanks for replying. It would indeed be nice if the error
>> >>> messages from MUMPS were less cryptic!
>> >>>
>> >>> 1) I have tried smaller matrices, although given how my problem is set
>> >>> up a jump is difficult to avoid. But a good idea
>> >>> that I will try.
>> >>>
>> >>> 2) I did try various orderings but not the one you suggested.
>> >>>
>> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
>> >>> termination of the program (there should be more
>> >>> error messages if, for example, memory was a problem). I therefore
>> >>> thought it might be an interface problem rather than
>> >>> one with MUMPS and turned to the petsc-users group first.
>> >>>
>> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to
>> >>> why), at which point I decided to try MUMPS. The fact that both
>> >>> crash would again indicate a common (memory?) problem.
>> >>>
>> >>> I'll try a few more things before asking the MUMPS developers.
>> >>>
>> >>> Thanks again for your help!
>> >>>
>> >>> Samar
>> >>>
>> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
>> >>>
>> >>>> Samar:
>> >>>> The crash occurs in
>> >>>> ...
>> >>>> [161]PETSC ERROR: Error in external library!
>> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization
>> >>>> phase: INFO(1)=-1, INFO(2)=48
>> >>>>
>> >>>> For a very large matrix, this is likely a memory problem, as you suspected.
>> >>>> I would suggest:
>> >>>> 1. Run problems with increasing sizes (do not jump from a small one to a
>> >>>> very large one) and observe memory usage using
>> >>>> '-ksp_view'.
>> >>>> I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage of
>> >>>> estimated workspace increase. Is it too large?
>> >>>> Anyway, this input should not cause the crash, I guess.
>> >>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7
>> >>>> <> (I usually use sequential ordering 2).
>> >>>> I see you use the parallel ordering -mat_mumps_icntl_29 2.
>> >>>> 3. Send a bug report to the MUMPS developers for their suggestions.
>> >>>>
>> >>>> 4. Try other direct solvers, e.g., superlu_dist.
>> >>>>
>> >>>> ...
>> >>>>
>> >>>> etc. etc. I can tell the above error has something to do with
>> >>>> processor 48 (INFO(2)) and so forth, but not the previous one.
>> >>>>
>> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
>> >>>> attached file. Any hints as to what could be giving this
>> >>>> error would be very much appreciated.
>> >>>>
>> >>>> I do not know how to interpret this output file. The MUMPS developers
>> >>>> would give you a better suggestion on it.
>> >>>> I would appreciate learning as well :-)
>> >>>>
>> >>>> Hong
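For reference, the options discussed in this thread combine roughly as follows on the command line. This is a sketch only: ./myapp and the task count are placeholders, the option names follow the PETSc version used in the thread (newer releases rename -pc_factor_mat_solver_package to -pc_factor_mat_solver_type), and the MUMPS workspace value of 50 is a guess rather than a recommendation.

    # superlu_dist with parallel symbolic factorization (the combination that worked above)
    mpiexec -n 128 ./myapp -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_package superlu_dist \
        -mat_superlu_dist_colperm PARMETIS \
        -mat_superlu_dist_parsymbfact 1 \
        -mat_superlu_dist_statprint

    # MUMPS alternative following Hong's suggestions: verbose output (ICNTL(4)=3),
    # sequential ordering 2 (ICNTL(7)=2), and a smaller workspace increase than
    # the 1000% used earlier (ICNTL(14); the MUMPS default is 20%)
    mpiexec -n 128 ./myapp -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_package mumps \
        -mat_mumps_icntl_4 3 -mat_mumps_icntl_7 2 -mat_mumps_icntl_14 50

If numerical pivoting is not needed, Sherry's options.RowPerm = NATURAL suggestion also has a runtime counterpart in the superlu_dist interface; check the exact spelling by running the application with -help.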
