Hi Barry,

You're probably right. I note that the error occurs almost instantly, and I've tried increasing the number of CPUs (as many as ~1000 on Yellowstone) to no avail. I know this is a big problem, but I didn't think it was that big!
Sherry: Is there any way to write out more diagnostic info? E.g., how much memory superlu thinks it needs/is attempting to allocate?

Thanks,
Samar

On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:

>
>> I tried superlu_dist again and it crashes even more quickly than MUMPS,
>> with just the following error:
>>
>> ERROR: 0031-250 task 128: Killed
>
>    This is usually a symptom of running out of memory.
>
>> Absolutely nothing else is written out to either stderr or stdout. This is
>> with -mat_superlu_dist_statprint.
>> The program works fine on a smaller matrix.
>>
>> This is the sequence of calls:
>>
>> KSPSetType(ksp,KSPPREONLY);
>> PCSetType(pc,PCLU);
>> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>> KSPSetFromOptions(ksp);
>> PCSetFromOptions(pc);
>> KSPSolve(ksp,b,x);
>>
>> All of these return successfully *except* the very last one, KSPSolve.
>>
>> Any help would be appreciated. Thanks!
>>
>> Samar
>>
>> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
>>
>>> Samar:
>>> If you include the error message printed when superlu_dist crashes, I
>>> can probably tell you the reason. (Better yet, include the printout
>>> before the crash.)
>>>
>>> Sherry
>>>
>>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
>>> Samar:
>>> There are limits to what direct solvers can do; do not expect any
>>> solver to work on arbitrarily large problems. Since superlu_dist also
>>> crashes, direct solvers may not be viable for your application.
>>> This is why I suggest increasing the size incrementally.
>>> You may have to experiment with other types of solvers.
>>>
>>> Hong
>>>
>>> Hi Hong and Jed,
>>>
>>> Many thanks for replying. It would indeed be nice if the error messages
>>> from MUMPS were less cryptic!
>>>
>>> 1) I have tried smaller matrices, although given how my problem is set
>>> up, a jump in size is difficult to avoid. But it is a good idea that I
>>> will try.
>>>
>>> 2) I did try various orderings, but not the one you suggested.
>>>
>>> 3) Tracing the error through the MUMPS code suggests a rather abrupt
>>> termination of the program (there should be more error messages if,
>>> for example, memory were the problem). I therefore thought it might be
>>> an interface problem rather than one with MUMPS, and turned to the
>>> petsc-users group first.
>>>
>>> 4) I've tried superlu_dist, but it also crashes (also unclear as to
>>> why), at which point I decided to try MUMPS. The fact that both crash
>>> would again indicate a common (memory?) problem.
>>>
>>> I'll try a few more things before asking the MUMPS developers.
>>>
>>> Thanks again for your help!
>>>
>>> Samar
>>>
>>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
>>>
>>>> Samar:
>>>> The crash occurs in
>>>> ...
>>>> [161]PETSC ERROR: Error in external library!
>>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization
>>>> phase: INFO(1)=-1, INFO(2)=48
>>>>
>>>> For a very large matrix, this is likely a memory problem, as you
>>>> suspected. I would suggest:
>>>>
>>>> 1. Run problems of increasing size (do not jump from a small one to a
>>>> very large one) and observe memory usage using '-ksp_view'.
>>>> I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage by
>>>> which the estimated workspace is increased. Is it too large? Anyway,
>>>> I would not expect this input to cause the crash.
>>>>
>>>> 2. Experiment with different matrix orderings via -mat_mumps_icntl_7
>>>> <> (I usually use sequential ordering 2).
>>>> I see you use the parallel ordering -mat_mumps_icntl_29 2.
>>>>
>>>> 3. Send a bug report to the MUMPS developers for their suggestions.
>>>>
>>>> 4. Try other direct solvers, e.g., superlu_dist.
>>>>
>>>> …
>>>>
>>>> etc., etc. I can tell the above error has something to do with
>>>> processor 48 (INFO(2)) and so forth, but not the previous one.
>>>>
>>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the
>>>> attached file. Any hints as to what could be giving this error would
>>>> be very much appreciated.
>>>>
>>>> I do not know how to interpret this output file. The MUMPS developers
>>>> would give you better suggestions on it. I would appreciate learning
>>>> as well :-)
>>>>
>>>> Hong
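A note on Samar's diagnostic question at the top of the thread, for the MUMPS side: the analysis phase reports its memory estimates in INFOG(16) (MB needed on the most memory-hungry process) and INFOG(17) (MB summed over all processes), and PETSc versions from around this era onward expose these through MatMumpsGetInfog(). Below is a minimal, untested sketch that uses the low-level factorization interface so the estimates can be printed before the numeric phase (the one that dies) allocates anything. The variable names are illustrative, A is assumed to be an already-assembled parallel matrix, and error checking (ierr/CHKERRQ) is omitted for brevity:

    Mat           F;
    MatFactorInfo info;
    PetscInt      mb_proc, mb_total;

    /* Factor in two explicit phases: analysis first, numeric second,
       instead of letting KSPSolve() do both at once. */
    MatGetFactor(A, MATSOLVERMUMPS, MAT_FACTOR_LU, &F);
    MatFactorInfoInitialize(&info);
    MatLUFactorSymbolic(F, A, NULL, NULL, &info);  /* analysis only */

    /* Query MUMPS's estimates before the big allocations happen. */
    MatMumpsGetInfog(F, 16, &mb_proc);   /* INFOG(16): est. MB, max per process */
    MatMumpsGetInfog(F, 17, &mb_total);  /* INFOG(17): est. MB, all processes */
    PetscPrintf(PETSC_COMM_WORLD,
                "MUMPS estimates %d MB per process (max), %d MB total\n",
                (int)mb_proc, (int)mb_total);

    MatLUFactorNumeric(F, A, &info);     /* the phase that runs out of memory */

If the numeric phase is what gets killed, comparing INFOG(16) against the per-node memory on Yellowstone should at least show whether the estimate already exceeds what is available.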
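The options Hong mentions can also be combined on the command line without any code changes. A hypothetical invocation (the binary name and process count are made up; option names are as in the PETSc interface of this era, where -pc_factor_mat_solver_package selects the external package):

    mpiexec -n 128 ./myapp -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_package mumps \
        -mat_mumps_icntl_4 3 -mat_mumps_icntl_7 2 -mat_mumps_icntl_14 50 \
        -ksp_view

Here -mat_mumps_icntl_4 3 raises MUMPS's output verbosity (as used for the attached file above), -mat_mumps_icntl_7 2 selects the sequential ordering Hong suggests, and -mat_mumps_icntl_14 50 allows a 50% workspace increase over the estimate.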
