For superlu_dist, you can try:

    options.ReplaceTinyPivot = NO;    (I think default is YES)

and/or

    options.IterRefine = YES;

Sherry Li
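[Note: a minimal sketch of how these two settings might be passed through the PETSc runtime interface, following the -mat_superlu_dist_* naming pattern used elsewhere in this thread. The two -mat_superlu_dist_* option names below are assumptions, not quoted anywhere in the thread, so verify them against the -help output of your PETSc build before relying on them:

    -pc_type lu -pc_factor_mat_solver_package superlu_dist \
    -mat_superlu_dist_replacetinypivot 0 \
    -mat_superlu_dist_iterrefine 1
]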
On Sun, Mar 2, 2014 at 2:23 PM, Matt Landreman <[email protected]> wrote:

> Hi,
>
> I'm having some problems with my PETSc application similar to the ones discussed in this thread, so perhaps one of you can help. In my application I factorize a preconditioner matrix with mumps or superlu_dist, using this factorized preconditioner to accelerate gmres on a matrix that is denser than the preconditioner. I've been running on edison at nersc. My program works reliably for problem sizes below about 1 million x 1 million, but above this size, the factorization step fails in one of many possible ways, depending on the compiler, # of nodes, # of procs/node, etc:
>
> When I use superlu_dist, I get 1 of 2 failure modes:
> (1) the first step of KSP returns "0 KSP residual norm -nan" and ksp then returns KSPConvergedReason = -9, or
> (2) the factorization completes, but GMRES then converges excruciatingly slowly or not at all, even if I choose the "real" matrix to be identical to the preconditioner matrix so KSP ought to converge in 1 step (which it does for smaller matrices).
>
> For mumps, the factorization can fail in many different ways:
> (3) With the intel compiler I usually get "Caught signal number 11 SEGV: Segmentation Violation"
> (4) Sometimes with the intel compiler I get "Caught signal number 7 BUS: Bus Error"
> (5) With the gnu compiler I often get a bunch of lines like "problem with NIV2_FLOPS message  -5.9604644775390625E-008   0   -227464733.99999997"
> (6) Other times with gnu I get a mumps error with INFO(1)=-9 or INFO(1)=-17. The mumps documentation suggests I should increase icntl(14), but what is an appropriate value? 50? 10000?
> (7) With the Cray compiler I consistently get this cryptic error:
> Fatal error in PMPI_Test: Invalid MPI_Request, error stack:
> PMPI_Test(166): MPI_Test(request=0xb228dbf3c, flag=0x7ffffffe097c, status=0x7ffffffe0a00) failed
> PMPI_Test(121): Invalid MPI_Request
> _pmiu_daemon(SIGCHLD): [NID 02784] [c6-1c1s8n0] [Sun Mar 2 10:35:20 2014] PE RANK 0 exit signal Aborted
> [NID 02784] 2014-03-02 10:35:20 Apid 3374579: initiated application termination
> Application 3374579 exit codes: 134
>
> For linear systems smaller than around 1 million^2, my application is very robust, working consistently with both mumps & superlu_dist, working for a wide range of # of nodes and # of procs/node, and working with all 3 available compilers on edison (intel, gnu, cray).
>
> By the way, mumps failed for much smaller problems until I tried -mat_mumps_icntl_7 2 (inspired by your conversation last week). I tried all the other options for icntl(7), icntl(28), and icntl(29), finding icntl(7)=2 works best by far. I tried the flags that worked for Samar (-mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1) with superlu_dist, but they did not appear to change anything in my case.
>
> Can you recommend any other parameters of petsc, superlu_dist, or mumps that I should try changing? I don't care in the end whether I use superlu_dist or mumps.
>
> Thanks!
>
> Matt Landreman
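[Note on the icntl(14) question: ICNTL(14) is a percentage increase over the workspace MUMPS estimates during analysis (as Hong explains further down the thread), so icntl(14)=50 asks for 1.5x the estimate, while icntl(14)=10000 asks for roughly 100x the estimate and is likely to exceed available memory outright. A hedged first step, using the same flag already quoted in this thread but with more moderate values:

    -mat_mumps_icntl_14 50     (1.5x the estimated workspace)
    -mat_mumps_icntl_14 200    (3x the estimate, if INFO(1)=-9 or -17 errors persist)
]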
> On Tue, Feb 25, 2014 at 3:50 PM, Xiaoye S. Li <[email protected]> wrote:
>
>> Very good! Thanks for the update.
>> I guess you are using all 16 cores per node? Since superlu_dist currently is MPI-only, if you generate 16 MPI tasks, the serial symbolic factorization only has less than 2 GB of memory to work with.
>>
>> Sherry
>>
>> On Tue, Feb 25, 2014 at 12:22 PM, Samar Khatiwala <[email protected]> wrote:
>>
>>> Hi Sherry,
>>>
>>> Thanks! I tried your suggestions and it worked!
>>>
>>> For the record I added these flags: -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1
>>>
>>> Also, for completeness and since you asked:
>>>
>>> size: 2346346 x 2346346
>>> nnz: 60856894
>>> unsymmetric
>>>
>>> The hardware (http://www2.cisl.ucar.edu/resources/yellowstone/hardware) specs are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
>>> I've been running on 8 nodes (so 8 x 27 ~ 216 GB).
>>>
>>> Thanks again for your help!
>>>
>>> Samar
>>>
>>> On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <[email protected]> wrote:
>>>
>>> I didn't follow the discussion thread closely ... How large is your matrix dimension, and number of nonzeros?
>>> How large is the memory per core (or per node)?
>>>
>>> The default setting in superlu_dist is to use serial symbolic factorization. You can turn on parallel symbolic factorization by:
>>>
>>> options.ParSymbFact = YES;
>>> options.ColPerm = PARMETIS;
>>>
>>> Is your matrix symmetric? If so, you need to give both the upper and lower halves of matrix A to superlu, which doesn't exploit symmetry.
>>>
>>> Do you know whether you need numerical pivoting? If not, you can turn off pivoting by:
>>>
>>> options.RowPerm = NATURAL;
>>>
>>> This avoids some other serial bottleneck.
>>>
>>> All these options can be turned on in the petsc interface. Please check out the syntax there.
>>>
>>> Sherry
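[Note: the runtime equivalents of the first two settings are the flags Samar quotes above. Whether options.RowPerm = NATURAL maps to a -mat_superlu_dist_rowperm flag with that value is an assumption here, not something quoted in the thread, so the third line below should be checked against -help output before use:

    -mat_superlu_dist_colperm PARMETIS \
    -mat_superlu_dist_parsymbfact 1 \
    -mat_superlu_dist_rowperm NATURAL    (assumed option name/value; verify with -help)
]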
>>> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <[email protected]> wrote:
>>>
>>>> Hi Barry,
>>>>
>>>> You're probably right. I note that the error occurs almost instantly, and I've tried increasing the number of CPUs (as many as ~1000 on Yellowstone) to no avail. I know this is a big problem but I didn't think it was that big!
>>>>
>>>> Sherry: Is there any way to write out more diagnostic info? E.g., how much memory superlu thinks it needs/is attempting to allocate.
>>>>
>>>> Thanks,
>>>>
>>>> Samar
>>>>
>>>> On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:
>>>> >
>>>> >> I tried superlu_dist again and it crashes even more quickly than MUMPS with just the following error:
>>>> >>
>>>> >> ERROR: 0031-250 task 128: Killed
>>>> >
>>>> > This is usually a symptom of running out of memory.
>>>> >
>>>> >> Absolutely nothing else is written out to either stderr or stdout. This is with -mat_superlu_dist_statprint.
>>>> >> The program works fine on a smaller matrix.
>>>> >>
>>>> >> This is the sequence of calls:
>>>> >>
>>>> >> KSPSetType(ksp,KSPPREONLY);
>>>> >> PCSetType(pc,PCLU);
>>>> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
>>>> >> KSPSetFromOptions(ksp);
>>>> >> PCSetFromOptions(pc);
>>>> >> KSPSolve(ksp,b,x);
>>>> >>
>>>> >> All of these successfully return *except* the very last one to KSPSolve.
>>>> >>
>>>> >> Any help would be appreciated. Thanks!
>>>> >>
>>>> >> Samar
>>>> >>
>>>> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
>>>> >>
>>>> >>> Samar:
>>>> >>> If you include the error message while crashing using superlu_dist, I probably know the reason. (Better yet, include the printout before the crash.)
>>>> >>>
>>>> >>> Sherry
>>>> >>>
>>>> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
>>>> >>> Samar:
>>>> >>> There are limitations for direct solvers. Do not expect any solver to work on arbitrarily large problems.
>>>> >>> Since superlu_dist also crashes, direct solvers may not be able to work on your application. This is why I suggested increasing the size incrementally. You may have to experiment with other types of solvers.
>>>> >>>
>>>> >>> Hong
>>>> >>>
>>>> >>> Hi Hong and Jed,
>>>> >>>
>>>> >>> Many thanks for replying. It would indeed be nice if the error messages from MUMPS were less cryptic!
>>>> >>>
>>>> >>> 1) I have tried smaller matrices, although given how my problem is set up a jump is difficult to avoid. But a good idea that I will try.
>>>> >>>
>>>> >>> 2) I did try various orderings but not the one you suggested.
>>>> >>>
>>>> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt termination of the program (there should be more error messages if, for example, memory was a problem). I therefore thought it might be an interface problem rather than one with mumps and turned to the petsc-users group first.
>>>> >>>
>>>> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to why), at which point I decided to try mumps. The fact that both crash would again indicate a common (memory?) problem.
>>>> >>>
>>>> >>> I'll try a few more things before asking the MUMPS developers.
>>>> >>>
>>>> >>> Thanks again for your help!
>>>> >>>
>>>> >>> Samar
>>>> >>>
>>>> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
>>>> >>>
>>>> >>>> Samar:
>>>> >>>> The crash occurs in
>>>> >>>> ...
>>>> >>>> [161]PETSC ERROR: Error in external library!
>>>> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: INFO(1)=-1, INFO(2)=48
>>>> >>>>
>>>> >>>> For a very large matrix this is likely a memory problem, as you suspected. I would suggest:
>>>> >>>> 1. run problems with increasing sizes (do not jump from a small one to a very large one) and observe memory usage using '-ksp_view'.
>>>> >>>> I see you use '-mat_mumps_icntl_14 1000', i.e., the percentage of estimated workspace increase. Is it too large? Anyway, this input should not cause the crash, I guess.
>>>> >>>> 2. experiment with different matrix orderings, -mat_mumps_icntl_7 <> (I usually use sequential ordering 2). I see you use parallel ordering -mat_mumps_icntl_29 2.
>>>> >>>> 3. send a bug report to the mumps developers for their suggestion.
>>>> >>>> 4. try other direct solvers, e.g., superlu_dist.
>>>> >>>>
>>>> >>>> ...
>>>> >>>>
>>>> >>>> etc etc. The above error I can tell has something to do with processor 48 (INFO(2)) and so forth but not the previous one.
>>>> >>>>
>>>> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the attached file. Any hints as to what could be giving this error would be very much appreciated.
>>>> >>>>
>>>> >>>> I do not know how to interpret this output file. The mumps developers would give you better suggestions on it. I would appreciate to learn as well :-)
>>>> >>>>
>>>> >>>> Hong
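[Note: a hedged first-pass set of diagnostic flags, collected from the suggestions quoted in this thread, for anyone hitting the same wall. All of these are purely informational (they do not change the numerics), so they are safe to add while narrowing down whether the failure is a memory problem:

    -ksp_view                      (confirm which solver and options are actually in effect)
    -mat_mumps_icntl_4 3           (verbose MUMPS output, as used to produce the attached log)
    -mat_superlu_dist_statprint    (superlu_dist statistics, as used by Samar)
]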
