Hi Sherry,

Thanks! I tried your suggestions and it worked!

For the record, I added these flags:

-mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1

Also, for completeness and since you asked:

size: 2346346 x 2346346
nnz:  60856894
unsymmetric

The hardware specs (http://www2.cisl.ucar.edu/resources/yellowstone/hardware) 
are: 2 GB/core, 32 GB/node (27 GB usable), 16 cores per node.
I've been running on 8 nodes (so 8 x 27 = 216 GB usable in total).
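
As a rough sanity check on those numbers, here is a minimal sketch in C of the
memory arithmetic. It assumes compressed sparse row (CSR) storage with 4-byte
indices and 8-byte real values; those sizes and the helper names (csr_bytes,
total_usable_gb, per_core_gb) are illustrative assumptions and are not taken
from PETSc or superlu_dist.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold an n x n sparse matrix with nnz nonzeros in CSR form:
   nnz values, nnz column indices, and n + 1 row pointers.
   index_size and scalar_size are in bytes (assumed 4 and 8 in the note below). */
uint64_t csr_bytes(uint64_t n, uint64_t nnz, uint64_t index_size, uint64_t scalar_size)
{
    return nnz * (scalar_size + index_size) + (n + 1) * index_size;
}

/* Aggregate usable memory across all nodes, in GB. */
double total_usable_gb(int nodes, double usable_gb_per_node)
{
    return nodes * usable_gb_per_node;
}

/* Usable memory per core when every core on a node runs one process, in GB. */
double per_core_gb(double usable_gb_per_node, int cores_per_node)
{
    assert(cores_per_node > 0);
    return usable_gb_per_node / cores_per_node;
}
```

With the figures above, csr_bytes(2346346, 60856894, 4, 8) returns 739668116
bytes (about 0.74 GB), total_usable_gb(8, 27.0) returns 216.0, and
per_core_gb(27.0, 16) returns 1.6875. So the matrix itself is small next to the
available memory; this estimate says nothing about fill-in in the LU factors,
which it does not model.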

Thanks again for your help!

Samar

On Feb 25, 2014, at 1:00 PM, "Xiaoye S. Li" <[email protected]> wrote:

> I didn't follow the discussion thread closely ... How large is your matrix 
> dimension, and number of nonzeros?
> How large is the memory per core (or per node)?  
> 
> The default setting in superlu_dist is to use serial symbolic factorization. 
> You can turn on parallel symbolic factorization by:
> 
> options.ParSymbFact = YES;
> options.ColPerm = PARMETIS;
> 
> Is your matrix symmetric?  If so, you need to give both the upper and lower 
> halves of matrix A to superlu, which doesn't exploit symmetry.
> 
> Do you know whether you need numerical pivoting?  If not, you can turn off 
> pivoting by:
> 
> options.RowPerm = NATURAL;
> 
> This avoids some other serial bottleneck.
> 
> All these options can be turned on in the petsc interface. Please check out 
> the syntax there.
> 
> 
> Sherry
> 
> 
> 
> On Tue, Feb 25, 2014 at 8:07 AM, Samar Khatiwala <[email protected]> 
> wrote:
> Hi Barry,
> 
> You're probably right. I note that the error occurs almost instantly and I've 
> tried increasing the number of CPUs
> (as many as ~1000 on Yellowstone) to no avail. I know this is a big problem 
> but I didn't think it was that big!
> 
> Sherry: Is there any way to write out more diagnostic info? E.g., how much 
> memory superlu thinks it needs/is attempting
> to allocate.
> 
> Thanks,
> 
> Samar
> 
> On Feb 25, 2014, at 10:57 AM, Barry Smith <[email protected]> wrote:
> >
> >>
> >> I tried superlu_dist again and it crashes even more quickly than MUMPS 
> >> with just the following error:
> >>
> >> ERROR: 0031-250  task 128: Killed
> >
> >   This is usually a symptom of running out of memory.
> >
> >>
> >> Absolutely nothing else is written out to either stderr or stdout. This is 
> >> with -mat_superlu_dist_statprint.
> >> The program works fine on a smaller matrix.
> >>
> >> This is the sequence of calls:
> >>
> >> KSPSetType(ksp,KSPPREONLY);
> >> PCSetType(pc,PCLU);
> >> PCFactorSetMatSolverPackage(pc,MATSOLVERSUPERLU_DIST);
> >> KSPSetFromOptions(ksp);
> >> PCSetFromOptions(pc);
> >> KSPSolve(ksp,b,x);
> >>
> >> All of these successfully return *except* the very last one to KSPSolve.
> >>
> >> Any help would be appreciated. Thanks!
> >>
> >> Samar
> >>
> >> On Feb 24, 2014, at 3:58 PM, Xiaoye S. Li <[email protected]> wrote:
> >>
> >>> Samar:
> >>> If you include the error message while crashing using superlu_dist, I 
> >>> probably know the reason.  (better yet, include the printout before the 
> >>> crash. )
> >>>
> >>> Sherry
> >>>
> >>>
> >>> On Mon, Feb 24, 2014 at 9:56 AM, Hong Zhang <[email protected]> wrote:
> >>> Samar :
> >>> There are limitations for direct solvers.
> >>> Do not expect that any solver can be used on arbitrarily large problems.
> >>> Since superlu_dist also crashes, direct solvers may not be able to work 
> >>> on your application.
> >>> This is why I suggested increasing the size incrementally.
> >>> You may have to experiment with other types of solvers.
> >>>
> >>> Hong
> >>>
> >>> Hi Hong and Jed,
> >>>
> >>> Many thanks for replying. It would indeed be nice if the error messages 
> >>> from MUMPS were less cryptic!
> >>>
> >>> 1) I have tried smaller matrices although given how my problem is set up 
> >>> a jump is difficult to avoid. But a good idea
> >>> that I will try.
> >>>
> >>> 2) I did try various orderings, but not the one you suggested.
> >>>
> >>> 3) Tracing the error through the MUMPS code suggests a rather abrupt 
> >>> termination of the program (there should be more
> >>> error messages if, for example, memory was a problem). I therefore 
> >>> thought it might be an interface problem rather than
> >>> one with mumps and turned to the petsc-users group first.
> >>>
> >>> 4) I've tried superlu_dist but it also crashes (also unclear as to why) 
> >>> at which point I decided to try mumps. The fact that both
> >>> crash would again indicate a common (memory?) problem.
> >>>
> >>> I'll try a few more things before asking the MUMPS developers.
> >>>
> >>> Thanks again for your help!
> >>>
> >>> Samar
> >>>
> >>> On Feb 24, 2014, at 11:47 AM, Hong Zhang <[email protected]> wrote:
> >>>
> >>>> Samar:
> >>>> The crash occurs in
> >>>> ...
> >>>> [161]PETSC ERROR: Error in external library!
> >>>> [161]PETSC ERROR: Error reported by MUMPS in numerical factorization 
> >>>> phase: INFO(1)=-1, INFO(2)=48
> >>>>
> >>>> For a very large matrix, this is likely a memory problem, as you suspected.
> >>>> I would suggest
> >>>> 1. run problems of gradually increasing size (do not jump from a small 
> >>>> one to a very large one) and observe memory usage using
> >>>> '-ksp_view'.
> >>>>   I see you use '-mat_mumps_icntl_14 1000', i.e., percentage of 
> >>>> estimated workspace increase. Is it too large?
> >>>>   Anyway, this input should not cause the crash, I guess.
> >>>> 2. experiment with different matrix orderings -mat_mumps_icntl_7 <> (I 
> >>>> usually use sequential ordering 2)
> >>>>    I see you use parallel ordering -mat_mumps_icntl_29 2.
> >>>> 3. send bug report to mumps developers for their suggestion.
> >>>>
> >>>> 4. try other direct solvers, e.g., superlu_dist.
> >>>>
> >>>> …
> >>>>
> >>>> etc etc. The above error I can tell has something to do with processor 
> >>>> 48 (INFO(2)) and so forth but not the previous one.
> >>>>
> >>>> The full output enabled with -mat_mumps_icntl_4 3 looks as in the 
> >>>> attached file. Any hints as to what could be giving this
> >>>> error would be very much appreciated.
> >>>>
> >>>> I do not know how to interpret this output file. The MUMPS developers 
> >>>> could give you better suggestions on it.
> >>>> I would appreciate learning as well :-)
> >>>>
> >>>> Hong
> >>>
> >>>
> >>>
> >>
> >
> 
> 
