On Wed, 30 Apr 2014 08:53:34 +0200 Jan Blechta <[email protected]> wrote:
> On Tue, 29 Apr 2014 22:55:13 +0200 "Garth N. Wells" <[email protected]> wrote:
> >
> > I've switched the default parallel LU solver back to MUMPS and set
> > MUMPS to use AMD ordering (anything other than METIS ...), which
> > seems to avoid MUMPS crashing when PETSc is configured with recent
> > METIS versions.
>
> We also suffered from segfaults in METIS called by MUMPS. As I
> remember, this has something to do with a library mismatch, because
> PETSc typically downloads its own METIS while DOLFIN is compiled
> against another. I will ask Jaroslav Hron, who solved the issue here,
> and let you know.

Ok, there are a few issues:

1. MUMPS segfaults with METIS 5.1. This is no longer an issue, as
   PETSc 3.3, 3.4 and master download METIS 5.0. See
   https://bitbucket.org/petsc/petsc/commits/1b7e3bd. Also, Dorsal
   configures PETSc with --download-metis=1, so a working METIS is
   picked.

2. There has been some mess with rpaths in PETSc since it switched
   from a make-based installer to a Python-based one. But this was
   reported to the PETSc team (on petsc-maint, so it is not publicly
   available) and assigned to Satish/Jed, so it will be fixed. As I
   understand the issue, the problem is basically that some rpaths
   pointing into the build dir instead of the install dir remain in
   libpetsc.so or other libraries compiled by PETSc. Here we do
   something like

     $ chrpath --delete $(PREFIX)/lib/libpetsc.so

   and then use LD_LIBRARY_PATH to set up runtime linking. Sure, this
   is not bullet-proof, especially when one has multiple libmetis.so
   (one downloaded by PETSc and one that DOLFIN links to).

3. MUMPS crashes with SCOTCH 6, see
   http://mumps.enseeiht.fr/index.php?page=faq#19. But in my
   experience, MUMPS does not automatically choose SCOTCH ordering.
   As a result, I think we don't need to pick the AMD ordering and can
   let MUMPS choose the best ordering at run-time. It is at least
   working on our system, but I'm not sure whether the workaround in
   2. above influences this.

Jan

>
> Jan
>
> >
> > Garth
> >
> > On 27 Mar 2014, at 11:52, Garth N. Wells <[email protected]> wrote:
> >
> > > On 26 Mar 2014, at 18:45, Jan Blechta <[email protected]> wrote:
> > >
> > >> On Wed, 26 Mar 2014 17:16:13 +0100 "Garth N. Wells" <[email protected]> wrote:
> > >>
> > >>> On 26 Mar 2014, at 16:56, Jan Blechta <[email protected]> wrote:
> > >>>
> > >>>> On Wed, 26 Mar 2014 16:29:11 +0100 "Garth N. Wells" <[email protected]> wrote:
> > >>>>
> > >>>>> On 26 Mar 2014, at 16:26, Jan Blechta <[email protected]> wrote:
> > >>>>>
> > >>>>>> On Wed, 26 Mar 2014 16:16:25 +0100 Johannes Ring <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> On Wed, Mar 26, 2014 at 1:39 PM, Jan Blechta <[email protected]> wrote:
> > >>>>>>>> As a follow-up of the 'Broken PETSc wrappers?' thread on
> > >>>>>>>> this list, can anyone reproduce an incorrect (orders of
> > >>>>>>>> magnitude) norm using superlu_dist on the following
> > >>>>>>>> example? Both in serial and parallel. Thanks,
> > >>>>>>>
> > >>>>>>> This is the result I got:
> > >>>>>>>
> > >>>>>>> Serial:
> > >>>>>>>
> > >>>>>>> L2 norm mumps 0.611356580181
> > >>>>>>> L2 norm superlu_dist 92.4733890983
> > >>>>>>>
> > >>>>>>> Parallel (2 processes):
> > >>>>>>>
> > >>>>>>> L2 norm mumps 0.611356580181
> > >>>>>>> L2 norm superlu_dist 220.027905995
> > >>>>>>> L2 norm mumps 0.611356580181
> > >>>>>>> L2 norm superlu_dist 220.027905995
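[The example script referred to above is not included in the archived
thread. As a stand-in, a minimal sketch of the kind of comparison
Johannes ran might look like this, assuming a plain Poisson problem
and 2014-era DOLFIN with PETSc, MUMPS and SuperLU_dist enabled; the
printed numbers will not match those quoted above.]

    # Hypothetical reconstruction; the original example is not in the
    # thread. Assumes DOLFIN's Python interface with PETSc, MUMPS and
    # SuperLU_dist available.
    from dolfin import *

    mesh = UnitSquareMesh(32, 32)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    f = Expression("sin(pi*x[0])*sin(pi*x[1])")
    a = inner(grad(u), grad(v))*dx
    L = f*v*dx
    bc = DirichletBC(V, 0.0, "on_boundary")

    # Solve the same system with both direct solvers; a discrepancy of
    # orders of magnitude means one of them returned garbage.
    for method in ("mumps", "superlu_dist"):
        w = Function(V)
        solve(a == L, w, bc,
              solver_parameters={"linear_solver": method})
        print("L2 norm %s %.12g" % (method, norm(w, "L2")))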
> > >>>>>>
> > >>>>>> superlu_dist results are obviously wrong. Do we have broken
> > >>>>>> installations or is there something wrong with the library?
> > >>>>>>
> > >>>>>> In the latter case I would suggest switching the default back
> > >>>>>> to MUMPS. (Additionally, MUMPS has Cholesky factorization!)
> > >>>>>> What was your motivation for switching to superlu_dist,
> > >>>>>> Garth?
> > >>>>>
> > >>>>> MUMPS often fails in parallel with global dofs, and there is
> > >>>>> no indication that the MUMPS developers are willing to fix
> > >>>>> bugs.
> > >>>>
> > >>>> I'm not sure what you mean by 'MUMPS fails'.
> > >>>
> > >>> Crashes.
> > >>>
> > >>>> I also observe that MUMPS sometimes fails because the size of
> > >>>> the work arrays estimated during symbolic factorization is not
> > >>>> sufficient for the actual numeric factorization with pivoting.
> > >>>> But this is hardly a bug.
> > >>>
> > >>> It has bugs with some versions of SCOTCH. We've been over this
> > >>> before. What you describe above indeed isn't a bug, but just
> > >>> poor software design in MUMPS.
> > >>>
> > >>>> It can be analyzed simply by increasing the verbosity
> > >>>>
> > >>>>   PETScOptions.set('mat_mumps_icntl_4', 3)
> > >>>>
> > >>>> and fixed by increasing the 'work array increase percentage'
> > >>>>
> > >>>>   PETScOptions.set('mat_mumps_icntl_14', 50)  # default=25
> > >>>>
> > >>>> or decreasing the pivoting threshold. I have a suspicion that
> > >>>> a frequent reason for this is using too-small partitions (too
> > >>>> many processes). (Users should also use Cholesky and
> > >>>> PD-Cholesky whenever possible. The numerics are much better
> > >>>> and more things are predictable in the analysis phase.)
> > >>>>
> > >>>> On the other hand, superlu_dist is computing rubbish without
> > >>>> any warning for me and Johannes. Can you reproduce it?
> > >>>
> > >>> I haven't had time to look. We should have unit testing for LU
> > >>> solvers. From memory I don't think we do.
> > >>
> > >> Ok, the fix is here: switch the column ordering
> > >>
> > >>   PETScOptions.set('mat_superlu_dist_colperm', col_ordering)
> > >>
> > >>   col_ordering    | properties
> > >>   --------------------------------------------------------
> > >>   NATURAL         | works, large fill-in
> > >>   MMD_AT_PLUS_A   | works, smallest fill-in (for this case)
> > >>   MMD_ATA         | works, reasonable fill-in
> > >>   METIS_AT_PLUS_A | computes rubbish (default on my system
> > >>                   | for this case)
> > >>   PARMETIS        | supported only in parallel, computes
> > >>                   | rubbish
> > >>
> > >> or the row ordering
> > >>
> > >>   PETScOptions.set('mat_superlu_dist_rowperm', row_ordering)
> > >>
> > >>   row_ordering | properties
> > >>   --------------------------------------------------------
> > >>   NATURAL      | works, good fill-in
> > >>   LargeDiag    | computes rubbish (default on my system for
> > >>                | this case)
> > >>
> > >> or both.
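[In script form, the workaround above amounts to something like the
following sketch, assuming DOLFIN's Python interface; MMD_AT_PLUS_A
and NATURAL are simply the orderings that behaved correctly in the
tables above, and availability can depend on how SuperLU_dist was
built.]

    from dolfin import PETScOptions

    # Avoid the misbehaving defaults reported above: use the column
    # ordering with the smallest fill-in among those that worked, and
    # fall back to the natural row ordering instead of LargeDiag.
    PETScOptions.set('mat_superlu_dist_colperm', 'MMD_AT_PLUS_A')
    PETScOptions.set('mat_superlu_dist_rowperm', 'NATURAL')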
> > >
> > > Good digging. Is there any way to know when superlu_dist is
> > > going to return garbage? It's concerning that it can silently
> > > return a solution that is way off.
> > >
> > > Garth
> > >
> > >> Jan
> > >>
> > >>> Garth
> > >>>
> > >>>> Jan
> > >>>>
> > >>>>> Garth
> > >>>>>
> > >>>>>> Jan
> > >>>>>>
> > >>>>>>> Johannes
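[For reference, the MUMPS controls discussed in the quoted thread,
collected into one snippet. This is a sketch assuming DOLFIN's Python
interface; the ICNTL(7) line is an added assumption: per the MUMPS
user guide, ICNTL(7)=0 selects the AMD ordering that Garth mentions
pinning at the top of the thread.]

    from dolfin import PETScOptions

    # Raise MUMPS verbosity to diagnose factorization failures
    # (ICNTL(4) controls the amount of printed diagnostics).
    PETScOptions.set('mat_mumps_icntl_4', 3)

    # Give the numeric factorization more headroom over the symbolic
    # estimate (ICNTL(14) is a percentage increase; the default is 25).
    PETScOptions.set('mat_mumps_icntl_14', 50)

    # Assumed here: pin the fill-reducing ordering to AMD
    # (ICNTL(7) = 0) instead of letting MUMPS choose automatically.
    PETScOptions.set('mat_mumps_icntl_7', 0)

[As with the SuperLU_dist options, these have to be set before the
solver object is first created so that PETSc picks them up.]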
