On Wed, 30 Jan 2019, Li Luo wrote:

> I am using libMesh for large scale parallelization. To enable the usage of
> 65536 processor cores, the options
> --with-dof-id-bytes=8 --with-processor-id-bytes=4
> --with-subdomain-id-bytes=4
> are already used for configuration.
>
> However, the code 'sticks' in the following function:
> Parallel::Sort<Hilbert::HilbertIndices> sorter (communicator,
> sorted_hilbert_keys);
> sorter.sort();

So, this looks horribly suspicious.  In parallel_sort.h:52, where it
defaults "IdxType=unsigned int", would you try "IdxType=dof_id_type"
instead?  That might be a red herring (the problem here would be
sorting at least 2^32 objects, not sorting them on at least 2^16
processors) but it sure looks like a bug to me and there's at least a
chance it's the bug affecting you.
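
To illustrate the concern above, here is a minimal standalone sketch. It is not libMesh code: `global_count` and `example_counts` are hypothetical helpers, and `std::uint32_t` / `std::uint64_t` stand in for a typical `unsigned int` and an 8-byte `dof_id_type`. It only shows how a 32-bit index type wraps once the total number of objects reaches 2^32, regardless of how many processors hold them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not libMesh API): accumulate per-processor object
// counts into a global count stored in IdxType, the way a parallel sort's
// bookkeeping might.
template <typename IdxType>
IdxType global_count(const std::vector<std::uint64_t> & per_rank_counts)
{
  assert(!per_rank_counts.empty());
  IdxType total = 0;
  for (std::uint64_t n : per_rank_counts)
    total += static_cast<IdxType>(n);  // unsigned arithmetic wraps modulo 2^bits
  return total;
}

// Assumed example data: 65536 ranks with 65536 objects each,
// i.e. 2^16 * 2^16 = 2^32 objects in total.
inline std::vector<std::uint64_t> example_counts()
{
  return std::vector<std::uint64_t>(65536, 65536);
}
```

With these assumed counts, `global_count<std::uint32_t>(example_counts())` wraps around to 0, while `global_count<std::uint64_t>(example_counts())` gives 4294967296.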

> in the routine MeshCommunication::find_global_indices (in
> file mesh_communication_global_indices.C), which is called from routine
> Partitioner::partition_unpartitioned_elem (in file partitioner.C).

> Since libMesh calls libHilbert for this sort function, is there anything
> that should be noted about the configuration of libHilbert when using
> large scale parallelization?

Quite possibly.  We currently have libHilbert set to use 32-bit
integers internally.  That should be fine in theory (coordinates get
identified by triples of integers, and if you're using unique_id then
that disambiguates any contiguous nodes).  But cranking that up to 64
in contrib/libHilbert/include/Hilbert/FixBitVec.hpp would be what I'd
suggest as Plan B.
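
As a rough illustration of what integer width means for coordinate keys, here is a standalone sketch. It is not libHilbert's actual algorithm or API: `quantize` is a hypothetical helper that maps one coordinate within an assumed bounding interval onto a `bits`-wide integer, which is one plausible way each entry of an integer triple could be formed. It only shows that two distinct but very close coordinates can share a 32-bit value and separate again at a wider width.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not libHilbert API): map x in [lo, hi] onto the
// integer range [0, 2^bits - 1]. Limited to bits <= 52 so that the scale
// factor is exactly representable in a double.
inline std::uint64_t quantize(double x, double lo, double hi, unsigned bits)
{
  assert(hi > lo && bits >= 1 && bits <= 52);
  double t = (x - lo) / (hi - lo);
  if (t < 0.0) t = 0.0;  // clamp points outside the interval
  if (t > 1.0) t = 1.0;
  const double scale = std::ldexp(1.0, static_cast<int>(bits)) - 1.0;
  return static_cast<std::uint64_t>(std::floor(t * scale));
}
```

On the unit interval, `quantize(0.5, 0.0, 1.0, 32)` and `quantize(0.5 + 1e-12, 0.0, 1.0, 32)` give the same integer, whereas the same two calls with `bits = 48` give different ones.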

> Is it possible not to use libHilbert? If so, does efficiency
> degrade?

You lose the ability to do N->M restarts with xdr/xda EquationSystems
output, and you lose compatibility of xdr/xda EquationSystems output
between libMesh builds with and without libHilbert...  No efficiency
loss, though, and I don't think either of those features can scale up
to your processor count anyway, so trying with libHilbert disabled
(it's a configure option) should be your Plan C.

There might also be an inadvertent libHilbert dependency somewhere
when using "slit meshes" or anything else that gives multiple
topologically distinct nodes the exact same geometric coordinates - I
don't think this is the case but it's a possible bug to watch out for.

Thanks for the bug report, and please keep us up to speed with what
works or fails to fix it!
---
Roy


_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users
