I apologize in advance for any naivete. I'm new to running largeish-scale FEM 
problems.

I have written a code intended to find the elastic properties of the unit cell 
of a microstructure rendered as a finite element mesh. [FWIW, basically I'm 
trying to solve equation (19) from Composites: Part A 32 (2001) 1291-1301.] The 
method of partitioning is kind of crude. The format of the file that the code 
reads in to determine element connectivity is the same as the format used by 
mpmetis. I run mpmetis on the element connectivity file, which produces two 
files indicating the partitioning of the nodes and elements. Each process then 
incrementally reads in the files used to define the mesh, using the output 
files from mpmetis to determine which nodes and elements to store in memory. 
(Yes, I've become aware that there are problems with multiple processes reading 
the same file, especially when the file is large.) For meshes with 50 x 50 x 50 
or 100 x 100 x 100 elements, the code seems to work reasonably well. The
100x100x100 mesh has run on a single 8-core node with 20 GB of RAM, for about 
125,000 elements per core. If I try a mesh with 400 x 400 x 400 elements, I 
start running into problems.
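
To make the partitioning scheme concrete, here is a stripped-down sketch of 
what each process does with the mpmetis output (file names like "mesh.txt" are 
placeholders, and the real code applies the same filtering to the connectivity 
and nodal coordinate files):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, nparts;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nparts);

  /* "mpmetis mesh.txt <nparts>" writes mesh.txt.epart.<nparts>: one line per
     element, giving the part (treated here as the MPI rank) that element was
     assigned to */
  char fname[256];
  snprintf(fname, sizeof fname, "mesh.txt.epart.%d", nparts);
  FILE *fp = fopen(fname, "r");          /* every rank reads the same file */
  if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);

  long e = 0, part, nlocal = 0;
  while (fscanf(fp, "%ld", &part) == 1) {
    if ((int)part == rank) {
      /* element e belongs to this rank; its index is recorded so the
         connectivity and coordinate files can be filtered the same way */
      nlocal++;
    }
    e++;
  }
  fclose(fp);
  fprintf(stderr, "rank %d keeps %ld of %ld elements\n", rank, nlocal, e);

  MPI_Finalize();
  return 0;
}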

On one cluster (the same cluster on which I ran the 100x100x100 mesh), the 
400x400x400 mesh wouldn't finish its run on 512 cores, which seems odd to me 
since (1) the number of elements per core is about the same as the case where 
the 100x100x100 mesh ran on 8 cores and (2) an earlier version of the code 
using Epetra from Trilinos did work on that many cores. This might just be an 
issue with me preallocating too many non-zeros and running out of memory, 
though I'm not sure why that wouldn't have been a problem for the 100x100x100 
run. On 512 cores, the code dies as it loops over the local elements to 
assemble its part of the global stiffness matrix. On 608 cores, the code dies 
differently. It finishes looping over the elements, but dies with "RETRY 
EXCEEDED" errors from OpenMPI. For 800 and 1024 cores, the code appears to work 
again. FYI, this cluster has 4x DDR InfiniBand interconnects. (I don't know 
what "4x DDR" means, but maybe someone else does.)

On a different--and newer--cluster, I get no joy with the 400x400x400 mesh at 
all. This cluster has 16 cores and 64 GB of RAM per node, and FDR-10 InfiniBand 
interconnects. For 512 and 608 cores, the code seems to die as it loops over 
the elements, while for 704, 800, and 912 cores, the code finishes its calls to 
MatAssemblyBegin(), but during the calls to MatAssemblyEnd(), I get thousands 
of warning messages from OpenMPI, saying "rdma_recv.c:578  MXM WARN  Many RX 
drops detected. Application performance may be affected". On this cluster, the 
Trilinos version of the code worked, even at 512 cores.

For a 500x500x500 mesh, I have no luck on either cluster with PETSc, and only 
one cluster (the one with 16 cores per node) seems to work with the Trilinos 
version of the code. (It was actually the failures with the 500x500x500 mesh 
that led me to rewrite the relevant parts of the code using PETSc. For the 
cluster with 8 cores per node, running a 500x500x500 mesh on 1024 cores, the 
code usually dies during the calls to MatAssemblyEnd(), spewing out "RETRY 
EXCEEDED" errors from OpenMPI. I have done one trial with 1504 cores on the 
8-core/node cluster, but it seems to have died before the code even starts. On 
the other cluster, I've tried cases with 1024 and 1504 cores, and during the 
calls to MatAssemblyEnd(), I get thousands of warning messages from OpenMPI, 
saying "rdma_recv.c:578  MXM WARN  Many RX drops detected. Application 
performance may be affected." (Interestingly enough, in the output from the 
Trilinos version of my code, running on 1024 cores, I get one warning from 
OpenMPI about the RX drops, but the code appears to have finished successfully 
and gotten reasonable results.)

I'm trying to make sense of what barriers I'm hitting here. If the main problem 
is, say, that the MatAssemblyXXX() calls entail a lot of communication, then 
why would increasing the number of cores solve the problem for the 400x400x400 
case? And why does increasing the number of cores not seem to help for the 
other cluster? Am I doing something naive here? (Or maybe the question should 
be: what naive things am I doing here?)
