On Tue, Feb 7, 2012 at 08:06, Derek Gaston <friedmud at gmail.com> wrote:
> Hello all,
>
> I'm running some largish finite element calculations at the moment (50
> million to 400 million DoFs on up to 10,000 processors) using a code based
> on PETSc (obviously!), and while most of the simulations are working well,
> every now and again I seem to run into a hang in the setup phase of the
> simulation.
>
> I've attached GDB several times and it seems to always be hanging
> in PetscLayoutSetUp() during matrix creation. Here is the top of a stack
> trace showing what I mean:
>
> #0  0x00002aac9d86cef2 in opal_progress () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libopen-pl.so.0
> #1  0x00002aac9d16a0c4 in ompi_request_default_wait_all () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
> #2  0x00002aac9d1da9ee in ompi_coll_tuned_sendrecv_actual () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
> #3  0x00002aac9d1e2716 in ompi_coll_tuned_allgather_intra_bruck () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
> #4  0x00002aac9d1db439 in ompi_coll_tuned_allgather_intra_dec_fixed () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
> #5  0x00002aac9d1827e6 in PMPI_Allgather () from
>     /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0
> #6  0x0000000000508184 in PetscLayoutSetUp ()
> #7  0x00000000005b9f39 in MatMPIAIJSetPreallocation_MPIAIJ ()
> #8  0x00000000005c1317 in MatCreateMPIAIJ ()

Are _all_ the processes making it here?

> As you can see, I'm currently using OpenMPI (even though I do have access
> to others) along with the Intel compiler (this is a mostly C++ code). This
> problem doesn't exhibit itself on any smaller problems (we run TONS of runs
> all the time in the 10,000-5,000,000 DoF range on 1-3000 procs) and only
> seems to come up on these larger runs.
>
> I'm starting to suspect that it's an OpenMPI issue. Has anyone seen
> anything like this before?
>
> Here are some specs for my current environment:
>
> PETSc 3.1-p8 (I know, I know....)
> OpenMPI 1.4.4
> Intel compilers 12.1.1
> Modified Red Hat with a 2.6.18 kernel
> QDR InfiniBand
>
> Thanks for any help!
>
> Derek
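One cheap way to check is to have every rank print a line immediately before the matrix creation call that hangs; any rank whose line never appears is the one that never reached the collective. A minimal sketch, assuming a PETSc 3.1-style MatCreateMPIAIJ() call (the function name, sizes, and the preallocation guess of 30 are placeholders, not code from this thread):

    #include <petscmat.h>
    #include <stdio.h>

    /* Hypothetical wrapper: n_local and N_global stand in for whatever
       the application actually passes to MatCreateMPIAIJ(). */
    PetscErrorCode CreateSystemMatrix(MPI_Comm comm, PetscInt n_local,
                                      PetscInt N_global, Mat *A)
    {
      PetscMPIInt    rank;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      MPI_Comm_rank(comm, &rank);

      /* Every rank should print this line; a rank that never does is the
         one stuck (or diverged) before the collective call below. */
      printf("[%d] entering MatCreateMPIAIJ\n", rank);
      fflush(stdout);

      ierr = MatCreateMPIAIJ(comm, n_local, n_local, N_global, N_global,
                             30, PETSC_NULL, 30, PETSC_NULL, A);CHKERRQ(ierr);

      printf("[%d] MatCreateMPIAIJ returned\n", rank);
      fflush(stdout);
      PetscFunctionReturn(0);
    }

If a rank is missing from that output, the hang is not in the Allgather itself but in whatever kept that process from reaching PetscLayoutSetUp() in the first place.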
