Anders Logg wrote:
> On Wed, Oct 07, 2009 at 05:39:05PM +0200, Patrick Riesen wrote:
>> hi,
>>
>> I caught up with dolfin 0.9.3 on my linux workstation. Install and
>> compile went fine, and running the demos in serial seems to be fine as
>> well. When I try to run the demos in parallel, however, I get errors
>> from openmpi as follows (this occurred when running any demo with
>> "mpirun -np xy ./demo" where xy is larger than 1; it did not occur
>> for -np 1):
>>
>> ------------
>> {process output.....}
>>
>> then suddenly
>>
>> [vierzack01:12050] *** An error occurred in MPI_Barrier
>> [vierzack01:12049] *** An error occurred in MPI_Barrier
>> [vierzack01:12049] *** on communicator MPI_COMM_WORLD
>> [vierzack01:12049] *** MPI_ERR_COMM: invalid communicator
>> [vierzack01:12049] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> [vierzack01:12050] *** on communicator MPI_COMM_WORLD
>> [vierzack01:12050] *** MPI_ERR_COMM: invalid communicator
>> [vierzack01:12050] *** MPI_ERRORS_ARE_FATAL (goodbye)
>> [vierzack01:12049] *** Process received signal ***
>> [vierzack01:12049] Signal: Segmentation fault (11)
>> [vierzack01:12049] Signal code: Address not mapped (1)
>> [vierzack01:12049] Failing at address: 0x4
>> [vierzack01:12050] *** Process received signal ***
>> [vierzack01:12050] Signal: Segmentation fault (11)
>> [vierzack01:12050] Signal code: Address not mapped (1)
>> [vierzack01:12050] Failing at address: 0x4
>> [vierzack01:12049] [ 0] /lib/libpthread.so.0 [0x7f0fd3be6410]
>> [vierzack01:12049] [ 1] /home/priesen/num/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x34) [0x7f0fd475c1d4]
>> [vierzack01:12049] [ 2] /home/priesen/num/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x11b) [0x7f0fd48a8b0b]
>> [vierzack01:12049] [ 3] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManager12finalize_mpiEv+0x35) [0x7f0fd7bbfb15]
>> [vierzack01:12049] [ 4] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManagerD1Ev+0xe) [0x7f0fd7bbfb2e]
>> [vierzack01:12049] [ 5] /lib/libc.so.6(__cxa_finalize+0x6c) [0x7f0fd39cee0c]
>> [vierzack01:12049] [ 6] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0 [0x7f0fd7aa65d3]
>> [vierzack01:12049] *** End of error message ***
>> [vierzack01:12050] [ 0] /lib/libpthread.so.0 [0x7fd707916410]
>> [vierzack01:12050] [ 1] /home/priesen/num/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x34) [0x7fd70848c1d4]
>> [vierzack01:12050] [ 2] /home/priesen/num/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x11b) [0x7fd7085d8b0b]
>> [vierzack01:12050] [ 3] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManager12finalize_mpiEv+0x35) [0x7fd70b8efb15]
>> [vierzack01:12050] [ 4] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManagerD1Ev+0xe) [0x7fd70b8efb2e]
>> [vierzack01:12050] [ 5] /lib/libc.so.6(__cxa_finalize+0x6c) [0x7fd7076fee0c]
>> [vierzack01:12050] [ 6] /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0 [0x7fd70b7d65d3]
>> [vierzack01:12050] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 12049 on node vierzack01 exited on signal 15 (Terminated).
>> 1 additional process aborted (not shown)
>> ------------
>>
>> Is this an openmpi error? Is there a specific version of openmpi
>> required for dolfin? Mine is 1.2.8, and it worked up to dolfin 0.9.2.
>
> No idea what goes wrong. My version of OpenMPI is 1.3.2-3ubuntu1.

Hi,

so I installed openmpi-1.3.3, and it is still the same problem.
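One thing I notice in the teardown of my first trace: the segfault
happens in dolfin::SubSystemsManager::finalize_mpi, reached from
__cxa_finalize, i.e. MPI is finalized from a static destructor after
mpirun has already started killing the job. If I understand the MPI
rules correctly, any MPI call made after (or during) finalization is
erroneous. A hypothetical standalone snippet of that class of error
(untested sketch, not dolfin's actual code):

------------
// after_finalize.cpp: an MPI call made once MPI_Finalize has run is
// erroneous; with the default MPI_ERRORS_ARE_FATAL handler, Open MPI
// aborts or crashes, much like the traces above.
#include <mpi.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();

  // From here on, any MPI call is invalid. In dolfin the analogous
  // call would come from a static destructor run by __cxa_finalize.
  MPI_Barrier(MPI_COMM_WORLD);
  return 0;
}
------------

So the MPI_Barrier abort and the crash at exit may be two symptoms of
the same shutdown-ordering problem, but that is only a guess.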
I tried to catch the error. Here is a backtrace, obtained by attaching
gdb via petsc, with dolfin built in debug mode:

#0  0x00007ff27dd7c07b in raise () from /lib/libc.so.6
#1  0x00007ff27dd7d84e in abort () from /lib/libc.so.6
#2  0x00007ff27f926ea8 in Petsc_MPI_AbortOnError (comm=0x7fff8a060448, flag=0x7fff8a060434) at init.c:142
#3  0x00007ff27ec44e0f in ompi_errhandler_invoke () from /home/priesen/num/openmpi-1.3.3/lib/libmpi.so.0
#4  0x00007ff281d83714 in ParMETIS_V3_PartMeshKway () from /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0
#5  0x00007ff281c8ba0b in dolfin::MeshPartitioning::compute_partition (cell_partition=@0x7fff8a0a8900, mesh_data=@0x7fff8a0a8970) at dolfin/mesh/MeshPartitioning.cpp:588
#6  0x00007ff281c8bc41 in dolfin::MeshPartitioning::partition (mesh=@0x7fff8a0a8b50, mesh_data=@0x7fff8a0a8970) at dolfin/mesh/MeshPartitioning.cpp:74
#7  0x00007ff281c6fb42 in Mesh (this=0x7fff8a0a8b50, filename=@0x7fff8a0a9cd0) at dolfin/mesh/Mesh.cpp:67
#8  0x0000000000429b60 in main ()

Frame 5 seems interesting, so:

(gdb) f 5
#5  0x00007ff281c8ba0b in dolfin::MeshPartitioning::compute_partition (cell_partition=@0x7fff8a0a8900, mesh_data=@0x7fff8a0a8970) at dolfin/mesh/MeshPartitioning.cpp:588
588         &edgecut, part, &(*comm));

and then the surrounding lines:

(gdb) l
583       // Call ParMETIS to partition mesh
584       ParMETIS_V3_PartMeshKway(elmdist, eptr, eind,
585                                elmwgt, &wgtflag, &numflag, &ncon,
586                                &ncommonnodes, &nparts,
587                                tpwgts, ubvec, options,
588                                &edgecut, part, &(*comm));
589       info("Partitioned mesh, edge cut is %d.", edgecut);
590
591       // Copy mesh_data
592       cell_partition.clear();

When I check the input arguments, there is elmwgt, which is a null
pointer:

(gdb) p elmwgt
$4 = (int *) 0x0
(gdb) p *elmwgt
Cannot access memory at address 0x0

From here I do not know how to proceed. Please tell me what else I
could check to determine what goes wrong, or maybe you already know
what it is.

regards,
patrick

> DOLFIN wasn't parallel before 0.9.3, so I'm not sure what you mean by
> it working up to 0.9.2.
>
> --
> Anders
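PS: about the elmwgt == 0x0 above: if I read the ParMETIS manual
correctly, elmwgt may be NULL when wgtflag is 0 (no element weights),
which I assume is what dolfin passes here (worth verifying with
"p wgtflag" in the same gdb session), so the null pointer by itself is
probably not the bug. To check my reading of the interface, here is an
untested toy sketch of the same call; the mesh and all parameter
values are made up for illustration (rank r owns the two triangles of
quad r in a strip of quads):

------------
// partmesh_sketch.cpp: hedged sketch of ParMETIS_V3_PartMeshKway as
// I understand the call in MeshPartitioning.cpp, with wgtflag = 0
// and elmwgt = NULL.
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <parmetis.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Elements [elmdist[r], elmdist[r+1]) live on rank r (two each)
  std::vector<idxtype> elmdist(size + 1);
  for (int i = 0; i <= size; ++i)
    elmdist[i] = 2*i;

  // The two local triangles, three vertex indices each
  idxtype eptr[3] = {0, 3, 6};
  idxtype eind[6] = {2*rank, 2*rank + 2, 2*rank + 3,
                     2*rank, 2*rank + 3, 2*rank + 1};

  int wgtflag = 0;            // no weights...
  idxtype* elmwgt = 0;        // ...so a null pointer should be legal
  int numflag = 0;            // C-style (0-based) numbering
  int ncon = 1;               // one balance constraint
  int ncommonnodes = 2;       // triangles sharing an edge are neighbours
  int nparts = size;
  std::vector<float> tpwgts(ncon*nparts, 1.0f/nparts);  // uniform parts
  float ubvec[1] = {1.05f};   // imbalance tolerance
  int options[3] = {0, 0, 0}; // options[0] = 0: ParMETIS defaults
  int edgecut = 0;
  idxtype part[2] = {0, 0};   // output: partition of the local elements
  MPI_Comm comm = MPI_COMM_WORLD;

  ParMETIS_V3_PartMeshKway(&elmdist[0], eptr, eind, elmwgt,
                           &wgtflag, &numflag, &ncon, &ncommonnodes,
                           &nparts, &tpwgts[0], ubvec, options,
                           &edgecut, part, &comm);

  std::printf("rank %d: part = {%d, %d}, edgecut = %d\n",
              rank, int(part[0]), int(part[1]), edgecut);

  MPI_Finalize();
  return 0;
}
------------

If that sketch is right, the arguments worth inspecting in frame 5 are
probably the communicator and the distribution arrays rather than
elmwgt itself.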