Joshua, I'm willing to bet this is an issue with the mpp solver, and good luck trying to get any help from LSTC. We have run into so many quirks with the mpp solver. Most of our work involves FE models.
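One thing worth checking, given that the timeouts hit the *public* address: whatever the launch node's hostname resolves to is probably what the solver hands out when it opens its extra sockets past 12 processes. Here's a minimal sketch of a resolution check, assuming Python is available on the nodes (the script name and the interpretation are mine, not anything from LSTC):

    # resolve_check.py -- run on each node in the lamboot session.
    # If the hostname resolves to the public interface, anything that
    # opens sockets by hostname will try the public network first.
    import socket

    name = socket.gethostname()
    print("hostname:", name)
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # for AF_INET the sockaddr is (ip_address, port).
    for family, _, _, _, sockaddr in socket.getaddrinfo(name, None):
        if family == socket.AF_INET:
            print("resolves to:", sockaddr[0])

If that prints the public address, reordering /etc/hosts so the private address comes first (or booting LAM from a hostfile that lists the private names) would be worth a try before chasing LSTC.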
Mark McCardell
Computer Systems Engineer
Center for Applied Biomechanics - University of Virginia
www.CenterForAppliedBiomechanics.org

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Joshua Baker-LePain
Sent: Wednesday, March 14, 2007 9:33 AM
To: [email protected]
Subject: [Beowulf] Weird problem with mpp-dyna

I have a user trying to run a coupled structural-thermal analysis using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is centos-4 on x86_64 hardware. We use our cluster largely as a COW, so all the cluster nodes have both public and private network interfaces. All MPI traffic is passed on the private network.

Running a simulation via 'mpirun -np 12' works just fine. Running the same sim (on the same virtual machine, even, i.e. in the same 'lamboot' session) with -np > 12 leads to the following output:

 Performing Decomposition -- Phase 3   03/12/2007 11:47:53
 *** Error the number of solid elements 13881 defined on the thermal
     generation control card is greater than the total number of
     solids in the model 12984
 *** Error the number of solid elements 13929 defined on the thermal
     generation control card is greater than the total number of
     solids in the model 12985
 connect to address $ADDRESS: Connection timed out
 connect to address $ADDRESS: Connection timed out

where $ADDRESS is the IP address of the *public* interface of the node on which the job was launched.

Has anybody seen anything like this? Any ideas on why it would fail above a specific number of CPUs? Note that the failure depends on CPU count, not node count: I've tried clusters made of both dual-CPU machines and quad-CPU machines, and in both cases it took 13 CPUs to trigger the failure. Note also that I *do* have a user writing his own MPI code, and he has no issues running on >12 CPUs.

Thanks.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
