Hi Richard:

After you lambooted the machines, what does 'lamnodes' give you?
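If it helps to compare that listing against the `-np 6` you pass to mpirun, here is a minimal sketch that counts nodes and CPUs in lamnodes-style text. It assumes the usual `n<id> <host>:<cpus>:<flags>` layout; the function name `count_lam_cpus` and the sample listing are made up for illustration and are not your actual output.

```python
def count_lam_cpus(lamnodes_output):
    """Return (node_count, cpu_count) parsed from lamnodes-style text.

    Assumed line layout: "n<id>  <hostname>:<cpus>:<flags>".
    Lines that do not match this layout are skipped.
    """
    nodes = 0
    cpus = 0
    for line in lamnodes_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].startswith("n"):
            continue
        fields = parts[1].split(":")
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        nodes += 1
        cpus += int(fields[1])
    return nodes, cpus


# Hypothetical listing for a six-node universe with one CPU per node.
sample = "\n".join([
    "n0\tnode1:1:origin,this_node",
    "n1\tnode2:1:",
    "n2\tnode3:1:",
    "n3\tnode4:1:",
    "n4\tnode5:1:",
    "n5\tnode6:1:",
])

nodes, cpus = count_lam_cpus(sample)
print(f"{nodes} nodes, {cpus} CPUs")
```

If the node count comes back lower than six, the lamboot did not bring up the whole universe, whatever the verbose boot log says.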

Cheers,
Bernard

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Richard Toland
> Sent: Sunday, July 04, 2004 7:30
> To: [EMAIL PROTECTED]
> Subject: [Oscar-users] linpack failure after switch to RH9
>
> After switching my cluster to RH9 and oscar 3.0, I recompiled
> linpack and now xhpl will not run with mpirun -np 6 xhpl
>
> Here is what I have done:
>
> ran make -k check, all tests pass for lam7.0
>
> I cannot run lamboot from the head node (I want to execute
> here, but don't want it to be in the computing cluster) so
> the only thing I know to do is to ssh to the first node and
> run lamboot there
>
> Here is an excerpt from lamboot -b -v -d ./machines:
>
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.4:36988
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.4:36988
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.5:38039
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.5:38039
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.6:39286
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.6:39286
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.7:40636
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.7:40636
> n0<16610> ssi:boot:base:linear: finished
> n0<16610> ssi:boot:rsh: all RTE procs started
> n0<16610> ssi:boot:rsh: finalizing
> n0<16610> ssi:boot: Closing
>
> a manual run of mpirun -np 6 xhpl yields:
>
> [EMAIL PROTECTED] lavahead]$ mpirun -np 6 xhpl
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was
> started with mpirun did not invoke MPI_INIT before quitting
> (it is possible that more than one process did not invoke
> MPI_INIT -- mpirun was only notified of the first one, which
> was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs
> that invoke MPI_INIT and MPI_FINALIZE). You can use the
> "lamexec" program to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
> with RH7.1 and Oscar 2.1 I could qsub this in a pbs, start a
> lamboot from the head node, execute the mpirun, then lamhalt.
>
> help
>
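
For reference, the "Need at least 6 processes" message is printed when the largest P x Q grid requested in HPL.dat is bigger than the number of MPI processes xhpl actually sees. The sketch below works out that number from an HPL.dat fragment. It assumes the conventional trailing labels ("# of process grids (P x Q)", "Ps", "Qs"), which HPL itself treats as free-form comments; the function name `required_processes` and the sample fragment are hypothetical, not taken from Richard's file.

```python
def required_processes(hpl_dat_text):
    """Return the largest P*Q among the process grids listed in HPL.dat text.

    Assumes lines are labelled in the conventional way: the grid count
    line mentions "process grids", and the P and Q lines end in "Ps" / "Qs".
    """
    ngrids = None
    ps, qs = [], []
    for line in hpl_dat_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if "process grids" in line and tokens[0].isdigit():
            ngrids = int(tokens[0])
        elif tokens[-1] == "Ps":
            ps = [int(t) for t in tokens[:-1]]
        elif tokens[-1] == "Qs":
            qs = [int(t) for t in tokens[:-1]]
    # Only the first `ngrids` (P, Q) pairs count; extra values are ignored.
    grids = list(zip(ps, qs))[:ngrids]
    if not grids:
        raise ValueError("no Ps/Qs lines found")
    return max(p * q for p, q in grids)


# Hypothetical fragment describing a single 2 x 3 grid.
fragment = "\n".join([
    "1            # of process grids (P x Q)",
    "2            Ps",
    "3            Qs",
])

print(required_processes(fragment))
```

If this number matches the `-np` value and the error still appears, with every copy reporting itself as "process # 0", it may be worth checking that xhpl was linked against the same MPI installation that provides the mpirun being used.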
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
