|
After switching my cluster to RH9 and OSCAR 3.0, I
recompiled linpack, and now xhpl will not run with mpirun -np 6 xhpl.
Here is what I have done:
ran make -k check; all tests pass for LAM 7.0.
I cannot run lamboot from the head node (I want to
execute here, but don't want it to be in the computing cluster), so the only
thing I know to do is to ssh to the first node and run lamboot there. Here
is an excerpt from lamboot -b -v -d ./machines:
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.4:36988
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.4:36988
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.5:38039
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.5:38039
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.6:39286
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.6:39286
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.7:40636
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.7:40636
n0<16610> ssi:boot:base:linear: finished
n0<16610> ssi:boot:rsh: all RTE procs started
n0<16610> ssi:boot:rsh: finalizing
n0<16610> ssi:boot: Closing

A manual run of mpirun -np 6 xhpl yields:
[EMAIL PROTECTED] lavahead]$ mpirun -np 6 xhpl
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

With RH7.1 and OSCAR 2.1 I could qsub this in a PBS script, start a
lamboot from the head node, execute the mpirun, then lamhalt. Help!
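
For reference, the qsub / lamboot / mpirun / lamhalt sequence mentioned above could look roughly like the sketch below. This is not the original job script: the job name, the nodes=6 request, the hpl_job and run helpers, and the DRY_RUN switch are all made up for illustration. It assumes PBS sets PBS_NODEFILE and PBS_O_WORKDIR for the job, and falls back to ./machines otherwise. With DRY_RUN left at its default of 1 it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a PBS job script: boot LAM, run xhpl, halt LAM.
# Job name and resource request are assumptions, not from the original post.
#PBS -N xhpl
#PBS -l nodes=6

# run: hypothetical helper. Prints the command when DRY_RUN is 1 (the
# default here), executes it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# hpl_job: hypothetical wrapper around the three LAM steps.
hpl_job() {
    # xhpl reads HPL.dat from the current directory, so start where
    # qsub was invoked (PBS_O_WORKDIR is set by PBS inside a job).
    cd "${PBS_O_WORKDIR:-.}" || return 1

    # Use the node list PBS hands out, or ./machines outside of PBS.
    hostfile="${PBS_NODEFILE:-./machines}"

    run lamboot "$hostfile" || return 1
    run mpirun -np 6 xhpl
    status=$?

    # Always shut the LAM daemons down, even if xhpl failed.
    run lamhalt
    return "$status"
}

hpl_job
```

Setting DRY_RUN=0 near the top of the script and submitting it with qsub would run the real commands. Keeping lamhalt after the mpirun regardless of its exit status avoids leaving lamd processes behind on the nodes.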
|
- RE: [Oscar-users] linpack failure after switch to RH9 Richard Toland
