After switching my cluster to RH9 and OSCAR 3.0, I recompiled linpack, and now xhpl will not run with mpirun -np 6 xhpl.
 
Here is what I have done:
 
I ran make -k check, and all tests pass for LAM 7.0.
 
I cannot run lamboot from the head node (I want to launch jobs from there, but I don't want the head node itself to be part of the compute cluster), so the only thing I know to do is to ssh to the first node and run lamboot there. Here is an excerpt from lamboot -b -v -d ./machines:
 
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.4:36988
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.4:36988
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.5:38039
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.5:38039
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.6:39286
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.6:39286
n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.7:40636
n0<16610> ssi:boot:base:server: connected
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n2 (node3)
n0<16610> ssi:boot:base:server: sending info: n3 (node4)
n0<16610> ssi:boot:base:server: sending info: n4 (node5)
n0<16610> ssi:boot:base:server: sending info: n5 (node6)
n0<16610> ssi:boot:base:server: finished sending
n0<16610> ssi:boot:base:server: disconnected from 192.168.113.7:40636
n0<16610> ssi:boot:base:linear: finished
n0<16610> ssi:boot:rsh: all RTE procs started
n0<16610> ssi:boot:rsh: finalizing
n0<16610> ssi:boot: Closing

A manual run of mpirun -np 6 xhpl yields:
 
[EMAIL PROTECTED] lavahead]$ mpirun -np 6 xhpl
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
 
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
 
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
 
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
 
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
 
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
 
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
 
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
 
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
 
HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<
 
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
 
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

With RH7.1 and OSCAR 2.1, I could qsub this as a PBS job: start a lamboot from the head node, execute the mpirun, then lamhalt.
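
For illustration, that workflow can be sketched as a job script along these lines. This is a rough sketch, not the exact script from the old setup: the file name hpl.pbs, the job name, and the nodes=6 resource request are assumptions made for the example, and it relies on the usual PBS variables PBS_O_WORKDIR and PBS_NODEFILE being set inside the job.

```shell
# Write a hypothetical PBS job script that boots LAM, runs xhpl, and halts LAM.
# The file name, job name, and resource request are illustrative assumptions.
cat > hpl.pbs <<'EOF'
#!/bin/sh
#PBS -N xhpl
#PBS -l nodes=6

# Run from the directory the job was submitted from (where HPL.dat lives).
cd "$PBS_O_WORKDIR"

# Boot the LAM run-time environment on the nodes PBS allocated to this job.
lamboot -v "$PBS_NODEFILE"

# Run the benchmark across the booted nodes.
mpirun -np 6 xhpl

# Shut the LAM run-time environment down again.
lamhalt
EOF

# Submit it with: qsub hpl.pbs
```

The heredoc delimiter is quoted so that the PBS variables are expanded when the job runs, not when the file is written.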
 
help
