Hi Richard:

After you lambooted the machines, what does 'lamnodes' give you?
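
To compare against whatever 'lamnodes' prints, a rough Python sketch like the one below pulls out the node list that lamboot says it sent (the "sending info: nX (host)" lines in your debug output). The function name nodes_from_lamboot_log and the regular expression are my own assumptions, based only on the excerpt you posted; nothing here is provided by LAM itself.

```python
import re

# Matches lines such as
#   "n0<16610> ssi:boot:base:server: sending info: n3 (node4)"
# Assumption: the format is the one seen in the excerpt quoted in this thread.
INFO_RE = re.compile(r"sending info: (n\d+) \(([^)]+)\)")


def nodes_from_lamboot_log(log_text):
    """Return {lam_node_id: hostname} for every node lamboot announced."""
    # Repeated announcements of the same node collapse into one entry.
    return dict(INFO_RE.findall(log_text))


sample = """\
n0<16610> ssi:boot:base:server: sending number of links (6)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
n0<16610> ssi:boot:base:server: sending info: n1 (node2)
n0<16610> ssi:boot:base:server: sending info: n0 (node1)
"""

for node_id, host in nodes_from_lamboot_log(sample).items():
    print(node_id, host)
```

If the set of nodes you get from the full log differs from what 'lamnodes' lists, that would be the first thing to look at.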

Cheers,

Bernard 

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Richard Toland
> Sent: Sunday, July 04, 2004 7:30
> To: [EMAIL PROTECTED]
> Subject: [Oscar-users] linpack failure after switch to RH9
> 
> After switching my cluster to RH9 and oscar 3.0, I recompiled 
> linpack and now xhpl will not run with mpirun -np 6 xhpl
>  
> Here is what I have done:
>  
> ran make -k check, all tests pass for lam7.0
>  
> I cannot run lamboot from the head node (I want to execute 
> here, but don't want it to be in the computing cluster), so 
> the only thing I know to do is to ssh to the first node and 
> run lamboot there. Here is an excerpt from lamboot -b -v -d ./machines:
>  
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.4:36988
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.4:36988
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.5:38039
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.5:38039
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.6:39286
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.6:39286
> n0<16610> ssi:boot:base:server: connecting to lamd at 192.168.113.7:40636
> n0<16610> ssi:boot:base:server: connected
> n0<16610> ssi:boot:base:server: sending number of links (6)
> n0<16610> ssi:boot:base:server: sending info: n0 (node1)
> n0<16610> ssi:boot:base:server: sending info: n1 (node2)
> n0<16610> ssi:boot:base:server: sending info: n2 (node3)
> n0<16610> ssi:boot:base:server: sending info: n3 (node4)
> n0<16610> ssi:boot:base:server: sending info: n4 (node5)
> n0<16610> ssi:boot:base:server: sending info: n5 (node6)
> n0<16610> ssi:boot:base:server: finished sending
> n0<16610> ssi:boot:base:server: disconnected from 192.168.113.7:40636
> n0<16610> ssi:boot:base:linear: finished
> n0<16610> ssi:boot:rsh: all RTE procs started
> n0<16610> ssi:boot:rsh: finalizing
> n0<16610> ssi:boot: Closing
> 
> a manual run of mpirun -np 6 xhpl yields:
>  
> [EMAIL PROTECTED] lavahead]$ mpirun -np 6 xhpl
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>  
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>  
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>  
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>  
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>  
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>  
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>  
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>  
> HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
> >>> Need at least 6 processes for these tests <<<
>  
> HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
> >>> Illegal input in file HPL.dat. Exiting ... <<<
>  
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was 
> started with mpirun did not invoke MPI_INIT before quitting 
> (it is possible that more than one process did not invoke 
> MPI_INIT -- mpirun was only notified of the first one, which 
> was on node n0).
>  
> mpirun can *only* be used with MPI programs (i.e., programs 
> that invoke MPI_INIT and MPI_FINALIZE).  You can use the 
> "lamexec" program to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
> With RH7.1 and OSCAR 2.1 I could qsub this in a PBS script: start a 
> lamboot from the head node, execute the mpirun, then lamhalt.
>  
> help
> 

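One thing that stands out in the HPL output above: the "Need at least 6 processes" error appears five times, and every copy is reported by process # 0. The sketch below (Python; the helper name error_reports_by_rank is hypothetical) tallies which rank each "HPL ERROR" line claims to come from. In a normal mpirun -np 6 run only one process holds rank 0, so repeated rank-0 reports suggest that each xhpl started as its own one-process job. One possible cause, which I have not verified on this cluster, is an xhpl binary linked against a different MPI than the LAM that mpirun belongs to.

```python
import re
from collections import Counter

# Matches lines like
#   "HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:"
# Assumption: the wording is exactly as in the output quoted in this thread.
ERR_RE = re.compile(r"HPL ERROR from\s+process # (\d+), on line (\d+)")


def error_reports_by_rank(output):
    """Count HPL error reports per (rank, HPL source line) pair."""
    return Counter((int(rank), int(line)) for rank, line in ERR_RE.findall(output))


sample = """\
HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<

HPL ERROR from process # 0, on line 621 of function HPL_pdinfo:
>>> Illegal input in file HPL.dat. Exiting ... <<<

HPL ERROR from process # 0, on line 419 of function HPL_pdinfo:
>>> Need at least 6 processes for these tests <<<
"""

counts = error_reports_by_rank(sample)
for (rank, line), n in sorted(counts.items()):
    print(f"rank {rank}, HPL_pdinfo line {line}: reported {n} time(s)")
```

If the same rank shows up more than once for the same source line, the processes most likely did not join a single MPI job, and it is worth checking which MPI xhpl was actually built against.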
