This might be somewhat off topic (I apologize in advance), but I'm hoping there's some collective knowledge here that can help me figure out a problem I'm having running the Linpack (xhpl) benchmark across my cluster.

In a nutshell, I am unable to get it to run successfully on more than two nodes.

Details follow, but I may have left out some information, and I'll be happy to clarify anything that's missing to help troubleshoot. At this point, I'm at a loss.

A bit about my setup:

    * OSCAR 4.2
    * Linux Distro: Fedora Core 3
    * Hardware type: x86 (4 nodes w/Pentium 4 HT)

I have compiled xhpl with the Goto BLAS library, and it runs fine on up to 2 nodes with 2 processes per node. As soon as I add a 3rd node, however, xhpl starts on all 3 nodes but never actually does any work.

Some more detail:

See below for the content of my HPL.dat file. Any parameter not listed in a scenario keeps the value shown in that file. I use a script to handle the lamboot and xhpl execution. That script looks like this:

===================
File run_linpack.sh
===================
#!/bin/bash
lamboot -v ../bhost.txt
mpirun -v -sf C xhpl
lamhalt -v ../bhost.txt
===================
End of file run_linpack.sh
===================
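
Since "mpirun C" in LAM starts one process per CPU listed in the boot schema, the process count in each scenario below comes straight from the bhost file. A minimal sketch for totalling it, assuming the usual one-host-per-line layout with an optional cpu=N key (the function name count_bhost_cpus is made up for illustration and is not part of LAM):

```shell
#!/bin/bash
# Hypothetical helper: total the CPUs declared in a LAM bhost file.
# Assumption: one host per line, optional "cpu=N" key, "#" starts a comment.
# A host line without cpu=N counts as one CPU.
count_bhost_cpus() {
    awk '
        { sub(/#.*/, "") }      # strip comments
        NF == 0 { next }        # skip blank lines
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            total += n
        }
        END { print total + 0 }
    ' "$1"
}
```

Running "count_bhost_cpus ../bhost.txt" before lamboot prints the number of xhpl processes to expect, which should be at least Ps x Qs from HPL.dat.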


Here is a listing of the scenarios (1-5: Successful, 6-9: Failure) I've tried leading up to the failure I'm experiencing:

===================
Scenario 1 - one node, one process
===================

Nodes included in bhost file: node001
Ps in HPL.dat: 1
Qs in HPL.dat: 1

Execution on node001: Success (lamboot successful, xhpl run successful)


===================
Scenario 2 - one node, two processes
===================

Nodes included in bhost file: node001, node001 (also tried "node001 cpu=2")
Ps in HPL.dat: 1
Qs in HPL.dat: 2

Execution on node001: Success (lamboot successful, xhpl run successful)


===================
Scenario 3 - two nodes, one process per node
===================

Nodes included in bhost file: node001, node002
Ps in HPL.dat: 1
Qs in HPL.dat: 2

Execution on node001: Success (lamboot successful, xhpl run successful)


===================
Scenario 4 - two nodes, three processes
===================

Nodes included in bhost file: node001 cpu=2, node002
Ps in HPL.dat: 1
Qs in HPL.dat: 3

Execution on node001: Success (lamboot successful, xhpl run successful)


===================
Scenario 5 - two nodes, two processes per node
===================

Nodes included in bhost file: node001 cpu=2, node002 cpu=2
Ps in HPL.dat: 1
Qs in HPL.dat: 4

Execution on node001: Success (lamboot successful, xhpl run successful)

Also tried:
Nodes included in bhost file: node001 cpu=2, node003 cpu=2
Execution on node001: Success (lamboot successful, xhpl run successful)

Also tried:
Nodes included in bhost file: node001 cpu=2, node004 cpu=2
Execution on node001: Success (lamboot successful, xhpl run successful)


===================
Scenario 6 - three nodes, one process per node
===================

Nodes included in bhost file: node001, node002, node003
Ps in HPL.dat: 1
Qs in HPL.dat: 3

Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.


===================
Scenario 7 - four nodes, one process per node
===================

Nodes included in bhost file: node001, node002, node003, node004
Ps in HPL.dat: 1
Qs in HPL.dat: 4

Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.


===================
Scenario 8 - three nodes, two processes per node
===================

Nodes included in bhost file: node001 cpu=2, node002 cpu=2, node003 cpu=2
Ps in HPL.dat: 1
Qs in HPL.dat: 6

Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.

Also tried:
Ps in HPL.dat: 2
Qs in HPL.dat: 3
Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.


===================
Scenario 9 - four nodes, two processes per node
===================

Nodes included in bhost file: node001 cpu=2, node002 cpu=2, node003 cpu=2, node004 cpu=2
Ps in HPL.dat: 1
Qs in HPL.dat: 8

Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.

Also tried:
Ps in HPL.dat: 2
Qs in HPL.dat: 4
Execution on node001: Failure (lamboot successful; xhpl starts on each node but never terminates and just sits idle, using no CPU). I can Ctrl-C out of mpirun, after which it terminates and lamhalt confirms a successful shutdown.





===================
File HPL.dat
===================
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
1000         Ns
1            # of NBs
80           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
8            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
60           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
===================
End of file HPL.dat
===================
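
HPL needs at least Ps x Qs MPI processes for a grid, so it can help to read that product straight out of the file before each run. A rough sketch, assuming the standard layout shown above (Ps on line 11, Qs on line 12) and a single process grid; the function name hpl_grid_size is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print P*Q for the first process grid in an HPL.dat file.
# Assumption: standard HPL.dat layout, with Ps on line 11 and Qs on line 12,
# and the value as the first whitespace-separated field on each line.
hpl_grid_size() {
    local p q
    p=$(awk 'NR == 11 { print $1 }' "$1")
    q=$(awk 'NR == 12 { print $1 }' "$1")
    echo $(( p * q ))
}
```

For the file as listed (Ps 1, Qs 1), "hpl_grid_size HPL.dat" prints 1. The result can then be compared against the number of CPUs listed in the bhost file.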



_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users