That is what I thought. I am on the mpich mailing list as well and am getting
some feedback.
Thanks to all who responded.
Mary Ellen
Greenseid, Joseph M. wrote:
#PBS -l nodes=4:ppn=4 will request four nodes with four processors per node.
#PBS -l nodes=4:ppn=1 will request four nodes with one processor per node.
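for reference, a minimal submission script using the first form might look like
this (script and program names are placeholders, and I'm assuming mpich2's
mpd-based mpiexec here):

#!/bin/bash
#PBS -N mpi_test
#PBS -l nodes=4:ppn=4
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists one line per allocated processor (16 here),
# so this starts one MPI rank per allocated core
mpiexec -machinefile $PBS_NODEFILE -n 16 ./my_mpi_app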
the MPI problem is a separate issue...
--Joe
________________________________
From: [EMAIL PROTECTED] on behalf of Mary Ellen Fitzpatrick
Sent: Fri 10/31/2008 11:45 AM
To: [email protected]; Mary Ellen Fitzpatrick
Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
Hi,
Trying to figure out if this is a Maui or an MPI issue. I have a 48-node
Linux cluster (dual dual-core CPUs, so 4 cores per node) with torque-2.3.3,
maui-3.2.6p19, and mpich2-1.0.7 installed. I am not sure I have Maui
configured correctly. What I want to do is submit an MPI job that places one
process per node, with each process using all 4 cores of its node, across
4 nodes.
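To make that concrete, the launch I am aiming for is roughly this (a sketch
only; the app name is made up, and I am assuming mpd's mpiexec accepts a
machinefile):

#PBS -l nodes=4:ppn=4
cd $PBS_O_WORKDIR
# $PBS_NODEFILE repeats each host once per core, so collapse it to
# unique hostnames and start exactly one rank per node
sort -u $PBS_NODEFILE > hosts.$PBS_JOBID
mpiexec -machinefile hosts.$PBS_JOBID -n 4 ./my_threaded_app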
If I request 1 node with 4 processors in my PBS script (#PBS -l
nodes=1:ppn=4), it works fine: everything runs on one node across 4 CPUs,
and the MPI output says everything ran perfectly.
If I request 4 nodes with 4 processors (#PBS -l nodes=4:ppn=4), it fails. My
epilogue/prologue output file says the job ran on 4 nodes and requested 16
processors, but my MPI output file shows it crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
rank 13 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 13: killed by signal 9
rank 12 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 12: return code 0
--snippet--
Maui.cfg pertinent info:
JOBPRIOACCRUALPOLICY ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY EXACTNODE
NODEALLOCATIONPOLICY MINRESOURCE
NODEACCESSPOLICY SHARED
/var/spool/torque/server_priv/nodes file:
node1048 np=4
etc
torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True
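For what it's worth, I submit with something like this (the script name is a
placeholder):

qsub -q spartans myjob.pbs

where myjob.pbs contains the #PBS -l nodes=... line shown above.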
Does anyone know why my MPI job is crashing, or whether this is a Maui/Torque
issue or an MPI issue?
--
Thanks
Mary Ellen
--
Thanks
Mary Ellen
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers