Hi,
Trying to figure out if this is a Maui or MPI issue. I have a 48-node
Linux cluster (dual dual-core CPUs, so 4 cores per node), with
torque-2.3.3, maui-3.2.6p19, and mpich2-1.0.7 installed. I'm not sure I
have Maui configured correctly. What I want to do is submit an MPI job
that runs across 4 nodes, using all 4 cores on each node (16 processes
in total).
If I request 1 node with 4 processors in my PBS script
(#PBS -l nodes=1:ppn=4), it works fine: everything runs on one node
across 4 CPUs, and the MPI output says everything ran perfectly.
If I request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it
fails. My epilogue/prologue output file says the job ran on 4 nodes and
requested 16 processors.
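For reference, my submit script looks roughly like the following (the
executable name is a placeholder, not my real program):

#!/bin/bash
#PBS -l nodes=4:ppn=4
#PBS -q spartans
cd $PBS_O_WORKDIR
# $PBS_NODEFILE has one line per allocated core (16 lines here);
# mpd wants one line per host, so uniquify it first
sort -u $PBS_NODEFILE > mpd.hosts
NNODES=$(wc -l < mpd.hosts)
NP=$(wc -l < $PBS_NODEFILE)
mpdboot -n $NNODES -f mpd.hosts
mpiexec -n $NP ./my_mpi_program
mpdallexit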
But my MPI output file shows the job crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
rank 13 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 13: killed by signal 9
rank 12 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 12: return code 0
--snippet--
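(In case it's relevant: the mpd ring can also be checked by hand,
outside of Torque, with mpich2's own tools; mpd.hosts here would just
list the four node names.)

mpdboot -n 4 -f mpd.hosts
mpdtrace                    # should list all four nodes in the ring
mpiexec -l -n 16 hostname   # should print each node's name 4 times
mpdallexit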
The pertinent maui.cfg settings:
JOBPRIOACCRUALPOLICY ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY EXACTNODE
NODEALLOCATIONPOLICY MINRESOURCE
NODEACCESSPOLICY SHARED
/var/spool/torque/server_priv/nodes file:
node1048 np=4
(etc., one np=4 line per node)
Torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True
Does anyone know why my MPI job is crashing, or whether this is a
Maui/Torque issue or an MPI issue?
--
Thanks
Mary Ellen