Hi,
Trying to figure out if this is a Maui or MPI issue. I have a 48-node Linux cluster (dual dual-core CPUs, so 4 cores per node) with torque-2.3.3, maui-3.2.6p19, and mpich2-1.0.7 installed. I'm not sure I have Maui configured correctly. What I want to do is submit an MPI job that runs across 4 nodes, using all 4 cores on each node (16 processes total).

If I request 1 node with 4 processors in my PBS script (#PBS -l nodes=1:ppn=4), it works fine: everything runs on one node across 4 CPUs, and the MPI output says everything ran perfectly.
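
For reference, the working single-node script is roughly this (trimmed; the program name is a placeholder and I've left out the mpd startup):

--script snippet--
#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -q spartans
cd $PBS_O_WORKDIR
# 4 ranks on the one allocated node
mpiexec -n 4 ./my_mpi_program
--script snippet--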

If I request 4 nodes with 4 processors (#PBS -l nodes=4:ppn=4), it fails. My prologue/epilogue output file says the job ran on 4 nodes and requested 16 processors, but my MPI output file says it crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29  node1047_40014   caused collective abort of all ranks
 exit status of rank 15: killed by signal 9
rank 13 in job 29  node1047_40014   caused collective abort of all ranks
 exit status of rank 13: killed by signal 9
rank 12 in job 29  node1047_40014   caused collective abort of all ranks
 exit status of rank 12: return code 0
--snippet--
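
For the 4-node case, the launch portion of my script looks roughly like this (the mpdboot details are from memory, so the exact flags may differ):

--script snippet--
#!/bin/bash
#PBS -l nodes=4:ppn=4
#PBS -q spartans
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)        # 16 entries with nodes=4:ppn=4
sort -u $PBS_NODEFILE > mpd.hosts  # one line per unique node (4 here)
mpdboot -n $(wc -l < mpd.hosts) -f mpd.hosts
mpiexec -n $NP ./my_mpi_program
mpdallexit                         # shut down the mpd ring
--script snippet--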

Pertinent info from maui.cfg:
JOBPRIOACCRUALPOLICY    ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY      EXACTNODE
NODEALLOCATIONPOLICY    MINRESOURCE
NODEACCESSPOLICY        SHARED

/var/spool/torque/server_priv/nodes file:
node1048 np=4
(and similarly for the other nodes)

Torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True

Does anyone know why my MPI job is crashing, or whether this is a Maui/Torque issue or an MPI issue?

--

Thanks
Mary Ellen

