Do the mpds start and exit properly when you do it this way? I've always
started them from within my job file -- I do something like:

#PBS -l nodes=4:ppn=4
...
mpdboot -n 4 -f $PBS_NODEFILE
mpiexec ...
mpdallexit

It's been a while since I've used an MPI with mpds, but I thought it just
needed one mpd per host (not one per processor), right? That's why I start
4 here...
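If it helps, here's roughly how I'd fold that into your script. This is a
rough, untested sketch, assuming mpich2's mpd tools are on every node's
PATH; the mpd_hosts.$PBS_JOBID temp-file name is just for illustration:

#PBS -l nodes=4:ppn=4

cd $PBS_O_WORKDIR

# $PBS_NODEFILE repeats each host once per slot (4x here with ppn=4), but
# mpdboot wants one mpd per host, so collapse it to unique hostnames first
sort -u $PBS_NODEFILE > mpd_hosts.$PBS_JOBID
NHOSTS=$(wc -l < mpd_hosts.$PBS_JOBID)

mpdboot -n $NHOSTS -f mpd_hosts.$PBS_JOBID
mpdtrace    # sanity check: should list all 4 hosts in the ring

# one rank per allocated slot (16 here), reusing your dock6.mpi line
NP=$(wc -l < $PBS_NODEFILE)
mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

mpdallexit    # tear the ring down when the job finishes
rm -f mpd_hosts.$PBS_JOBID

The sort -u is the part people usually trip over: with ppn=4 the nodefile
has 16 lines, and handing that straight to mpdboot can make it try to start
more than one mpd per host.

--Joe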
________________________________
From: [EMAIL PROTECTED] on behalf of Mary Ellen Fitzpatrick
Sent: Mon 11/3/2008 9:43 AM
To: Joseph Hargitai; [email protected]; Mary Ellen Fitzpatrick
Subject: Re: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes

My pbs script - snippet:

# Request 4 processors/node
#PBS -l nodes=4:ppn=4
# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP
mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

My output file lists "Number of processors is 16", which is what I request.

I start all of the mpds on all of the nodes from the head node with the
following command:

mpdboot -n 47 -f /etc/mpd.hosts

Should I be starting the mpd daemon from within my pbs script?

/etc/mpd.hosts is on every compute node and lists the following:

node1045:4
node1046:4
node1047:4
node1048:4

My $PBS_NODEFILE has the following:

node1045 np=4 lomem spartans
node1046 np=4 lomem spartans
node1047 np=4 lomem spartans
node1048 np=4 lomem spartans

Thanks
Mary Ellen

Joseph Hargitai wrote:
> What is in the pbs script? In most cases you need a -hostfile $PBS_NODEFILE
> entry; otherwise you get all processes piled on one node, i.e. the job does
> not know of any hosts other than the one it landed on.
>
>
> j
>
> ----- Original Message -----
> From: Mary Ellen Fitzpatrick <[EMAIL PROTECTED]>
> Date: Friday, October 31, 2008 11:45 am
> Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on
> multiple nodes
>
>
>> Hi,
>> Trying to figure out if this is a maui or mpi issue. I have a 48-node
>> (dual dual-core cpu) linux cluster with torque-2.3.3, maui-3.2.6p19,
>> and mpich2-1.07 installed. Not sure if I have maui configured
>> correctly. What I want to do is submit an mpi job that runs one
>> process per node, requests all 4 cores on the node, and runs this
>> across 4 nodes.
>>
>> If I request 1 node with 4 processors in my pbs script, then it works
>> fine: #PBS -l nodes=1:ppn=4. Everything runs on one node on 4 cpus,
>> and the mpi output says everything ran perfectly.
>>
>> If I request 4 nodes with 4 processors in my pbs script, then it
>> fails: #PBS -l nodes=4:ppn=4. My epilogue/prologue output file says
>> the job ran on 4 nodes and requested 16 processors.
>>
>> But my mpi output file says it crashed:
>> --snippet--
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 15 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 15: killed by signal 9
>> rank 13 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 13: killed by signal 9
>> rank 12 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 12: return code 0
>> --snippet--
>>
>> Maui.cfg pertinent info:
>> JOBPRIOACCRUALPOLICY ALWAYS # accrue priority as soon as job is submitted
>> JOBNODEMATCHPOLICY EXACTNODE
>> NODEALLOCATIONPOLICY MINRESOURCE
>> NODEACCESSPOLICY SHARED
>>
>> /var/spool/torque/server_priv/nodes file:
>> node1048 np=4
>> etc
>>
>> torque queue info:
>> set queue spartans queue_type = Execution
>> set queue spartans resources_default.neednodes = spartans
>> set queue spartans resources_default.nodes = 1
>> set queue spartans enabled = True
>> set queue spartans started = True
>>
>> Anyone know why my mpi job is crashing? Or whether this is a
>> maui/torque or mpi issue?
>>
>> --
>> Thanks
>> Mary Ellen
>

--
Thanks
Mary Ellen
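P.S. On Joseph's -hostfile point in the quoted thread: if you do keep the
ring running outside the job, I believe mpich2's mpd-based mpiexec spells
that option -machinefile. A rough, untested variant, reusing the same
dock6.mpi line from your script:

NP=$(wc -l < $PBS_NODEFILE)
# point mpiexec at the hosts torque actually allocated, so the ranks don't
# all pile onto one node of the already-running ring
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log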
_______________________________________________
mauiusers mailing list
[EMAIL PROTECTED]
http://www.supercluster.org/mailman/listinfo/mauiusers
