Re: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes: RESOLVED

Mary Ellen Fitzpatrick Mon, 03 Nov 2008 11:50:57 -0800

Yes, the mpds start and exit without issue when I start them from my headnode.

I was able to resolve my issue by adding the global mpiexec variable to my command.

From within my pbs script I was running the following, and it would give the rank abort error:

mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

I added the global mpiexec variable "-machinefile $PBS_NODEFILE" right after the call for mpiexec and it worked.

mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

My error (well one of them anyway :) ) was assuming that because I had the /etc/mpd.hosts file on each node with the node list:ppn info, it was being read. But apparently not. The pbs script prefers the info from the $PBS_NODEFILE instead.
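
For anyone finding this later, the working job script looks roughly like the sketch below. It reuses the names from this thread and assumes the mpd ring is already running. The DRY_RUN variable and the sample node file fallback are hypothetical additions (not part of my real script) so it can be tried outside a PBS job, and the redirect is written in the portable form of "&>".

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=4

# Hypothetical fallback: outside a PBS job, build a sample node file
# (assumed layout: one line per processor slot) and only print the command.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    for host in node1045 node1046 node1047 node1048; do
        for slot in 1 2 3 4; do echo "$host"; done
    done > "$PBS_NODEFILE"
    DRY_RUN=1
fi

# How many procs do I have? One line per slot, so count the lines.
NP=$(wc -l "$PBS_NODEFILE" | awk '{print $1}')
echo "Number of processors is $NP"

# Pass the node file explicitly so mpiexec uses the allocated hosts.
MPI_CMD="mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out"

if [ -n "${DRY_RUN:-}" ]; then
    echo "$MPI_CMD"                 # show the command without running it
else
    $MPI_CMD > dock.log 2>&1        # same effect as "&> dock.log" in bash
fi
```

Run outside of PBS, this only prints the processor count and the mpiexec command it would have used.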


Thanks to all who responded and I hope this info is helpful to others.
Mary Ellen



Greenseid, Joseph M. wrote:

do the mpds start and exit properly when you do it this way?  i've always 
started it from within my job file -- i do something like:

#PBS -l nodes=4:ppn=4

...
mpdboot -n 4 -f $PBS_NODEFILE
mpiexec ...
mpdallexit

it's been a while since i've used an MPI with mpds, but i thought it just needed one mpd per host (not one per processor), right? that's why i start 4 here...

--Joe
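
The "one mpd per host" count can be derived from the node file instead of being hard-coded. This is only a sketch: it assumes the node file lists each host once per requested processor, and NODEFILE, MPD_HOSTS and NHOSTS are hypothetical names, with a sample file standing in for $PBS_NODEFILE.

```shell
# Stand-in for $PBS_NODEFILE with nodes=2:ppn=4 (assumed layout:
# one line per processor slot, so each host is repeated).
NODEFILE=$(mktemp)
printf '%s\n' node1045 node1045 node1045 node1045 \
              node1046 node1046 node1046 node1046 > "$NODEFILE"

# Unique hosts: one mpd is needed per host, not per processor.
MPD_HOSTS=$(mktemp)
sort -u "$NODEFILE" > "$MPD_HOSTS"

NHOSTS=$(wc -l "$MPD_HOSTS" | awk '{print $1}')   # number of mpds to boot
NP=$(wc -l "$NODEFILE" | awk '{print $1}')        # number of MPI ranks

echo "hosts=$NHOSTS ranks=$NP"

# Inside a real job these values would feed the commands above:
#   mpdboot -n "$NHOSTS" -f "$MPD_HOSTS"
#   mpiexec ...
#   mpdallexit
```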


________________________________

From: [EMAIL PROTECTED] on behalf of Mary Ellen Fitzpatrick
Sent: Mon 11/3/2008 9:43 AM
To: Joseph Hargitai; [email protected]; Mary Ellen Fitzpatrick
Subject: Re: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes



My pbs script
-snippet
# Request 4 processor/node
#PBS -l nodes=4:ppn=4

# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP

mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

My output file lists "Number of processors is 16", which is what I requested.


I start all of the mpd on all of the nodes from the head node with the
following command:
mpdboot -n 47 -f /etc/mpd.hosts

Should I be starting the mpd daemon from within my pbs script?

/etc/mpd.hosts is on every compute node and lists the following:
node1045:4
node1046:4
node1047:4
node1048:4

My $PBS_NODEFILE has the following:
node1045 np=4 lomem spartans
node1046 np=4 lomem spartans
node1047 np=4 lomem spartans
node1048 np=4 lomem spartans

Thanks
Mary Ellen

Joseph Hargitai wrote:

What is in the pbs script? In most cases you need a -hostfile $PBS_NODEFILE  
entry, otherwise you get all processes piled on one node ie. the job does not 
know of other hosts than the one it landed on.
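
One way to check for that pile-up is to run hostname under mpiexec and count how many ranks report each host. A rough sketch: ranks_per_host is a hypothetical helper, the sample input simulates the one-node case, and the real invocation in the closing comment assumes your mpiexec can launch a plain executable such as hostname.

```shell
# Count how many lines (ranks) on stdin report each hostname.
ranks_per_host() {
    sort | uniq -c | awk '{print $2 ": " $1}'
}

# Simulated output of 4 ranks that all landed on the same node.
RESULT=$(printf '%s\n' node1045 node1045 node1045 node1045 | ranks_per_host)
echo "$RESULT"

# In a real job, something like:
#   mpiexec -n "$NP" hostname | ranks_per_host
```

If every rank shows up under a single host even though several nodes were requested, the job is not seeing the other hosts.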

j

----- Original Message -----
From: Mary Ellen Fitzpatrick <[EMAIL PROTECTED]>
Date: Friday, October 31, 2008 11:45 am
Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes

Hi,
Trying to figure out if this is a maui or mpi issue.  I have a 48 node
(dual dual-core cpus) linux cluster.  I have torque-2.3.3,
maui-3.2.6p19, mpich2-1.07 installed.  Not sure if I have maui
configured correctly.  What I want to do is submit an mpi job that
runs one process per node, requests all 4 cores on the node, and I
want to submit this one process to 4 nodes.

If I request in my pbs script 1 node with 4 processors, then it works
fine:  #PBS -l nodes=1:ppn=4, everything runs on one node 4 cpus, mpi
output says everything ran perfect.

If I request in my pbs script 4 nodes with 4 processors then it fails:
#PBS -l nodes=4:ppn=4, my epilogue/prologue output file says the job
ran on 4 nodes and requests 16 processors.

But my mpi output file says it crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 15: killed by signal 9
rank 13 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 13: killed by signal 9
rank 12 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 12: return code 0
--snippet--

Maui.cfg pertinent info:
JOBPRIOACCRUALPOLOCY    ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY      EXACTNODE
NODEALLOCATIONPOLICY    MINRESOURCE
NODEACCESSPOLICY        SHARED

/var/spool/torque/server_priv/nodes file
node1048 np=4
etc

torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True

Anyone know why my mpi job is crashing?  Or if this is a maui/torque
or mpi issue?

--

Thanks
Mary Ellen

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers

