Ralph,

The mods may have been done by the staff at PSC rather than by SGI. Note the "_psc" suffix:

$ which pbsnodes
/usr/local/packages/torque/2.3.13_psc/bin/pbsnodes
Their sources appear to be available in the f/s too. Using "tar -d" to
compare that to the pristine torque-2.3.13 tarball shows the following
files were modified:

torque-2.3.13/src/resmom/job_func.c
torque-2.3.13/src/resmom/mom_main.c
torque-2.3.13/src/resmom/requests.c
torque-2.3.13/src/resmom/linux/mom_mach.h
torque-2.3.13/src/resmom/linux/mom_mach.c
torque-2.3.13/src/resmom/linux/cpuset.c
torque-2.3.13/src/resmom/start_exec.c
torque-2.3.13/src/scheduler.tcl/pbs_sched.c
torque-2.3.13/src/cmds/qalter.c
torque-2.3.13/src/cmds/qsub.c
torque-2.3.13/src/cmds/qstat.c
torque-2.3.13/src/server/resc_def_all.c
torque-2.3.13/src/server/req_quejob.c
torque-2.3.13/torque.spec

I'll provide what assistance I can in testing. That includes providing
(off-list) the actual diffs of PSC's torque against the tarball, if desired.
(A sketch of the comparison command I used is appended below my signature.)

In the meantime, since -npernode didn't work, what is the right way to say:
"I have 1 slot but I want to overcommit and run 16 mpi ranks"?
(I've also appended, below my signature, the hostfile-based workaround I plan
to try, in case that is the intended approach.)

-Paul

On Fri, Jan 31, 2014 at 3:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jan 31, 2014, at 3:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Ralph,
>
> As I said this is NOT a cluster - it is a 4k-core shared memory machine.
>
>
> I understood - that wasn't the nature of my question
>
> TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
> So, there is always exactly one line in $PBS_NODESFILE.
>
>
> Interesting - because that isn't the standard way Torque behaves. It is
> supposed to put one line/slot in the nodefile, each line containing the
> name of the node. Clearly, SGI has reconfigured Torque to do something
> different.
>
>
> The system runs as 2 partitions of 2k-cores each.
> So, the contents of $PBS_NODESFILE has exactly 2 possible values, each
> 1 line.
>
> The values of PBS_PPN and PBS_NCPUS both reflect the size of the
> allocation.
>
> At a minimum, shouldn't Open MPI be multiplying the lines in
> $PBS_NODESFILE by the value of $PBS_PPN?
>
>
> No, as above, that isn't the way Torque generally behaves. It would appear
> that we need a "switch" here to handle SGI's modifications. Should be
> doable - just haven't had anyone using an SGI machine before :-)
>
>
> Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told
> there are not enough slots.
> Shouldn't that be working with 1 line in $PBS_NODESFILE?
>
> -Paul
>
>
>
> On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?
>>
>> On Jan 31, 2014, at 2:33 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> I am trying to test the trunk on an SGI UV (to validate Nathan's port of
>> btl:vader to SGI's variant of xpmem).
>>
>> At configure time, PBS's TM support was correctly located.
>>
>> My PBS batch script includes
>>     #PBS -l ncpus=16
>> because that is what this installation requires (not nodes, mppnodes, or
>> anything like that).
>> One is allocating cpus on a large shared-memory machine, not a set of
>> nodes in a cluster.
>>
>> However, this appears to be causing mpirun to think I have just 1 slot:
>>
>> + mpirun -np 2 ./ring_c
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 2 slots
>> that were requested by the application:
>>   ./ring_c
>>
>> Either request fewer slots for your application, or make more slots
>> available for use.
>> --------------------------------------------------------------------------
>>
>> In case they contain useful info, here are the PBS env vars in the job:
>>
>> PBS_HT_NCPUS=32
>> PBS_VERSION=TORQUE-2.3.13
>> PBS_JOBNAME=qs
>> PBS_ENVIRONMENT=PBS_BATCH
>> PBS_HOME=/var/spool/torque
>> PBS_O_WORKDIR=/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-trunk-linux-x86_64-uv-trunk/BLD/examples
>> PBS_PPN=16
>> PBS_TASKNUM=1
>> PBS_O_HOME=/usr/users/6/hargrove
>> PBS_MOMPORT=15003
>> PBS_O_QUEUE=debug
>> PBS_O_LOGNAME=hargrove
>> PBS_O_LANG=en_US.UTF-8
>> PBS_JOBCOOKIE=9EEF5DF75FA705A241FEF66EDFE01C5B
>> PBS_NODENUM=0
>> PBS_O_SHELL=/usr/psc/shells/bash
>> PBS_SERVER=tg-login1.blacklight.psc.teragrid.org
>> PBS_JOBID=314827.tg-login1.blacklight.psc.teragrid.org
>> PBS_NCPUS=16
>> PBS_O_HOST=tg-login1.blacklight.psc.teragrid.org
>> PBS_VNODENUM=0
>> PBS_QUEUE=debug_r1
>> PBS_O_MAIL=/var/mail/hargrove
>> PBS_NODEFILE=/var/spool/torque/aux//314827.tg-login1.blacklight.psc.teragrid.org
>> PBS_O_PATH=[...removed...]
>>
>> If any additional info is needed to help make mpirun "just work", please
>> let me know.
>>
>> However, at this point I am mostly interested in any work-arounds that
>> will let me run something other than a singleton on this system.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
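
P.S. For completeness, this is roughly the "tar -d" comparison mentioned
above. The paths here are placeholders for my setup, not something anyone
else should expect to exist:

    # Compare the pristine release tarball against the (PSC-modified) source
    # tree on disk; tar's -d (--diff) reports members whose contents differ.
    # /path/to/dir is hypothetical: the directory CONTAINING the modified
    # torque-2.3.13/ tree found on the filesystem.
    $ tar -dzf torque-2.3.13.tar.gz -C /path/to/dir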
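
P.P.S. And here is the overcommit workaround I plan to try, in case that is
indeed the intended way to ask for 16 ranks on 1 slot. This is only an
untested sketch on my part: the node name "bl0" is made up, and I don't yet
know how a user-supplied hostfile interacts with the TM-provided allocation
on this machine:

    # Hypothetical hostfile claiming 16 slots on the single allocated "node".
    $ cat myhosts
    bl0 slots=16

    # Launch 16 ranks against that hostfile, explicitly allowing
    # oversubscription via the rmaps MCA parameter.
    $ mpirun --hostfile myhosts -np 16 -mca rmaps_base_oversubscribe 1 ./ring_c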