On Aug 26, 2014, at 6:09 PM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi Ralph,
>
>> I don't know what version of OMPI you're working with, so I can't
>> precisely pinpoint the line in question. However, it looks likely to
>> be an error caused by not finding the PBS nodefile.
>
> This is openmpi 1.6.5.
>
>> We look in the environment for PBS_NODEFILE to find the directory
>> where the file should be found, and then look for a file named with
>> our Torque-assigned jobid in that place. The open failure indicates
>> that it isn't there or isn't readable by us.
>
> Does that mean that I misunderstand the --with-libpci switch for hwloc
> and --enable-cpuset for torque? I had thought that this eliminates the
> need for $PBS_NODEFILE.

I'm afraid not - it has nothing to do with it. We need the nodefile to tell us what nodes were allocated for the job. The other switches can tell us which cores are available for our use on each of those nodes.

>
>> If you are on a network file system, then it's possible that Torque
>> is creating the file on your server, but the compute node just isn't
>> seeing it fast enough. You might look at potential NFS setup switches
>> to speed-up the sync.
>
> Indeed the compute nodes are NFS-mounted. I'll take a look at sync
> parameters. Thanks for the pointer. I suspect this is the problem.
>
> Cheers,
> Andrej
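
If it helps to confirm the NFS theory, something like the rough Python sketch below could be run at the top of a job script. It is not what Open MPI does internally; it only reads the path in PBS_NODEFILE, retries the open a few times in case the file shows up late on the compute node, and counts how many times each host is listed. The function name read_nodefile and the retry count and delay are made up for illustration, and it assumes the usual Torque layout of one hostname per line, repeated once per allocated slot.

```python
import os
import time
from collections import Counter


def read_nodefile(path, retries=5, delay=1.0):
    """Return {hostname: slot_count} from a nodefile, retrying if open fails.

    Hypothetical helper: retries and delay are arbitrary illustrative values.
    """
    last_error = None
    for attempt in range(retries):
        try:
            with open(path) as f:
                # Assumed layout: one hostname per line, repeated per slot.
                hosts = [line.strip() for line in f if line.strip()]
            return dict(Counter(hosts))
        except OSError as exc:
            # The file may not be visible yet (e.g. a slow NFS sync).
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)
    raise RuntimeError(f"could not read nodefile {path!r}: {last_error}")


def main():
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        print("PBS_NODEFILE is not set; is this running inside a Torque job?")
        return
    for host, slots in read_nodefile(path).items():
        print(f"{host}: {slots} slot(s)")


if __name__ == "__main__":
    main()
```

If the first attempt fails but a later one succeeds, that would point at the file arriving late on the compute node rather than at a permissions problem.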