It's missing the hostname of the other process. That should have been included in the data passed into each proc at startup, which is why this is so puzzling.
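For anyone following along who is less familiar with the modex, the failure can be pictured with a toy key-value exchange. This is only a rough sketch in Python and not Open MPI code: `ModexStore`, `publish`, `lookup`, `describe_peer`, and the `"hostname"` key name are all hypothetical names picked for illustration. It assumes each process publishes some data at startup and peers later look it up by key; a key that was never published surfaces as a "not found" error, roughly like the one in the log.

```python
# Toy model of a modex-style exchange. NOT Open MPI's implementation.
# Every name here (ModexStore, publish, lookup, "hostname") is hypothetical.


class KeyNotFound(Exception):
    """Raised when a peer never published the requested key."""


class ModexStore:
    def __init__(self):
        # Maps (peer_rank, key) -> value published by that peer.
        self._data = {}

    def publish(self, rank, key, value):
        """Record a value that one process makes available to its peers."""
        self._data[(rank, key)] = value

    def lookup(self, rank, key):
        """Fetch a peer's value, or raise KeyNotFound if it was never published."""
        try:
            return self._data[(rank, key)]
        except KeyError:
            raise KeyNotFound(
                f"Data for key {key!r} not found for rank {rank}"
            ) from None


def describe_peer(store, rank):
    """Return the peer's hostname, or an error string if it is missing."""
    try:
        return store.lookup(rank, "hostname")
    except KeyNotFound as err:
        return f"error: {err}"


store = ModexStore()
store.publish(0, "hostname", "cvrsvc01")
# Rank 1's hostname is deliberately never published, to mimic the failure.

print(describe_peer(store, 0))
print(describe_peer(store, 1))
```

In this sketch the lookup for rank 1 fails because nothing was ever stored under that key, which is the general shape of the problem: the consumer expects a datum that the startup exchange never delivered.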

On Jan 9, 2014, at 8:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Ralph,
>
> The problem has occurred with two builds (both PGI-based) on head nodes of
> two clusters managed by TORQUE, not by SLURM. Somehow configure on the first
> picked up SLURM headers and libs, but not TM. While the second picked up the
> TM headers and libs.
>
> I'll try a gcc-based build on one of the systems ASAP.
> Is there no way (w/o source mods) to know what datum is missing?
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 8:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> From your ompi_info output, it looks like this is a slurm system - yes?
> Wouldn't really matter anyway as we run fine on a head node without an
> allocation, but worth clarifying.
>
> What the message is indicating is a failure of the modex - we are missing an
> expected piece of data. I don't see anything obvious as the source of the
> problem - works fine for me on all my machines, including on front end of a
> slurm cluster.
>
> Only possibly relevant thing I see is that this was built with PGI - any
> chance you could try a gcc based build? All my tests are done with gcc, so
> I'm wondering if PGI is the source of the trouble here.
>
>
> On Jan 9, 2014, at 6:17 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> I've now seen this same failure mode on another Linux system.
>> I forgot to mention before that the job is hung after issuing the error
>> message.
>> Singleton runs fail in the same manner.
>>
>> Both are front-end machines and perhaps that is related to this failure; for
>> instance expecting an allocation because of the batch system detected at
>> configure time. However, I would have expected a more informative error
>> message for that case.
>>
>> -Paul
>>
>>
>> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Trying to run on the front-end of one of our production Linux systems I see
>> the following:
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>> at line 505
>> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>> at line 505
>>
>> The "ompi_info --all" output is attached.
>>
>> Please let me know what MCA param(s) to set to collect any additional info
>> needed to track down the problem.
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel