It's missing the hostname of the other process. That should have been included 
in the data passed to each proc at startup, which is why this is so puzzling.
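
To make the symptom concrete, here is a toy sketch (not Open MPI code) of what I mean. It assumes each proc holds a simple per-rank table of startup data. The names `fetch`, `describe_peers`, `KeyNotFound` and `rank0_view` are all made up for illustration, and only the hostname `cvrsvc01` comes from the report below.

```python
# Toy illustration only: none of these names come from Open MPI.
class KeyNotFound(Exception):
    """Raised when no value is stored for the requested rank and key."""


def fetch(startup_data, rank, key):
    """Return the value stored for (rank, key), or raise KeyNotFound."""
    try:
        return startup_data[rank][key]
    except KeyError:
        raise KeyNotFound(f"rank {rank}: data for key {key!r} not found") from None


def describe_peers(startup_data, nprocs, key="hostname"):
    """Look the key up for every rank and report hits and misses as text."""
    lines = []
    for rank in range(nprocs):
        try:
            lines.append(f"rank {rank}: {key}={fetch(startup_data, rank, key)}")
        except KeyNotFound as err:
            lines.append(str(err))
    return lines


# Assumed view from rank 0 in the failing case: its own hostname is present,
# but nothing was delivered for rank 1.
rank0_view = {0: {"hostname": "cvrsvc01"}}

if __name__ == "__main__":
    print("\n".join(describe_peers(rank0_view, nprocs=2)))
```

In this sketch the lookup for rank 1 is the one that fails, which is the situation I believe both procs are hitting for their peer.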

On Jan 9, 2014, at 8:56 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Ralph,
> 
> The problem has occurred with two builds (both PGI-based) on head nodes of 
> two clusters managed by TORQUE, not by SLURM.  Somehow configure on the first 
> picked up SLURM headers and libs, but not TM, while the second picked up the 
> TM headers and libs.
> 
> I'll try a gcc-based build on one of the systems ASAP.
> Is there no way (w/o source mods) to know what datum is missing?
> 
> -Paul
> 
> 
> 
> On Thu, Jan 9, 2014 at 8:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> From your ompi_info output, it looks like this is a slurm system - yes? 
> Wouldn't really matter anyway as we run fine on a head node without an 
> allocation, but worth clarifying.
> 
> What the message is indicating is a failure of the modex - we are missing an 
> expected piece of data. I don't see anything obvious as the source of the 
> problem - works fine for me on all my machines, including on the front end 
> of a slurm cluster.
> 
> Only possibly relevant thing I see is that this was built with PGI - any 
> chance you could try a gcc based build? All my tests are done with gcc, so 
> I'm wondering if PGI is the source of the trouble here.
> 
> 
> On Jan 9, 2014, at 6:17 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
>> I've now seen this same failure mode on another Linux system.
>> I forgot to mention before that the job is hung after issuing the error 
>> message.
>> Singleton runs fail in the same manner.
>> 
>> Both are front-end machines, and perhaps that is related to this failure; 
>> for instance, the runtime may be expecting an allocation because of the 
>> batch system detected at configure time.  However, I would have expected a 
>> more informative error message for that case.
>> 
>> -Paul
>> 
>> 
>> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Trying to run on the front-end of one of our production Linux systems I see 
>> the following:
>> 
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
>> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
>> 
>> The "ompi_info --all" output is attached.
>> 
>> Please let me know what MCA param(s) to set to collect any additional info 
>> needed to track down the problem.
>> 
>> -Paul
>> 
>> 
>> -- 
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
