On Jan 10, 2014, at 12:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Ralph,
> 
> This is the front end of a production cluster at NERSC.
> So, I would not be surprised if there is a fairly restrictive firewall 
> configuration in place.
> However, I couldn't find a way to query the configuration.
> 

Aha - indeed, that is the problem.

> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
> 
> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
> Is there some reason why the loopback interface is not being used 
> automatically for the single-host case?
> That would seem to be a straightforward solution to this issue.

Yeah, we should do a better job of that - I'll take a look and see what can be 
done in the near term.
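
In the meantime, a way to make that workaround persistent rather than adding it
to every command line - a sketch, assuming a standard Open MPI install and the
default per-user MCA parameter file location:

    # $HOME/.openmpi/mca-params.conf
    # keep the runtime's out-of-band TCP messaging on the loopback interface
    oob_tcp_if_include = lo

The same thing can be set per-shell with the environment variable
OMPI_MCA_oob_tcp_if_include=lo.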

Thanks!
Ralph

> 
> -Paul
> 
> 
> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Bingo - the proc can't send a message to the daemon to tell it "I'm alive and 
> need my nidmap data". I suspect we'll find that your headnode isn't allowing 
> us to open a socket for communication between two processes on it, and we 
> don't yet have a pipe-like mechanism to replace it.
> 
> You can verify that by putting "-mca oob_base_verbose 10" on the cmd line - 
> you should see the oob indicate that it fails to make the connection back to 
> the daemon.
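> 
> A quick way to check that outside of Open MPI is a tiny standalone loopback
> test - a sketch (not part of the ORTE code) that mimics what the oob/tcp
> component attempts: one side listens on 127.0.0.1 like the daemon, the other
> connects back like the proc. If the front end blocks local sockets, the
> connect() here should fail the same way.
> 
>     /* loopback_check.c - minimal sketch, not Open MPI code */
>     #include <arpa/inet.h>
>     #include <netinet/in.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <sys/wait.h>
>     #include <unistd.h>
> 
>     int main(void)
>     {
>         struct sockaddr_in addr;
>         socklen_t len = sizeof(addr);
>         int lsock = socket(AF_INET, SOCK_STREAM, 0);
> 
>         memset(&addr, 0, sizeof(addr));
>         addr.sin_family = AF_INET;
>         addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
>         addr.sin_port = 0;                       /* kernel picks a free port */
> 
>         if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
>             listen(lsock, 1) < 0 ||
>             getsockname(lsock, (struct sockaddr *)&addr, &len) < 0) {
>             perror("listener setup");
>             return 1;
>         }
> 
>         if (fork() == 0) {                       /* child plays the MPI proc */
>             int csock = socket(AF_INET, SOCK_STREAM, 0);
>             if (connect(csock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
>                 perror("connect over loopback");
>                 _exit(1);
>             }
>             puts("child: loopback connect succeeded");
>             _exit(0);
>         }
> 
>         /* parent plays the daemon and accepts the connection */
>         if (accept(lsock, NULL, NULL) < 0) {
>             perror("accept");
>             return 1;
>         }
>         puts("parent: accepted loopback connection");
>         wait(NULL);
>         return 0;
>     }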
> 
> 
> On Jan 10, 2014, at 12:33 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
>> Ralph,
>> 
>> Configuring with a proper --with-tm=..., I find that I *can* run a singleton 
>> in an allocation ("qsub -I -l nodes=1 ....").
>> The case of a singleton on the front end is still failing.
>> 
>> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 
>> -mca odls_base_verbose 5" is attached.
>> 
>> -Paul
>> 
>> 
>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> 
>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>> 
>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> ??? that was it? Was this built with --enable-debug?
>>> 
>>> Nope, I missed --enable-debug.  Will try again.
>>> 
>>> 
>>> OK, Take-2 below.
>>> There is an obvious "recipient list is empty!" in the output.
>> 
>> That one is correct and expected - all it means is that you are running on 
>> only one node, so mpirun doesn't need to relay messages to another daemon.
>> 
>>> 
>>> -Paul
>>> 
>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>> 
>> 
>> This is very weird - it appears that your procs are looking for hostname 
>> data prior to receiving the necessary data. Let's try jacking up the debug, 
>> I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca 
>> odls_base_verbose 5"
>> 
>> Sorry, that will be rather wordy, but I don't understand the ordering you 
>> show above. It's like your procs are skipping a bunch of steps in the 
>> startup procedure.
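>> 
>> For example, added to the earlier invocation, the full command would look 
>> something like:
>> 
>>   mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 -mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5 examples/ring_c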
>> 
>> Out of curiosity, if you do have an allocation and run on it, does it work?
>> 
>>> 
>>>  
>>> -- 
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Future Technologies Group
>>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> 
>> <log-fe.bz2>
> 
> <log-fe-2.bz2>
