Ralph,

Since this turned out to be a matter of an unsupported system configuration, it is my
opinion that this doesn't need to be addressed for 1.7.4 if it would cause any further
delay.

Also, I noticed this system has lo and lo:0. I know the TCP BTL doesn't support virtual
interfaces (trac ticket 3339). So, I mention it here in case oob:tcp has similar issues.
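In case it helps whoever picks this up later, here is roughly how the alias shows up and
the workaround that got the singleton case going on the front end. The interface listing
is a sketch from memory rather than a capture from the machine, so treat the exact
commands as illustrative:

    # the alias appears as a separate lo:0 entry alongside lo
    $ /sbin/ifconfig -a | grep '^lo'

    # workaround for the singleton-on-the-front-end case:
    # restrict the OOB to the loopback interface
    $ mpirun -mca btl sm,self -np 2 -mca oob_tcp_if_include lo examples/ring_c

The TCP BTL has the analogous btl_tcp_if_include / btl_tcp_if_exclude parameters, so if
oob:tcp does turn out to share the ticket-3339 behavior, the same kind of include/exclude
setting should work around it there as well.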

-Paul


On Fri, Jan 10, 2014 at 1:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jan 10, 2014, at 12:59 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Ralph,
>
> This is the front end of a production cluster at NERSC.
> So, I would not be surprised if there is a fairly restrictive firewall
> configuration in place.
> However, I couldn't find a way to query the configuration.
>
>
> Aha - indeed, that is the problem.
>
>
> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
>
> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
> Is there some reason why the loopback interface is not being used
> automatically for the single-host case?
> That would seem to be a straightforward solution to this issue.
>
>
> Yeah, we should do a better job of that - I'll take a look and see what
> can be done in the near term.
>
> Thanks!
> Ralph
>
>
> -Paul
>
>
> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Bingo - the proc can't send a message to the daemon to tell it "i'm alive
>> and need my nidmap data". I suspect we'll find that your headnode isn't
>> allowing us to open a socket for communication between two processes on it,
>> and we don't have (yet) a pipe-like mechanism to replace it.
>>
>> Can verify that by putting "-mca oob_base_verbose 10" on the cmd line -
>> should see the oob indicate that it fails to make the connection back to
>> the daemon
>>
>>
>> On Jan 10, 2014, at 12:33 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> Configuring using a proper --with-tm=... I find that I *can* run a
>> singleton in an allocation ("qsub -I -l nodes=1 ....").
>> The case of a singleton on the front end is still failing.
>>
>> The verbose output using "-mca state_base_verbose 5 -mca
>> plm_base_verbose 5 -mca odls_base_verbose 5" is attached.
>>
>> -Paul
>>
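[Inline note for anyone reproducing this: the build under test was configured roughly as
below. The paths are placeholders rather than the actual ones used; the relevant points
are just --enable-debug (which Ralph asks about further down, and which is needed for the
verbose output) and a real --with-tm path so the TM/PBS support gets built.]

    # hypothetical paths - substitute your own install prefix and Torque/PBS location
    $ ./configure --prefix=$HOME/ompi-install \
          --enable-debug \
          --with-tm=/path/to/torque
    $ make all install
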
>>
>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>>
>>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> ??? that was it? Was this built with --enable-debug?
>>>>
>>>>
>>>> Nope, I missed --enable-debug. Will try again.
>>>>
>>>>
>>> OK, Take-2 below.
>>> There is an obvious "recipient list is empty!" in the output.
>>>
>>>
>>> That one is correct and expected - all it means is that you are running
>>> on only one node, so mpirun doesn't need to relay messages to another daemon
>>>
>>>
>>> -Paul
>>>
>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>>
>>>
>>> This is very weird - it appears that your procs are looking for hostname
>>> data prior to receiving the necessary data. Let's try jacking up the debug,
>>> I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca
>>> odls_base_verbose 5"
>>>
>>> Sorry that will be rather wordy, but I don't understand the ordering you
>>> show above. It's like your procs are skipping a bunch of steps in the
>>> startup procedure.
>>>
>>> Out of curiosity, if you do have an allocation and run on it, does it
>>> work?
>>>
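[Inline note, for the archives: the command used on the front end to generate the attached
state/plm/odls verbose output was approximately the one below. It is reconstructed after
the fact, so treat the flag order and the log file name as illustrative rather than exact.]

    $ mpirun -mca btl sm,self -np 2 \
          -mca state_base_verbose 5 \
          -mca plm_base_verbose 5 \
          -mca odls_base_verbose 5 \
          examples/ring_c 2>&1 | tee singleton-fe.log
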

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900