+1

If I remember correctly, all the interfaces are scanned, so there should be room to display a user-friendly message (on Linux and impacted architectures) such as "there is no loopback interface, you will likely run into some trouble".
Gilles

On 2014/12/03 13:50, Paul Hargrove wrote:
> IMHO the lack of a loopback interface should be a very uncommon occurrence.
> So, I believe that improving the error message to mention that possibility
> would help a great deal.
>
> -Paul
>
> On Tue, Dec 2, 2014 at 8:28 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> We talked about this on the weekly conference call, and adding the usock
>> component to 1.8 is just not within our procedures. It would involve
>> bringing over much more of the OOB revisions (we'd have to handle the
>> transfer of messages between components, if nothing else), and that
>> involves a lot of change.
>>
>> I'll instead try to provide a faster error response so it is clearer what
>> is happening, hopefully letting the user fix the problem by turning on the
>> loopback interface.
>>
>> On Nov 25, 2014, at 7:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Nov 25, 2014, at 6:15 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>> Ralph and Paul,
>>
>> On 2014/11/26 10:37, Ralph Castain wrote:
>>
>> So it looks like the issue isn't so much with our code as it is with the OS
>> stack, yes? We aren't requiring that the loopback be "up", but the stack is
>> in order to establish the connection, even when we are trying a non-lo
>> interface.
>>
>> this is correct (imho)
>>
>> I can look into generating a faster timeout on the socket creation. In the
>> trunk, we now use unix domain sockets instead of TCP to avoid such issues,
>> but that won't help with the 1.8 series.
>>
>> i was about to suggest this situation could have been avoided in the
>> first place by using unix domain sockets instead of TCP sockets :-)
>>
>> There were some historical reasons for not doing so - mostly because it
>> generally isn't necessary on a cluster.
>>
>> is a backport (since this is already available in the trunk/master) simply
>> out of the question ?
>>
>> It would be against our normal procedures, but I can raise it at next
>> week's meeting.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Nov 25, 2014, at 4:50 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
>> I find that there is an attempt (by a secondary thread) to establish a TCP
>> socket from the rank process to the eth0 address of localhost (I am guessing
>> to reach the orted/mpirun).
>> However, when the "lo" interface is down, the Linux kernel apparently cannot
>> establish that socket.
>>
>> In fact, if I am sufficiently patient, it turns out the "hang" is bounded,
>> and eventually one sees:
>>
>> phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>> Local host: blcr-armv7
>> Remote host: 10.0.2.15
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>>
>> real 2m8.151s
>> user 0m5.360s
>> sys 0m57.430s
>>
>> Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.
>>
>> There is no firewall, but in case you doubt me on that, here is a
>> demonstration using ping to show that 10.0.2.15 is only reachable when the
>> loopback interface is enabled:
>>
>> phargrov@blcr-armv7:~$ sudo ifconfig lo up
>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>
>> --- 10.0.2.15 ping statistics ---
>> 2 packets transmitted, 0 received, 100% packet loss, time 1006ms
>>
>> So, there is no "hang" -- just a 2 minute pause before the error message is
>> generated.
>> However, it may still be possible to present a better/earlier error message
>> when there is no loopback interface (and at least one rank process is to be
>> launched locally).
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I'll have to look - there isn't supposed to be such a requirement, and I
>> certainly haven't seen it before.
>>
>> On Nov 25, 2014, at 3:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Allan,
>>
>> I am glad things are working for you now.
>> I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu
>> 14.04) that disabling the "lo" interface reproduces the problem.
>> I imagine this is true on other architectures, though I did not attempt to
>> verify.
>>
>> Ralph,
>>
>> If oob:tcp really does need the loopback interface, shouldn't its lack be
>> something that could/should be detected and reported instead of hanging as
>> Allan saw?
>>
>> FWIW, neither of the following resolved the problem:
>> -mca oob_tcp_if_exclude lo
>> -mca oob_tcp_if_include eth0
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>> I think I have found the problem. After inspecting the output with "-mca
>> state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100"
>> on both the old system and the new system, I noticed there is one line that
>> is different: on the old system where it works correctly, there is a line
>> that says: "oob:tcp:init rejecting loopback interface lo", while on the new
>> system there is no such line.
>> Both systems proceed to open interface eth0
>> afterwards. Then I checked the new system, and found out that somehow the
>> loopback interface is not up by default. After I brought up the lo interface,
>> mpirun executes normally.
>>
>> Does it mean that OpenMPI will use lo for some initial setup? Since the
>> actual socket was created on eth0 I did not think of checking the lo
>> interface. Anyway, thanks everyone for all of your kind help. Let me know if
>> you want me to provide any more information for future reference.
>>
>> Regards,
>> Allan
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Ralph!
>>
>> I did not compile my openmpi with --enable-debug, and I am compiling it now.
>> But your suggested command already provided some output, which I attached
>> with this email.
>>
>> It seems the process was stuck on the line:
>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to [[44848,0],0]
>> - activating send event"
>>
>> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
>> [[44848,0],0] via interface eth0'.
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> This is all running on a single node, correct? If so, did you configure OMPI
>> with --enable-debug?
>> If you can do that, or already have, then let's add the following
>> to the mpirun cmd line:
>>
>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>>
>> You'll get a bunch of output, but hopefully it will tell us where
>> mpirun is encountering a problem.
>> Ralph
>>
>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Allan,
>>
>> If you send me the .config from your build of the kernel I can compare it
>> against, for instance, my .config for a Raspberry Pi.
>> There will certainly be many differences, but I am hoping my own experience
>> configuring linux kernels will help me filter the "noise" from any
>> differences that might be significant.
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Paul! Unfortunately '/boot' is not available in my embedded linux,
>> and I do not have the configuration file for the old kernel since it is
>> provided as is. However, I have the new kernel configuration since I
>> compiled it myself. Would it be helpful if I provide you the .config file
>> when I compile the kernel? It may be quite painful to look through that file
>> though. Is there any other way that I can obtain the configuration?
>>
>> I checked my config for the new kernel, and UNIX-domain sockets and Sys V
>> IPC are both enabled in the build. Are there any other possibilities I can
>> check?
>>
>> Thanks,
>> Di
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Allan,
>>
>> A likely possibility is that some important kernel feature (that Open MPI
>> assumes is present) is missing.
>> That includes not only "kernel modules" as you mention, but also features
>> configured in (or out) of the base kernel.
>> For instance, some embedded kernels omit UNIX-domain sockets and SysV IPC
>> support.
>>
>> If you can send me (preferably off-list) the kernel config files for the old
>> and new kernels I may be able to spot something.
>> If present, you are looking for /boot/config-[VERSION]
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> I'm sorry I forgot to change the subject when I replied to the digest issue.
>> Please find my original email below.
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Ralph for the reply. Sorry about the log file, I think I forgot to
>> put an extension to the file. Please find a new one attached with this email.
>>
>> I'm sorry for not enough debugging information, but 'ompi_info' and
>> '--debug-devel' are the only ways I know for collecting information, are
>> there any other things I can try to provide more info?
>>
>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the output is
>> the logging information in my last email. It got stuck at "[fpga1:00718]
>> tmp: /tmp", and nothing from my helloworld program is printed out to the
>> screen.
>> So I think it is mpirun failing to start my executable, not failing
>> to terminate.
>>
>> I was wondering if this has anything to do with my newer kernel version,
>> since it works well in the old case.
>>
>> Thanks,
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>> From: Ralph Castain <r...@open-mpi.org>
>> To: Open MPI Developers <de...@open-mpi.org>
>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>> execution on an embedded ARM Linux kernel version 3.15.0
>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>> Content-Type: text/plain; charset="utf-8"
>>
>> I don't know what you put in that log file, but it was an executable and I'm
>> not feeling that trusting :-)
>>
>> I'm afraid there isn't enough debug output there to really tell anything.
>> From what little I can see, I'm guessing that the application ran fine and
>> you got the usual "hello" output and the helloworld process exited safely -
>> is that correct? And so it is solely mpirun that is failing to cleanly
>> terminate?
>>
>> On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>> Hello everyone,
>>
>> I have cross-compiled OpenMPI for an embedded ARM Linux. Everything works
>> fine for my system based on Linux 3.8.0.
>> I have previously submitted a post
>> related to my compilation, which can be found here:
>> http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>> When I recently upgraded my Linux kernel to 3.15.0, mpirun began to hang on
>> even the helloworld program. The program consists of only simple API calls:
>> MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize. The problem occurs even
>> with 'mpirun -np 1 ./helloworld', and below is the output with --debug-devel
>> (before it got stuck):
>> [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>> [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>> [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>> [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>> [fpga1:00716] tmp: /tmp
>> [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>> [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>> [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>> [fpga1:00718] tmp: /tmp
>>
>> I suspect maybe it is due to incompatible kernel version or some missing
>> kernel modules. I tried also with the latest version 1.8.3, and had the same
>> problem. Does anyone have any thoughts? I have attached the output of
>> 'ompi_info --all' with this email.
>>
>> Please let me know if I need to provide more information. Thanks in advance!
>>
>> Regards,
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>> <log.tar.gz>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>
>> --
>> Paul H. Hargrove phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
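
The suggestion at the top of this thread (scan the interfaces and warn when no loopback interface is up) can be sketched roughly as follows. This is only a minimal Python illustration, not Open MPI code: it assumes a Linux host that exposes /sys/class/net/<name>/flags, and the helper names (read_interface_flags, loopback_is_up, loopback_warning) are hypothetical. IFF_UP and IFF_LOOPBACK are the flag bits defined in linux/if.h.

```python
import os

IFF_UP = 0x1        # interface is administratively up (linux/if.h)
IFF_LOOPBACK = 0x8  # interface is a loopback device (linux/if.h)


def read_interface_flags(sysfs_root="/sys/class/net"):
    """Return {interface_name: flags} read from Linux sysfs (assumed layout)."""
    flags = {}
    if not os.path.isdir(sysfs_root):
        return flags
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name, "flags")
        try:
            with open(path) as handle:
                # The file holds a hex string such as "0x9".
                flags[name] = int(handle.read().strip(), 16)
        except (OSError, ValueError):
            continue  # not an interface directory, or unreadable
    return flags


def loopback_is_up(flags_by_name):
    """True if at least one loopback interface has IFF_UP set."""
    return any(
        value & IFF_LOOPBACK and value & IFF_UP
        for value in flags_by_name.values()
    )


def loopback_warning(flags_by_name):
    """Return a warning string, or None when a loopback interface is up."""
    if loopback_is_up(flags_by_name):
        return None
    return ("there is no loopback interface up; "
            "local TCP connections are likely to fail")


if __name__ == "__main__":
    message = loopback_warning(read_interface_flags())
    if message:
        print("WARNING: " + message)
```

On a host in the state described above (after "sudo ifconfig lo down"), the flags for lo would be expected to lack the IFF_UP bit, so the script would print the warning immediately instead of leaving the user to wait for the TCP connection to time out.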