+1

If I remember correctly, all the interfaces are scanned, so there should be
room to display a user-friendly message (on Linux and impacted architectures)
such as "there is no loopback interface, you will likely run into some trouble".

Gilles
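
A minimal sketch of such a check, assuming Linux with sysfs mounted at /sys.
The helper names are hypothetical and are not part of Open MPI, which would do
the equivalent in C while scanning interfaces. It reads each interface's flags
from /sys/class/net/<name>/flags and returns a warning when no interface has
both IFF_LOOPBACK and IFF_UP set:

```python
import os

# Flag values from <linux/if.h>.
IFF_UP = 0x1        # interface is administratively up
IFF_LOOPBACK = 0x8  # interface is a loopback device


def read_flags(sysfs_net, name):
    """Return the flags of one interface as an int (sysfs stores hex text)."""
    with open(os.path.join(sysfs_net, name, "flags")) as f:
        return int(f.read().strip(), 16)


def loopback_is_up(sysfs_net="/sys/class/net"):
    """True if at least one loopback interface is up."""
    for name in sorted(os.listdir(sysfs_net)):
        try:
            flags = read_flags(sysfs_net, name)
        except (OSError, ValueError):
            # Not an interface directory (or unreadable): skip it.
            continue
        if flags & IFF_LOOPBACK and flags & IFF_UP:
            return True
    return False


def loopback_warning(sysfs_net="/sys/class/net"):
    """Return a user-friendly message, or None if loopback is available."""
    if loopback_is_up(sysfs_net):
        return None
    return ("there is no loopback interface, "
            "you will likely run into some trouble")
```

Something along these lines, run before local ranks are launched, could print
the hint right away instead of after the TCP connection attempt times out.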

On 2014/12/03 13:50, Paul Hargrove wrote:
> IMHO the lack of a loopback interface should be a very uncommon occurrence.
> So, I believe that improving the error message to mention that possibility
> would help a great deal.
>
> -Paul
>
>
> On Tue, Dec 2, 2014 at 8:28 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> We talked about this on the weekly conference call, and adding the usock
>> component to 1.8 is just not within our procedures. It would involve
>> bringing over much more of the OOB revisions (we'd have to handle the
>> transfer of messages between components, if nothing else), and that
>> involves a lot of change.
>>
>> I'll instead try to provide a faster error response so it is clearer what
>> is happening, hopefully letting the user fix the problem by turning on the
>> loopback interface.
>>
>>
>> On Nov 25, 2014, at 7:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>
>> On Nov 25, 2014, at 6:15 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>
>>  Ralph and Paul,
>>
>> On 2014/11/26 10:37, Ralph Castain wrote:
>>
>> So it looks like the issue isn't so much with our code as it is with the OS 
>> stack, yes? We aren't requiring that the loopback be "up", but the stack is 
>> in order to establish the connection, even when we are trying a non-lo 
>> interface.
>>
>>  this is correct (imho)
>>
>> I can look into generating a faster timeout on the socket creation. In the 
>> trunk, we now use unix domain sockets instead of TCP to avoid such issues, 
>> but that won't help with the 1.8 series.
>>
>>  i was about to suggest this situation could have been avoided in the
>> first place by using unix domain sockets instead of TCP sockets :-)
>>
>>
>> There were some historical reasons for not doing so - mostly because it
>> generally isn't necessary on a cluster.
>>
>>
>> is a backport (since this is already available in the trunk/master) simply
>> out of the question ?
>>
>>
>> It would be against our normal procedures, but I can raise it at next
>> week's meeting.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Nov 25, 2014, at 4:50 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Ralph,
>>
>> I had a look at the problem via "mpirun -np 1 strace -o trace -ff ./hello"
>> I find that there is an attempt (by a secondary thread) to establish a TCP 
>> socket from the rank process to the eth0 address of localhost (I am guessing 
>> to reach the orted/mpirun).
>> However, when the "lo" interface is down, the Linux kernel apparently cannot 
>> establish that socket.
>>
>> In fact, if I am sufficiently patient, it turns out the "hang" is bounded, 
>> and eventually one sees:
>>
>> phargrov@blcr-armv7:~$ time mpirun -np 1 ./a.out
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:    blcr-armv7
>>   Remote host:   10.0.2.15
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>>
>> real    2m8.151s
>> user    0m5.360s
>> sys     0m57.430s
>>
>>
>> Where blcr-armv7 and 10.0.2.15 are *both* the local (only) host.
>>
>> There is no firewall, but in case you doubt me on that, here is a 
>> demonstration using ping to show that 10.0.2.15 is only reachable when the 
>> loopback interface is enabled:
>>
>> phargrov@blcr-armv7:~$ sudo ifconfig lo up
>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>
>> --- 10.0.2.15 ping statistics ---
>> 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
>> rtt min/avg/max/mdev = 0.527/0.534/0.542/0.024 ms
>>
>>
>> phargrov@blcr-armv7:~$ sudo ifconfig lo down
>> phargrov@blcr-armv7:~$ ping -q -c2 10.0.2.15
>> PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
>>
>> --- 10.0.2.15 ping statistics ---
>> 2 packets transmitted, 0 received, 100% packet loss, time 1006ms
>>
>>
>> So, there is no "hang" -- just a 2 minute pause before the error message is 
>> generated.
>> However, it may still be possible to present a better/earlier error message 
>> when there is no loopback interface (and at least one rank process is to be 
>> launched locally).
>>
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 4:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I'll have to look - there isn't supposed to be such a requirement, and I 
>> certainly haven't seen it before.
>>
>>
>>
>> On Nov 25, 2014, at 3:26 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>> Allan,
>>
>> I am glad things are working for you now.
>> I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu 
>> 14.04) that disabling the "lo" interface reproduces the problem.
>> I imagine this is true on other architectures, though I did not attempt to 
>> verify.
>>
>> Ralph,
>>
>> If oob:tcp really does need the loopback interface, shouldn't its lack be 
>> something that could/should be detected and reported instead of hanging as 
>> Allan saw?
>>
>> FWIW, neither of the following resolved the problem:
>>     -mca oob_tcp_if_exclude lo
>>     -mca oob_tcp_if_include eth0
>>
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>> I think I have found the problem. After inspecting the output with "-mca
>> state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 100"
>> on both the old system and the new system, I noticed there is one line that
>> is different: on the old system where it works correctly, there is a line
>> that says: "oob:tcp:init rejecting loopback interface lo", while on the new
>> system there is no such line. Both systems proceed to open interface eth0
>> afterwards. Then I checked the new system, and found out that somehow the
>> loopback interface is not up by default. After I brought up the lo interface,
>> mpirun executed normally.
>>
>> Does this mean that Open MPI uses lo for some initial setup? Since the
>> actual socket was created on eth0, I did not think of checking the lo
>> interface. Anyway, thanks everyone for all of your kind help. Let me know if
>> you want me to provide any more information for future reference.
>>
>> Regards,
>> Allan
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Ralph!
>>
>> I did not compile my openmpi with --enable-debug, and I am compiling it now. 
>> But your suggested command already provided some output, which I attached 
>> with this email.
>>
>> It seems the process was stuck on the line:
>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to [[44848,0],0] 
>> - activating send event"
>>
>> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said 
>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc 
>> [[44848,0],0] via interface eth0'.
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> This is all running on a single node, correct? If so, did you configure OMPI
>> with --enable-debug?
>> If you can do that, or already have, then let's add the following
>> to the mpirun cmd line:
>>
>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10
>>
>> You'll get a bunch of output, but hopefully it will tell us where
>> mpirun is encountering a problem.
>> Ralph
>>
>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Allan,
>>
>> If you send me the .config from your build of the kernel I can compare it 
>> against, for instance, my .config for a Raspberry Pi.
>> There will certainly be many differences, but I am hoping my own experience 
>> configuring linux kernels will help me filter the "noise" from any 
>> differences that might be significant.
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Paul! Unfortunately '/boot' is not available in my embedded linux, 
>> and I do not have the configuration file for the old kernel since it is 
>> provided as is. However, I have the new kernel configuration since I 
>> compiled it myself. Would it be helpful if I provide you the .config file 
>> when I compile the kernel? It may be quite painful to look through that file
>> though. Is there any other way that I can obtain the configuration?
>>
>> I checked my config for the new kernel, and UNIX-domain sockets and Sys V 
>> IPC are both enabled in the build. Are there any other possibilities I can 
>> check?
>>
>> Thanks,
>> Di
>>
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> Allan,
>>
>> A likely possibility is that some important kernel feature (that Open MPI 
>> assumes is present) is missing.
>> That includes not only "kernel modules" as you mention, but also features
>> configured in (or out) of the base kernel.
>> For instance, some embedded kernels omit UNIX-domain sockets and SysV IPC 
>> support.
>>
>> If you can send me (preferably off-list) the kernel config files for the old
>> and new kernels I may be able to spot something.
>> If present, you are looking for /boot/config-[VERSION]
>>
>> -Paul
>>
>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> I'm sorry, I forgot to change the subject when I replied to the digest.
>> Please find my original email below.
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>> Thanks Ralph for the reply. Sorry about the log file, I think I forgot to 
>> put an extension to the file. Please find a new one attached with this email.
>>
>> I'm sorry for not providing enough debugging information, but 'ompi_info' and
>> '--debug-devel' are the only ways I know of to collect information. Are
>> there any other things I can try to provide more info?
>>
>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the output is 
>> the logging information in my last email. It got stuck at  "[fpga1:00718] 
>> tmp: /tmp", and nothing from my helloworld program is printed out to the 
>> screen. So I think it is mpirun failing to start my executable, not failing 
>> to terminate.
>>
>> I was wondering if this has anything to do with my newer kernel version, 
>> since it works well in the old case.
>>
>> Thanks,
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>>
>>
>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>> From: Ralph Castain <r...@open-mpi.org>
>> To: Open MPI Developers <de...@open-mpi.org>
>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at execution
>>         on an embedded ARM Linux kernel version 3.15.0
>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>> Content-Type: text/plain; charset="utf-8"
>>
>> I don't know what you put in that log file, but it was an executable and I'm
>> not feeling that trusting :-)
>>
>> I'm afraid there isn't enough debug output there to really tell anything.
>> From what little I can see, I'm guessing that the application ran fine and
>> you got the usual "hello" output and the helloworld process exited safely -
>> is that correct? And so it is solely mpirun that is failing to cleanly
>> terminate?
>>
>>
>>
>> On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>
>> Hello everyone,
>>
>> I have cross-compiled OpenMPI for an embedded ARM Linux. Everything works
>> fine for my system based on Linux 3.8.0. I have previously submitted a post
>> related to my compilation, which can be found here:
>> http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>> When I recently upgraded my Linux kernel to 3.15.0, mpirun began to get stuck
>> even on the helloworld program. The program consists only of simple API
>> calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize. The problem
>> occurs even with 'mpirun -np 1 ./helloworld', and below is the output with
>> --debug-devel (before it got stuck):
>> [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>> [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>> [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>> [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>> [fpga1:00716] tmp: /tmp
>> [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>> [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>> [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>> [fpga1:00718] tmp: /tmp
>>
>> I suspect it may be due to an incompatible kernel version or some missing
>> kernel modules. I also tried the latest version, 1.8.3, and had the same
>> problem. Does anyone have any thoughts? I have attached the output of
>> 'ompi_info --all' to this email.
>>
>> Please let me know if I need to provide more information. Thanks in advance!
>>
>> Regards,
>> --
>> Di Wu (Allan)
>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>> Department of Computer Science, UC Los Angeles
>> Email: al...@cs.ucla.edu
>> <log.tar.gz>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900