Allan,

I am glad things are working for you now.
I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu
14.04) that disabling the "lo" interface reproduces the problem.
I imagine this is true on other architectures, though I did not attempt to
verify.

Ralph,

If oob:tcp really does need the loopback interface, shouldn't its lack be
something that could/should be detected and reported instead of hanging as
Allan saw?

FWIW, neither of the following resolved the problem:
    -mca oob_tcp_if_exclude lo
    -mca oob_tcp_if_include eth0
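
A minimal sketch of the kind of up-front check asked about above, so that a
missing loopback could be reported instead of producing a hang. This is not
Open MPI code. It is Linux-only, the function names are hypothetical, and it
assumes the sysfs layout /sys/class/net/<ifname>/flags together with the
IFF_UP (0x1) and IFF_LOOPBACK (0x8) bits from <linux/if.h>.

```python
"""Sketch: report when no loopback interface is up (Linux sysfs only)."""
import os

IFF_UP = 0x1        # interface is administratively up (<linux/if.h>)
IFF_LOOPBACK = 0x8  # interface is a loopback device (<linux/if.h>)

SYSFS_NET = "/sys/class/net"  # assumed location of per-interface entries


def loopback_is_up(flags):
    """Return True if the flag word describes a loopback that is up."""
    return bool(flags & IFF_LOOPBACK) and bool(flags & IFF_UP)


def read_flags(ifname, root=SYSFS_NET):
    """Read an interface's flag word, stored in sysfs as hex text."""
    with open(os.path.join(root, ifname, "flags")) as f:
        return int(f.read().strip(), 16)


def any_loopback_up(root=SYSFS_NET):
    """Scan every interface under root; True if some loopback is up."""
    try:
        names = os.listdir(root)
    except OSError:
        return False  # no sysfs available: treated here as "not up"
    for name in names:
        try:
            if loopback_is_up(read_flags(name, root)):
                return True
        except (OSError, ValueError):
            continue  # entry without a readable flags file: skip it
    return False


def loopback_warning(root=SYSFS_NET):
    """Return a warning string if no loopback is up, else None."""
    if any_loopback_up(root):
        return None
    return "warning: no loopback interface is up; mpirun may hang"


print(loopback_warning() or "loopback interface is up")
```

Running this before mpirun on the affected board would be expected to print
the warning while "lo" is down. Whether oob:tcp itself should perform an
equivalent check is the open question for Ralph.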


-Paul

On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:

> I think I have found the problem. After inspecting the output with
> "-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
> oob_base_verbose 100" on both the old system and the new system, I
> noticed there is one line that is different: on the old system where it
> works correctly, there is a line that says:
> "oob:tcp:init rejecting loopback interface lo",
> while on the new system there is no such line. Both systems proceed to open
> interface eth0 afterwards. Then I checked the new system, and found out
> that somehow the loopback interface is not up by default. After I brought
> up the lo interface, mpirun executes normally.
>
> Does it mean that OpenMPI will use lo for some initial setup? Since the
> actual socket was created on eth0 I did not think of checking the lo
> interface. Anyway, thanks everyone for all of your kind help. Let me know
> if you want me to provide any more information for future reference.
>
> Regards,
> Allan
>
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
>
> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>
>> Thanks Ralph!
>>
>> I did not compile my openmpi with --enable-debug, and I am compiling it
>> now. But your suggested command already provided some output, which I
>> attached with this email.
>>
>> It seems the process was stuck on the line:
>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
>> [[44848,0],0] - activating send event"
>>
>> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
>> [[44848,0],0] via interface eth0'.
>>
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> This is all running on a single node, correct? If so, did you configure
>>> OMPI with --enable-debug?
>>>
>>> If you can do that, or already have, then let's add the following to
>>> the mpirun cmd line:
>>>
>>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>>> oob_base_verbose 10
>>>
>>> You'll get a bunch of output, but hopefully it will tell us where
>>> mpirun is encountering a problem.
>>> Ralph
>>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
>>> wrote:
>>>
>>>> Allan,
>>>>
>>>> If you send me the .config from your build of the kernel I can compare
>>>> it against, for instance, my .config for a Raspberry Pi.
>>>> There will certainly be many differences, but I am hoping my own
>>>> experience configuring linux kernels will help me filter the "noise" from
>>>> any differences that might be significant.
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>
>>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>>> linux, and I do not have the configuration file for the old kernel since
>>>>> it is provided as is. However, I have the new kernel configuration since I
>>>>> compiled it myself. Would it be helpful if I provide you the .config file
>>>>> when I compile the kernel? It may be quite painful to look through that
>>>>> file though. Is there any other way that I can obtain the configuration?
>>>>>
>>>>> I checked my config for the new kernel, and UNIX-domain sockets and
>>>>> Sys V IPC are both enabled in the build. Are there any other possibilities
>>>>> I can check?
>>>>>
>>>>> Thanks,
>>>>> Di
>>>>>
>>>>> --
>>>>> Di Wu (Allan)
>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>> Department of Computer Science, UC Los Angeles
>>>>> Email: al...@cs.ucla.edu
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>>> wrote:
>>>>>
>>>>>> Allan,
>>>>>>
>>>>>> A likely possibility is that some important kernel feature (that Open
>>>>>> MPI assumes is present) is missing.
>>>>>> That includes not only "kernel modules" as you mention, but also
>>>>>> features configured in (or out of) the base kernel.
>>>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>>>> IPC support.
>>>>>>
>>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>>> the old and new kernels I may be able to spot something.
>>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>
>>>>>>> I'm sorry, I forgot to change the subject when I replied to the digest
>>>>>>> issue. Please find my original email below.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Di
>>>>>>>
>>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>>> forgot to put an extension on the file. Please find a new one attached
>>>>>>>> with this email.
>>>>>>>>
>>>>>>>> I'm sorry for not providing enough debugging information, but
>>>>>>>> 'ompi_info' and '--debug-devel' are the only ways I know of collecting
>>>>>>>> information. Are there any other things I can try to provide more info?
>>>>>>>>
>>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program
>>>>>>>> is printed out to the screen. So I think it is mpirun failing to start
>>>>>>>> my executable, not failing to terminate.
>>>>>>>>
>>>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>>>> version, since it works well in the old case.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> --
>>>>>>>> Di Wu (Allan)
>>>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>> Department of Computer Science, UC Los Angeles
>>>>>>>> Email: al...@cs.ucla.edu
>>>>>>>>
>>>>>>>>
>>>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>>>         execution on an embedded ARM Linux kernel version 3.15.0
>>>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>>>
>>>>>>>> I don't know what you put in that log file, but it was an
>>>>>>>> executable and I'm not feeling that trusting :-)
>>>>>>>>
>>>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>>>> anything. From what little I can see, I'm guessing that the
>>>>>>>> application ran fine and you got the usual "hello" output and the
>>>>>>>> helloworld process exited safely - is that correct? And so it is
>>>>>>>> solely mpirun that is failing to cleanly terminate?
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>>> >
>>>>>>>> > Hello everyone,
>>>>>>>> >
>>>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux.
>>>>>>>> > Everything works fine for my system based on Linux 3.8.0. I have
>>>>>>>> > previously submitted a post related to my compilation, which can be
>>>>>>>> > found here:
>>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>>>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began to
>>>>>>>> > get stuck even on the helloworld program. The program consists only
>>>>>>>> > of simple API calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank,
>>>>>>>> > MPI_Finalize. The problem occurs even with
>>>>>>>> > 'mpirun -np 1 ./helloworld', and below is the output with
>>>>>>>> > --debug-devel (before it got stuck):
>>>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>>>> >
>>>>>>>> > I suspect it may be due to an incompatible kernel version or some
>>>>>>>> > missing kernel modules. I also tried the latest version, 1.8.3,
>>>>>>>> > and had the same problem. Does anyone have any thoughts? I have
>>>>>>>> > attached the output of 'ompi_info --all' with this email.
>>>>>>>> >
>>>>>>>> > Please let me know if I need to provide more information. Thanks
>>>>>>>> > in advance!
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > --
>>>>>>>> > Di Wu (Allan)
>>>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>>>> > <log.tar.gz>
>>>>>>>> > _______________________________________________
>>>>>>>> > devel mailing list
>>>>>>>> > de...@open-mpi.org
>>>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> > Link to this post:
>>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>
>>>
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16348.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
