Allan,

I am glad things are working for you now. I can confirm (on a QEMU-emulated Versatile Express A9 board running Ubuntu 14.04) that disabling the "lo" interface reproduces the problem. I imagine this is true on other architectures, though I did not attempt to verify.
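As an aside, for anyone wanting to check for this condition before launching mpirun, something like the following might do. It is only a minimal Python sketch, assuming a Linux sysfs layout in which /sys/class/net/<ifname>/flags holds the interface flags as a hex string. The helper names (parse_if_flags, is_up, loopback_is_up) are made up for illustration and are not part of Open MPI.

```python
from pathlib import Path

# IFF_UP as defined in <linux/if.h>: the interface is administratively up.
IFF_UP = 0x1


def parse_if_flags(text: str) -> int:
    """Parse the hex string read from /sys/class/net/<ifname>/flags."""
    return int(text.strip(), 16)


def is_up(flags: int) -> bool:
    """Return True if the IFF_UP bit is set in the given flags."""
    return bool(flags & IFF_UP)


def loopback_is_up(ifname: str = "lo", sysfs_root: str = "/sys/class/net") -> bool:
    """Best-effort check; returns False if the interface is down or absent.

    Assumption: sysfs is mounted at the usual place and exposes a 'flags'
    file for the interface. Neither is guaranteed on every embedded build.
    """
    flags_file = Path(sysfs_root) / ifname / "flags"
    try:
        return is_up(parse_if_flags(flags_file.read_text()))
    except (OSError, ValueError):
        return False
```

Calling loopback_is_up() before mpirun and printing a warning when it returns False would at least turn the silent hang into a visible hint. It says nothing about why oob:tcp needs the interface in the first place.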
Ralph,

If oob:tcp really does need the loopback interface, shouldn't its lack be
something that could/should be detected and reported instead of hanging as
Allan saw?

FWIW, neither of the following resolved the problem:
    -mca oob_tcp_if_exclude lo
    -mca oob_tcp_if_include eth0

-Paul

On Tue, Nov 25, 2014 at 2:58 PM, Allan Wu <al...@cs.ucla.edu> wrote:

> I think I have found the problem. After inspecting the output with
> "-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
> oob_base_verbose 10" on both the old system and the new system, I noticed
> there is one line that is different: on the old system where it works
> correctly, there is a line that says: "oob:tcp:init rejecting loopback
> interface lo", while on the new system there is no such line. Both system
> proceed to open interface eth0 afterwards. Then I checked the new system,
> and found out that somehow the loopback interface is not up by default.
> After I opened the lo interface, the mpirun executes normally.
>
> Does it means that OpenMPI will use lo for some initial setup? Since the
> actual socket was created on eth0 I did not think of checking the lo
> interface. Anyway, thanks everyone for all of your kind help. Let me know
> if you want me to provide any more information for future references.
>
> Regards,
> Allan
>
> --
> Di Wu (Allan)
> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
> Department of Computer Science, UC Los Angeles
> Email: al...@cs.ucla.edu
>
> On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>
>> Thanks Ralph!
>>
>> I did not compile my openmpi with --enable-debug, and I am compiling it
>> now. But your suggested command already provided some output, which I
>> attached with this email.
>>
>> It seems the process was stuck on the line:
>> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
>> [[44848,0],0] - activating send event"
>>
>> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
>> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
>> [[44848,0],0] via interface eth0'.
>>
>> Regards,
>> Di
>>
>> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> This is all running on a single node, correct? If so, did you configure
>>> OMPI with --enable-debug?
>>>
>>> If you can do that, or already have, then let's add the following to
>>> the mpirun cmd line:
>>>
>>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>>> oob_base_verbose 10
>>>
>>> You'll get a bunch of output, but hopefully it will tell us where
>>> mpirun is encountering a problem.
>>> Ralph
>>>
>>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
>>> wrote:
>>>
>>>> Allan,
>>>>
>>>> If you send me the .config from your build of the kernel I can compare
>>>> it against, for instance, my .config for a Raspberry Pi.
>>>> There will certainly be many differences, but I am hoping my own
>>>> experience configuring linux kernels will help me filter the "noise"
>>>> from any differences that might be significant.
>>>>
>>>> -Paul
>>>>
>>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>
>>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>>> linux, and I do not have the configuration file for the old kernel
>>>>> since it is provided as is. However, I have the new kernel
>>>>> configuration since I compiled it myself. Would it be helpful if I
>>>>> provide you the .config file when I compile the kernel? It maybe
>>>>> quite painful to look through that file though. Is there any other
>>>>> way that I can obtain the configuration?
>>>>>
>>>>> I checked my config for the new kernel, and UNIX-domain sockets and
>>>>> Sys V IPC are both enabled in the build. Are there any other
>>>>> possibilities I can check?
>>>>>
>>>>> Thanks,
>>>>> Di
>>>>>
>>>>> --
>>>>> Di Wu (Allan)
>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>> Department of Computer Science, UC Los Angeles
>>>>> Email: al...@cs.ucla.edu
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>>> wrote:
>>>>>
>>>>>> Allan,
>>>>>>
>>>>>> A likely possibility is that some important kernel feature (that Open
>>>>>> MPI assumes is present) is missing.
>>>>>> That includes not only "kernel modules" as you mention, but also
>>>>>> features configure in (or out) of the base kernel.
>>>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>>>> IPC support.
>>>>>>
>>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>>> the old an new kernels I may be able to spot something.
>>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>
>>>>>>> I'm sorry I forgot to change the subject when I reply to the digest
>>>>>>> issue. Please find my original email below.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Di
>>>>>>>
>>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>>> forgot to put an extension to the file. Please find a new one
>>>>>>>> attached with this email.
>>>>>>>>
>>>>>>>> I'm sorry for not enough debugging information, but 'omp_info' and
>>>>>>>> '--debug-devel' are the only ways I know for collecting
>>>>>>>> information, are there any other things I can try to provide more
>>>>>>>> info?
>>>>>>>>
>>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program
>>>>>>>> is printed out to the screen. So I think it is mpirun failing to
>>>>>>>> start my executable, not failing to terminate.
>>>>>>>>
>>>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>>>> version, since it works well in the old case.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> --
>>>>>>>> Di Wu (Allan)
>>>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>> Department of Computer Science, UC Los Angeles
>>>>>>>> Email: al...@cs.ucla.edu
>>>>>>>>
>>>>>>>>
>>>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>>>         execution on an embedded ARM Linux kernel version 3.15.0
>>>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>>>
>>>>>>>> I don't know what you put in that log file, but it was an
>>>>>>>> executable and I'm not feeling that trusting :-)
>>>>>>>>
>>>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>>>> anything. From what little I can see, I'm guessing that the
>>>>>>>> application ran fine and you got the usual "hello" output and the
>>>>>>>> helloworld process exited safely - is that correct? And so it is
>>>>>>>> solely mpirun that is failing to cleanly terminate?
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>>> >
>>>>>>>> > Hello everyone,
>>>>>>>> >
>>>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux.
>>>>>>>> > Everything works fine for my system based on Linux 3.8.0. I have
>>>>>>>> > previously submitted a post related to my compilation, which can
>>>>>>>> > be found here:
>>>>>>>> > <http://www.open-mpi.org/community/lists/devel/2014/04/14440.php>.
>>>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun begins
>>>>>>>> > to stuck at even the helloworld program. The program consists
>>>>>>>> > only simple APIs: MPI_Init, MPI_Comm_size, MPI_Comm_rank,
>>>>>>>> > MPI_Finalize. The problem occurs even at
>>>>>>>> > 'mpirun -np 1 ./helloworld', and below are the output with
>>>>>>>> > --debug-devel (before it got stuck):
>>>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>>>> >
>>>>>>>> > I suspect maybe it is due to incompatible kernel version or some
>>>>>>>> > missing kernel modules. I tried also with the latest version
>>>>>>>> > 1.8.3, and had the same problem. Does anyone have any thoughts? I
>>>>>>>> > have attached the output of 'ompi-info --all' with this email.
>>>>>>>> >
>>>>>>>> > Please let me know if I need to provide more information. Thanks
>>>>>>>> > in advance!
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > --
>>>>>>>> > Di Wu (Allan)
>>>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>>>> > <log.tar.gz>
>>>>>>>> > _______________________________________________
>>>>>>>> > devel mailing list
>>>>>>>> > de...@open-mpi.org
>>>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> > Link to this post:
>>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16348.php
>

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900