I think I have found the problem. After inspecting the output with "-mca state_base_verbose 10 -mca odls_base_verbose 10 -mca oob_base_verbose 10" on both the old system and the new system, I noticed one line that differs: on the old system, where it works correctly, there is a line that says "oob:tcp:init rejecting loopback interface lo", while on the new system there is no such line. Both systems proceed to open interface eth0 afterwards.

I then checked the new system and found that the loopback interface is not up by default. After I brought up the lo interface, mpirun executes normally. Does this mean that OpenMPI uses lo for some initial setup? Since the actual socket was created on eth0, I did not think of checking the lo interface.

Anyway, thanks everyone for all of your kind help. Let me know if you want me to provide any more information for future reference.

Regards,
Allan
--
Di Wu (Allan)
PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
Department of Computer Science, UC Los Angeles
Email: al...@cs.ucla.edu

On Tue, Nov 25, 2014 at 11:55 AM, Allan Wu <al...@cs.ucla.edu> wrote:

> Thanks Ralph!
>
> I did not compile my openmpi with --enable-debug, and I am compiling it
> now. But your suggested command already provided some output, which I
> attached with this email.
>
> It seems the process was stuck on the line:
> "[fpga2:00962] [[44848,1],0] waiting for connect completion to
> [[44848,0],0] - activating send event"
>
> Then it got stuck and I CTRL+C'ed it. Previous to that line, it said
> something about 'orte_tcp_peer_try_connect: attempting to connect to proc
> [[44848,0],0] via interface eth0'.
>
> Regards,
> Di
>
> On Tue, Nov 25, 2014 at 2:25 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>
>> This is all running on a single node, correct? If so, did you configure
>> OMPI with --enable-debug?
>>
>> If you can do that, or already have, then let’s add the following to
>> the mpirun cmd line:
>>
>> -mca state_base_verbose 10 -mca odls_base_verbose 10 -mca
>> oob_base_verbose 10
>>
>> You’ll get a bunch of output, but hopefully it will tell us where
>> mpirun is encountering a problem.
>> Ralph
>> On Tue, Nov 25, 2014 at 11:20 AM, Paul Hargrove <phhargr...@lbl.gov>
>> wrote:
>>
>>> Allan,
>>>
>>> If you send me the .config from your build of the kernel I can compare
>>> it against, for instance, my .config for a Raspberry Pi.
>>> There will certainly be many differences, but I am hoping my own
>>> experience configuring linux kernels will help me filter the "noise" from
>>> any differences that might be significant.
>>>
>>> -Paul
>>>
>>> On Tue, Nov 25, 2014 at 11:11 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>
>>>> Thanks Paul! Unfortunately '/boot' is not available in my embedded
>>>> linux, and I do not have the configuration file for the old kernel since it
>>>> is provided as is. However, I have the new kernel configuration since I
>>>> compiled it myself. Would it be helpful if I provide you the .config file
>>>> from when I compiled the kernel? It may be quite painful to look through that
>>>> file though. Is there any other way that I can obtain the configuration?
>>>>
>>>> I checked my config for the new kernel, and UNIX-domain sockets and Sys
>>>> V IPC are both enabled in the build. Are there any other possibilities I
>>>> can check?
>>>>
>>>> Thanks,
>>>> Di
>>>>
>>>> --
>>>> Di Wu (Allan)
>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>> Department of Computer Science, UC Los Angeles
>>>> Email: al...@cs.ucla.edu
>>>>
>>>> On Tue, Nov 25, 2014 at 10:45 AM, Paul Hargrove <phhargr...@lbl.gov>
>>>> wrote:
>>>>
>>>>> Allan,
>>>>>
>>>>> A likely possibility is that some important kernel feature (that Open
>>>>> MPI assumes is present) is missing.
>>>>> That includes not only "kernel modules" as you mention, but also
>>>>> features configured in (or out) of the base kernel.
>>>>> For instance, some embedded kernels omit UNIX-domain sockets and SysV
>>>>> IPC support.
>>>>>
>>>>> If you can send me (preferably off-list) the kernel config files for
>>>>> the old and new kernels I may be able to spot something.
>>>>> If present, you are looking for /boot/config-[VERSION]
>>>>>
>>>>> -Paul
>>>>>
>>>>> On Tue, Nov 25, 2014 at 10:25 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>
>>>>>> I'm sorry I forgot to change the subject when I replied to the digest
>>>>>> issue. Please find my original email below.
>>>>>>
>>>>>> Regards,
>>>>>> Di
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 10:19 AM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>
>>>>>>> Thanks Ralph for the reply. Sorry about the log file, I think I
>>>>>>> forgot to put an extension on the file. Please find a new one attached
>>>>>>> with this email.
>>>>>>>
>>>>>>> I'm sorry for not providing enough debugging information, but 'ompi_info'
>>>>>>> and '--debug-devel' are the only ways I know for collecting information.
>>>>>>> Are there any other things I can try to provide more info?
>>>>>>>
>>>>>>> When I execute 'mpirun --debug-devel -np 1 ./helloworld', all the
>>>>>>> output is the logging information in my last email. It got stuck at
>>>>>>> "[fpga1:00718] tmp: /tmp", and nothing from my helloworld program
>>>>>>> is printed out to the screen. So I think it is mpirun failing to start
>>>>>>> my executable, not failing to terminate.
>>>>>>>
>>>>>>> I was wondering if this has anything to do with my newer kernel
>>>>>>> version, since it works well in the old case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --
>>>>>>> Di Wu (Allan)
>>>>>>> PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>> Department of Computer Science, UC Los Angeles
>>>>>>> Email: al...@cs.ucla.edu
>>>>>>>
>>>>>>>
>>>>>>> Date: Tue, 25 Nov 2014 07:29:51 -0800
>>>>>>> From: Ralph Castain <r...@open-mpi.org>
>>>>>>> To: Open MPI Developers <de...@open-mpi.org>
>>>>>>> Subject: Re: [OMPI devel] OpenMPI v1.8 and v1.8.3 mpirun hangs at
>>>>>>> execution on an embedded ARM Linux kernel version 3.15.0
>>>>>>> Message-ID: <898cb117-f6a6-4569-89c3-49b75d65b...@open-mpi.org>
>>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>>>
>>>>>>> I don't know what you put in that log file, but it was an executable
>>>>>>> and I'm not feeling that trusting :-)
>>>>>>>
>>>>>>> I'm afraid there isn't enough debug output there to really tell
>>>>>>> anything. From what little I can see, I'm guessing that the application
>>>>>>> ran fine and you got the usual "hello" output and the helloworld process
>>>>>>> exited safely - is that correct? And so it is solely mpirun that is
>>>>>>> failing to cleanly terminate?
>>>>>>>
>>>>>>>
>>>>>>> > On Nov 24, 2014, at 11:24 PM, Allan Wu <al...@cs.ucla.edu> wrote:
>>>>>>> >
>>>>>>> > Hello everyone,
>>>>>>> >
>>>>>>> > I have cross-compiled OpenMPI for an embedded ARM Linux.
>>>>>>> > Everything works fine for my system based on Linux 3.8.0. I have
>>>>>>> > previously submitted a post related to my compilation, which can be
>>>>>>> > found here:
>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/04/14440.php
>>>>>>> > When I recently upgraded my Linux kernel to 3.15.0, mpirun began to
>>>>>>> > get stuck even on the helloworld program. The program uses only
>>>>>>> > simple API calls: MPI_Init, MPI_Comm_size, MPI_Comm_rank,
>>>>>>> > MPI_Finalize. The problem occurs even with
>>>>>>> > 'mpirun -np 1 ./helloworld', and below is the output with
>>>>>>> > --debug-devel (before it got stuck):
>>>>>>> > [fpga1:00716] sess_dir_finalize: job session dir not empty - leaving
>>>>>>> > [fpga1:00716] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0/0
>>>>>>> > [fpga1:00716] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/0
>>>>>>> > [fpga1:00716] top: openmpi-sessions-root@fpga1_0
>>>>>>> > [fpga1:00716] tmp: /tmp
>>>>>>> > [fpga1:00718] procdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1/0
>>>>>>> > [fpga1:00718] jobdir: /tmp/openmpi-sessions-root@fpga1_0/63813/1
>>>>>>> > [fpga1:00718] top: openmpi-sessions-root@fpga1_0
>>>>>>> > [fpga1:00718] tmp: /tmp
>>>>>>> >
>>>>>>> > I suspect it may be due to an incompatible kernel version or some
>>>>>>> > missing kernel modules. I also tried the latest version, 1.8.3, and
>>>>>>> > had the same problem. Does anyone have any thoughts? I have attached
>>>>>>> > the output of 'ompi_info --all' with this email.
>>>>>>> >
>>>>>>> > Please let me know if I need to provide more information. Thanks
>>>>>>> > in advance!
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > --
>>>>>>> > Di Wu (Allan)
>>>>>>> > PhD student, VAST Laboratory <http://vast.cs.ucla.edu/>,
>>>>>>> > Department of Computer Science, UC Los Angeles
>>>>>>> > Email: al...@cs.ucla.edu <mailto:al...@cs.ucla.edu>
>>>>>>> > <log.tar.gz>
>>>>>>> > _______________________________________________
>>>>>>> > devel mailing list
>>>>>>> > de...@open-mpi.org
>>>>>>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> > Link to this post:
>>>>>>> > http://www.open-mpi.org/community/lists/devel/2014/11/16330.php
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/devel/2014/11/16341.php
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>
>>
>
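
P.S. In case it helps anyone who hits the same hang: the loopback check described at the top of this message can be scripted roughly as below. This is only a sketch under a few assumptions that are not from the thread: Python is available on the target, sysfs is mounted at /sys, and the kernel exposes the interface flags as a hex string in /sys/class/net/<iface>/flags. The function names (parse_flags, is_up, loopback_is_up) are made up for illustration. IFF_UP is the 0x1 bit from <linux/if.h>.

```python
from pathlib import Path

IFF_UP = 0x1  # "interface is up" bit, as defined in <linux/if.h>


def parse_flags(text):
    """Parse the hex string read from /sys/class/net/<iface>/flags."""
    return int(text.strip(), 16)


def is_up(flags):
    """Return True if the IFF_UP bit is set in the given flags value."""
    return bool(flags & IFF_UP)


def loopback_is_up(sysfs_root="/sys/class/net", iface="lo"):
    """Return True/False for the interface state, or None if unreadable.

    Assumes sysfs is mounted; the default path is an assumption about
    the target system, not something confirmed in this thread.
    """
    try:
        text = (Path(sysfs_root) / iface / "flags").read_text()
    except OSError:
        return None
    return is_up(parse_flags(text))


if __name__ == "__main__":
    state = loopback_is_up()
    if state is None:
        print("could not read lo flags from sysfs")
    elif state:
        print("lo is up")
    else:
        print("lo is down; try 'ip link set lo up' or 'ifconfig lo up'")
```

Running something like this before mpirun on a freshly built image would have pointed at the missing lo interface much sooner than the verbose OOB output did.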