Is this a bug in running Open MPI in a heterogeneous environment (between a Mac and a Linux box) over wireless links? Please suggest what needs to be done or what I am missing. Any clues as to how to debug this would be of great help.

thanks and regards,
pallab
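One way to take MPI out of the picture, in the spirit of the suggestion quoted below to run a non-MPI job over the wireless link, might be a plain TCP probe of the ports given to btl_tcp_port_min_v4 and btl_tcp_port_range_v4. The following is only a minimal sketch in Python and not part of Open MPI. The helper name probe_ports and the timeout value are made up for illustration, and it assumes the range parameter counts ports upward from the minimum. A "refused" result still shows the host is reachable on that port, while a "timeout" hints that packets are being dropped somewhere along the wireless path.

```python
import socket

def probe_ports(host, first_port, count, timeout=2.0):
    """Try a TCP connect() to each port in [first_port, first_port + count).

    Returns a dict mapping port -> "open", "refused", "timeout" or "error: ...".
    """
    results = {}
    for port in range(first_port, first_port + count):
        try:
            # create_connection resolves the host and applies the timeout.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = "open"
        except ConnectionRefusedError:
            results[port] = "refused"
        except socket.timeout:
            results[port] = "timeout"
        except OSError as exc:
            results[port] = f"error: {exc}"
    return results

# Hypothetical usage with the address and range from the mpirun command line:
#   for port, state in probe_ports("10.11.14.205", 36900, 32).items():
#       print(port, state)
```

Running it from each machine against the other would show whether the problem is specific to one direction.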
> Hi Rolf,
>
> I ran the following:
>
> pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
> btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205
> /tmp/hello
>
> [fuji.local:02267] mca: base: components_open: Looking for btl components
> [fuji.local:02267] mca: base: components_open: opening btl components
> [fuji.local:02267] mca: base: components_open: found loaded component self
> [fuji.local:02267] mca: base: components_open: component self has no register function
> [fuji.local:02267] mca: base: components_open: component self open function successful
> [fuji.local:02267] mca: base: components_open: found loaded component sm
> [fuji.local:02267] mca: base: components_open: component sm has no register function
> [fuji.local:02267] mca: base: components_open: component sm open function successful
> [fuji.local:02267] mca: base: components_open: found loaded component tcp
> [fuji.local:02267] mca: base: components_open: component tcp has no register function
> [fuji.local:02267] mca: base: components_open: component tcp open function successful
> [fuji.local:02267] select: initializing btl component self
> [fuji.local:02267] select: init of component self returned success
> [fuji.local:02267] select: initializing btl component sm
> [fuji.local:02267] select: init of component sm returned success
> [fuji.local:02267] select: initializing btl component tcp
> [fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "wlan0"
> [fuji.local:02267] select: init of component tcp returned success
> [apex-backpack:31956] mca: base: components_open: Looking for btl components
> [apex-backpack:31956] mca: base: components_open: opening btl components
> [apex-backpack:31956] mca: base: components_open: found loaded component self
> [apex-backpack:31956] mca: base: components_open: component self has no register function
> [apex-backpack:31956] mca: base: components_open: component self open function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component sm
> [apex-backpack:31956] mca: base: components_open: component sm has no register function
> [apex-backpack:31956] mca: base: components_open: component sm open function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component tcp
> [apex-backpack:31956] mca: base: components_open: component tcp has no register function
> [apex-backpack:31956] mca: base: components_open: component tcp open function successful
> [apex-backpack:31956] select: initializing btl component self
> [apex-backpack:31956] select: init of component self returned success
> [apex-backpack:31956] select: initializing btl component sm
> [apex-backpack:31956] select: init of component sm returned success
> [apex-backpack:31956] select: initializing btl component tcp
> [apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "en0"
> [apex-backpack:31956] select: init of component tcp returned success
> Process 0 on fuji.local out of 2
> Process 1 on apex-backpack out of 2
> [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>
> It launches the processes on both ends and then it hangs at the
> send/receive part.
> What is the other thing that you were mentioning which makes you think
> that it's not working?
> Please suggest.
> --regards, pallab
>
>> The --enable-heterogeneous should do the trick. And to answer the
>> previous question, yes, put both of the interfaces in the include list.
>>
>> --mca btl_tcp_if_include en0,wlan0
>>
>> If that does not work, then I may have one other thought why it might
>> not work although perhaps not a solution.
>>
>> Rolf
>>
>> Pallab Datta wrote:
>>> Hi Rolf,
>>>
>>> Do I need to configure openmpi with some specific options apart from
>>> --enable-heterogeneous?
>>> I am currently using
>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>> --disable-static --enable-shared --enable-debug
>>>
>>> on both ends. Is the above correct? Please let me know.
>>> thanks and regards,
>>> pallab
>>>
>>>> Hi:
>>>> I assume if you wait several minutes then your program will actually
>>>> time out, yes? I guess I have two suggestions. First, can you run a
>>>> non-MPI job using the wireless? Something like hostname? Secondly, you
>>>> may want to specify the specific interfaces you want it to use on the
>>>> two machines. You can do that via the "--mca btl_tcp_if_include"
>>>> run-time parameter. Just list the ones that you expect it to use.
>>>>
>>>> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1" It
>>>> should be --mca mpi_preconnect_mpi 1 if you want to do the connection
>>>> during MPI_Init.
>>>>
>>>> Rolf
>>>>
>>>> Pallab Datta wrote:
>>>>
>>>>> The following is the error dump
>>>>>
>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
>>>>> 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>>>> btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
>>>>> localhost,10.11.14.205 /tmp/hello
>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>> [fuji.local:01316] select: initializing btl component self
>>>>> [fuji.local:01316] select: init of component self returned success
>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>> Process 0 on fuji.local out of 2
>>>>> Process 1 on apex-backpack out of 2
>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am trying to run open-mpi 1.3.3 between a linux box running ubuntu
>>>>>> server v.9.04 and a Macintosh. I have configured openmpi with the
>>>>>> following options:
>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>>>>> --disable-shared --enable-static
>>>>>>
>>>>>> When both the machines are connected to the network via ethernet
>>>>>> cables openmpi works fine.
>>>>>>
>>>>>> But when I switch the linux box to a wireless adapter I can reach
>>>>>> (ping) the macintosh but openmpi hangs on a hello world program.
>>>>>>
>>>>>> I ran:
>>>>>>
>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
>>>>>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>>>>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
>>>>>> localhost,10.11.14.205 /tmp/back
>>>>>>
>>>>>> It hangs on a send/receive function between the two ends. All my
>>>>>> firewalls are turned off at the macintosh end.
>>>>>> PLEASE HELP ASAP.
>>>>>> regards,
>>>>>> pallab
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>> --
>>>>
>>>> =========================
>>>> rolf.vandeva...@sun.com
>>>> 781-442-3043
>>>> =========================
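A side note on the "attempting to connect()" lines in the logs above: the port shown, 9360, lies outside the requested range starting at 36900. The two numbers are 16-bit byte swaps of each other (36900 is 0x9024, 9360 is 0x2490). One possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the verbose message prints the port in network byte order on a little-endian host, in which case the connection attempt would actually be going to 36900. The arithmetic itself can be checked with a few lines of Python (swap16 is just an illustrative helper name):

```python
def swap16(port):
    """Swap the two bytes of a 16-bit port number."""
    return ((port & 0xFF) << 8) | ((port >> 8) & 0xFF)

requested = 36900  # btl_tcp_port_min_v4 from the mpirun command line
logged = 9360      # port printed in the "attempting to connect()" line

print(hex(requested), hex(logged))   # 0x9024 0x2490
print(swap16(requested) == logged)   # True
```

If that reading is right, the port number in the log is not itself a sign that the port-range parameters are being ignored.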