Is this a bug in running Open MPI across a heterogeneous environment (between
a Mac and a Linux machine) over wireless links?
Please suggest what needs to be done, or what I am missing.
Any clues as to how to debug this will be of great help.
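
One MPI-independent check is whether a plain TCP connection can be opened
across the wireless link on a port from the configured range, since ping only
shows ICMP reachability. The following is a minimal sketch in Python, not part
of Open MPI. The helper names open_listener and can_connect are hypothetical;
the address 10.11.14.203 and port 36900 mentioned in the comments are taken
from the log and command line quoted below.

```python
import socket


def open_listener(port: int = 0, host: str = "0.0.0.0") -> socket.socket:
    """Open a listening TCP socket (hypothetical helper for this test only).

    Run this on the machine that the other side tries to connect() to,
    e.g. open_listener(36900) on the host with address 10.11.14.203.
    Port 0 lets the OS pick a free port.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Run this on the other machine, e.g. can_connect("10.11.14.203", 36900).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

If can_connect returns False over wireless but True over Ethernet, the problem
is likely below MPI (routing, access-point isolation, or filtering) rather
than in the heterogeneous build itself.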
thanks and regards, pallab
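
A side note on the quoted log: it reports a connect() to "port 9360" although
btl_tcp_port_min_v4 was set to 36900. As plain arithmetic, 9360 is 0x2490 and
36900 is 0x9024, the same two bytes in swapped order, so the log line may be
printing the port in network byte order rather than showing a port outside the
requested range. This is an observation from the numbers only and has not been
checked against the Open MPI source. The sketch below assumes the range is
[port_min, port_min + port_range); that interpretation is an assumption.

```python
def swap16(port: int) -> int:
    """Swap the two bytes of a 16-bit port number."""
    return ((port & 0xFF) << 8) | ((port >> 8) & 0xFF)


logged_port = 9360   # value printed in the "attempting to connect()" log line
port_min = 36900     # --mca btl_tcp_port_min_v4
port_range = 32      # --mca btl_tcp_port_range_v4

swapped = swap16(logged_port)
# Assumed interpretation: ports port_min .. port_min + port_range - 1 are allowed.
in_range = port_min <= swapped < port_min + port_range

print(swapped, in_range)  # prints: 36900 True
```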

> Hi Rolf,
>
> I ran the following:
>
> pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
> btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205
> /tmp/hello
>
> [fuji.local:02267] mca: base: components_open: Looking for btl components
> [fuji.local:02267] mca: base: components_open: opening btl components
> [fuji.local:02267] mca: base: components_open: found loaded component self
> [fuji.local:02267] mca: base: components_open: component self has no register function
> [fuji.local:02267] mca: base: components_open: component self open function successful
> [fuji.local:02267] mca: base: components_open: found loaded component sm
> [fuji.local:02267] mca: base: components_open: component sm has no register function
> [fuji.local:02267] mca: base: components_open: component sm open function successful
> [fuji.local:02267] mca: base: components_open: found loaded component tcp
> [fuji.local:02267] mca: base: components_open: component tcp has no register function
> [fuji.local:02267] mca: base: components_open: component tcp open function successful
> [fuji.local:02267] select: initializing btl component self
> [fuji.local:02267] select: init of component self returned success
> [fuji.local:02267] select: initializing btl component sm
> [fuji.local:02267] select: init of component sm returned success
> [fuji.local:02267] select: initializing btl component tcp
> [fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "wlan0"
> [fuji.local:02267] select: init of component tcp returned success
> [apex-backpack:31956] mca: base: components_open: Looking for btl components
> [apex-backpack:31956] mca: base: components_open: opening btl components
> [apex-backpack:31956] mca: base: components_open: found loaded component self
> [apex-backpack:31956] mca: base: components_open: component self has no register function
> [apex-backpack:31956] mca: base: components_open: component self open function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component sm
> [apex-backpack:31956] mca: base: components_open: component sm has no register function
> [apex-backpack:31956] mca: base: components_open: component sm open function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component tcp
> [apex-backpack:31956] mca: base: components_open: component tcp has no register function
> [apex-backpack:31956] mca: base: components_open: component tcp open function successful
> [apex-backpack:31956] select: initializing btl component self
> [apex-backpack:31956] select: init of component self returned success
> [apex-backpack:31956] select: initializing btl component sm
> [apex-backpack:31956] select: init of component sm returned success
> [apex-backpack:31956] select: initializing btl component tcp
> [apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "en0"
> [apex-backpack:31956] select: init of component tcp returned success
> Process 0 on fuji.local out of 2
> Process 1 on apex-backpack out of 2
> [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>
>
>
> It launches the processes on both ends and then hangs at the send/receive
> part.
> What is the other thing you were mentioning that makes you think it is
> not working?
> Please suggest.
> --regards, pallab
>
>
>
>> The --enable-heterogeneous option should do the trick.  And to answer the
>> previous question, yes, put both of the interfaces in the include list.
>>
>> --mca btl_tcp_if_include en0,wlan0
>>
>> If that does not work, then I may have one other thought why it might
>> not work although perhaps not a solution.
>>
>> Rolf
>>
>> Pallab Datta wrote:
>>> Hi Rolf,
>>>
>>> Do I need to configure Open MPI with any specific options apart from
>>> --enable-heterogeneous?
>>> I am currently using
>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>> --disable-static --enable-shared --enable-debug
>>>
>>> on both ends. Is the above correct? Please let me know.
>>> thanks and regards,
>>> pallab
>>>
>>>
>>>> Hi:
>>>> I assume if you wait several minutes then your program will actually
>>>> time out, yes?  I guess I have two suggestions. First, can you run a
>>>> non-MPI job using the wireless?  Something like hostname?  Secondly,
>>>> you may want to specify the specific interfaces you want it to use on
>>>> the two machines.  You can do that via the "--mca btl_tcp_if_include"
>>>> run-time parameter.  Just list the ones that you expect it to use.
>>>>
>>>> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1"  It
>>>> should be --mca mpi_preconnect_mpi 1 if you want to do the connection
>>>> during MPI_Init.
>>>>
>>>> Rolf
>>>>
>>>> Pallab Datta wrote:
>>>>
>>>>> The following is the error dump
>>>>>
>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
>>>>> 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30
>>>>> --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero
>>>>> -H localhost,10.11.14.205 /tmp/hello
>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>> [fuji.local:01316] select: initializing btl component self
>>>>> [fuji.local:01316] select: init of component self returned success
>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>> Process 0 on fuji.local out of 2
>>>>> Process 1 on apex-backpack out of 2
>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
>>>>>> Server 9.04 and a Macintosh. I have configured Open MPI with the
>>>>>> following options:
>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>>>>> --disable-shared --enable-static
>>>>>>
>>>>>> When both machines are connected to the network via Ethernet cables,
>>>>>> Open MPI works fine.
>>>>>>
>>>>>> But when I switch the Linux box to a wireless adapter, I can reach
>>>>>> (ping) the Macintosh, but Open MPI hangs on a hello world program.
>>>>>>
>>>>>> I ran:
>>>>>>
>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
>>>>>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>>>>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205
>>>>>> /tmp/back
>>>>>>
>>>>>> It hangs on a send/receive call between the two ends. All firewalls
>>>>>> are turned off at the Macintosh end. Please help as soon as possible.
>>>>>> regards,
>>>>>> pallab
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>>
>>>> --
>>>>
>>>> =========================
>>>> rolf.vandeva...@sun.com
>>>> 781-442-3043
>>>> =========================
>>>>
>>>
>>>
>>
>>
>
