Re: [OMPI devel] openib oob module

2017-04-21 Thread Shiqing Fan
The gap between these two versions is quite huge. I will first try to debug a 
bit more in 1.10. 

Regards,
Shiqing

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Friday, April 21, 2017 4:02 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module

I’m not familiar with the openib code, but this looks to me like it may be 
caused by a change in the openib code itself. Have you looked to see what the 
diff might be between the two versions?


Re: [OMPI devel] openib oob module

2017-04-21 Thread r...@open-mpi.org
I’m not familiar with the openib code, but this looks to me like it may be 
caused by a change in the openib code itself. Have you looked to see what the 
diff might be between the two versions?

> On Apr 21, 2017, at 6:45 AM, Shiqing Fan wrote:
> 
> I've tried this out, and got the same problem as I sent before. 
> 
> With the same configuration and command line, 1.6.5 works for me, but the 1.10 
> series does not.
> 
> Could it also be an IB configuration issue? (ib_write/read_bw/lat work fine 
> across the two nodes)
> 
> Error output below:
> 
> [[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to: 
> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status 
> number 12 for wr_id 2318d80 opcode 32767  vendor error 129 qp_idx 0
> 
> --
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>The total number of times that the sender wishes the receiver to
>retry timeout, packet sequence, etc. errors before posting a
>completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 20).  The actual timeout value used is calculated as:
> 
> 4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>  Local host:   host1
>  Local device: mlx4_0
>  Peer host:    192.168.2.22
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --
> 

Re: [OMPI devel] openib oob module

2017-04-21 Thread Shiqing Fan
I've tried this out, and got the same problem as I sent before. 

With the same configuration and command line, 1.6.5 works for me, but the 1.10 
series does not.

Could it also be an IB configuration issue? (ib_write/read_bw/lat work fine across 
the two nodes)

Error output below:

[[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to: 
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number 
12 for wr_id 2318d80 opcode 32767  vendor error 129 qp_idx 0
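
As an aside for anyone decoding the line above: its fields can be pulled apart mechanically. The following is a minimal Python sketch and not part of Open MPI. The regular expression and the names `parse_handle_wc` and `WC_STATUS_NAMES` are illustrative assumptions, wr_id is assumed to be printed in hexadecimal, and the two status names are the libibverbs `enum ibv_wc_status` entries for values 12 and 13.

```python
import re

# Subset of libibverbs' enum ibv_wc_status: only the two retry-related codes.
WC_STATUS_NAMES = {
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern for the handle_wc error line quoted in this thread.
HANDLE_WC_RE = re.compile(
    r"from (?P<local>\S+) to: (?P<peer>\S+) error polling (?P<cq>\w+) CQ "
    r"with status (?P<status_text>.+?) status number (?P<status>\d+) "
    r"for wr_id (?P<wr_id>[0-9a-fA-F]+) opcode (?P<opcode>\d+) "
    r"vendor error (?P<vendor_err>\d+) qp_idx (?P<qp_idx>\d+)"
)


def parse_handle_wc(line):
    """Return the fields of a handle_wc error line as a dict, or None if no match."""
    # Collapse line wraps and double spaces so the pattern can use single spaces.
    normalized = " ".join(line.split())
    match = HANDLE_WC_RE.search(normalized)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "local": match.group("local"),
        "peer": match.group("peer"),
        "cq": match.group("cq"),
        "status": status,
        "status_name": WC_STATUS_NAMES.get(status, "unknown"),
        "wr_id": int(match.group("wr_id"), 16),  # assumed to be hexadecimal
        "vendor_err": int(match.group("vendor_err")),
        "qp_idx": int(match.group("qp_idx")),
    }


if __name__ == "__main__":
    sample = (
        "[[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to: "
        "192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number "
        "12 for wr_id 2318d80 opcode 32767  vendor error 129 qp_idx 0"
    )
    print(parse_handle_wc(sample))
```

The opcode is deliberately left out of the result: according to the ibv_poll_cq man page, only wr_id, status, qp_num and vendor_err are valid on a failed completion, so the 32767 printed here probably should not be interpreted.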

--
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 20).  The actual timeout value used is calculated as:

 4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   host1
  Local device: mlx4_0
  Peer host:    192.168.2.22

You may need to consult with your system administrator to get this
problem fixed.
--
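
For reference, the timeout formula in the help text above is easy to evaluate. The sketch below is only an illustration: the function names are made up for this note, and the "give-up" estimate rests on the simplifying assumption that every retry waits exactly one full timeout, which real hardware need not do.

```python
# 4.096 microseconds is the base unit quoted in the help message.
ACK_TIMEOUT_BASE_US = 4.096


def local_ack_timeout_seconds(ib_timeout):
    """Evaluate 4.096 microseconds * 2^btl_openib_ib_timeout, in seconds."""
    return ACK_TIMEOUT_BASE_US * (2 ** ib_timeout) / 1_000_000


def rough_give_up_seconds(ib_timeout, retry_count):
    """Back-of-the-envelope estimate: assumes each retry waits one full timeout."""
    return local_ack_timeout_seconds(ib_timeout) * retry_count


if __name__ == "__main__":
    # Defaults named in the help message: timeout 20, retry count 7.
    for exponent in (18, 20, 22):
        print(exponent, round(local_ack_timeout_seconds(exponent), 3), "s")
    print("rough give-up time:", round(rough_give_up_seconds(20, 7), 1), "s")
```

With the default of 20, a single timeout is a little under 4.3 seconds. Both values are ordinary MCA parameters, so they can be changed on the mpirun command line, for example with `--mca btl_openib_ib_timeout 22`.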

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Friday, April 21, 2017 9:41 AM
To: devel@lists.open-mpi.org
Subject: Re: [OMPI devel] openib oob module

Folks,


fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for me on 
a mlx4 cluster (Mellanox QDR)


Cheers,


Gilles



[OMPI devel] rpm filename (was: Program which runs wih 1.8.3, fails with 2.0.2)

2017-04-21 Thread Jeff Squyres (jsquyres)
On Apr 19, 2017, at 10:12 PM, Kevin Buckley wrote:
> 
> This observation may not belong here, but as there are some eyes on this
> issue, I might as well raise it here, as I came across it in the wake of going
> with Choice 2
> 
> If one wishes to take a nightly tarball and try and use the
> existing version's SRPM build infrastructure to create an RPM,
> then you will fall foul of the dashes in the nightly tarball
> names.
> 
> That is, if you try putting in
> 
> Version: v2.0.x-201704190318-24b5b83
> 
> instead of the original
> 
> Version: 2.0.2
> 
> you'll be told, when coming to do an rpmbuild, (your line may differ)
> 
> error: line 188: Illegal char '-' in: Version: v2.0.x-201704190318-24b5b83
> 
> You can get round this by
> 
>  1) Un-tar-ing the nightly tarball
>  2) Renaming it so that the dashes become dots
>  3) Recreating the newly named nightly tarball
> 
> if you then use
> 
> Version: v2.0.x.201704190318.24b5b83
> 
> the rpmbuild runs (as in it's running as I write !).
> 
> As to whether that observation might inform a different convention
> for naming the nightly tarballs, that is left up to the real developers.

Doh!  That's certainly an unintended consequence.  Thanks for reporting.

I'll open an issue on this.  We have several nightly scripts and whatnot that 
use this version, so I don't want to just make this change without checking a 
few other things first.
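
The three-step workaround quoted above (un-tar, rename dashes to dots, re-tar) can be sketched in Python as follows. This is a rough illustration and not an official Open MPI script: the function names and the file layout are assumptions, and link targets inside the tarball are not rewritten.

```python
import tarfile


def rpm_safe_version(version):
    """Turn the dashes that rpmbuild rejects in a Version: field into dots."""
    return version.replace("-", ".")


def repack_tarball(src_path, dst_path, version):
    """Copy src_path to dst_path, renaming `version` in every member path.

    Returns the dash-free version string to use in the spec file.
    """
    safe = rpm_safe_version(version)
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            # Regular files need their data copied; directories and links do not.
            data = src.extractfile(member) if member.isfile() else None
            member.name = member.name.replace(version, safe, 1)
            # Drop any stored pax "path" record so the new name is the one written.
            member.pax_headers.pop("path", None)
            dst.addfile(member, data)
    return safe


if __name__ == "__main__":
    print(rpm_safe_version("v2.0.x-201704190318-24b5b83"))
```

The value returned by `repack_tarball` is what would then go into the `Version:` line of the spec file, matching the dotted form shown in the quoted message.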

-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] openib oob module

2017-04-21 Thread Shiqing Fan
Thanks Gilles, I will try it out today and let you know if it's working for me 
or not.

Regards,
Shiqing

-----Original Message-----
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Friday, April 21, 2017 9:41 AM
To: devel@lists.open-mpi.org
Subject: Re: [OMPI devel] openib oob module

Folks,


fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for me on 
a mlx4 cluster (Mellanox QDR)


Cheers,


Gilles



Re: [OMPI devel] openib oob module

2017-04-21 Thread Shiqing Fan
The last message came from my own test output; it is not meaningful anyway.

It looks like a QP/CQ initialization problem, but it’s hard to find the 
exact place at the moment. I will try Gilles’ patch and see if it works for me.

PS: I actually made my patch from the 1.10 series at the point where OOB was 
removed. Gilles’ patch was made from 1.6.x, which also worked for me.


Thanks,
Shiqing

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, April 20, 2017 6:32 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module

I’m not seeing any problem inside the OOB - the problem appears to be in the 
info being given to it:

[host1:16244] 1 more process has sent help message help-mpi-btl-openib.txt / 
default subnet prefix
[host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help 
/ error messages
[[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to: 
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number 
12 for wr_id 112db80 opcode 32767  vendor error 129 qp_idx 0

I’ve been searching, and I don’t see that help message anywhere in your output 
- not sure what happened to it. I do see this in your output - don’t know what 
it means:

[host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb] 
!


On Apr 20, 2017, at 8:36 AM, Shiqing Fan wrote:

Forgot to enable oob verbose in my last test. Here is the updated output file.

Thanks,
Shiqing

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, April 20, 2017 4:29 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module

Yeah, I forgot that the 1.10 series still had the BTLs in OMPI. Should be able 
to restore it. I honestly don’t recall the bug, though :-(

If you want to try reviving it, you can add some debug in there (plus turn on 
the OOB verbosity) and I’m happy to help you figure it out.
Ralph

On Apr 20, 2017, at 7:13 AM, Shiqing Fan wrote:

Hi Ralph,

Yes, it’s been a long time. Hope you all are doing well (I believe so ☺).

I’m working on a virtualization project, and need to run Open MPI on a 
unikernel OS (most of OFED is missing/unsupported).

Actually I’m only focusing on 1.10.2, which still has oob in ompi. Perhaps it 
is possible to make oob work there? Or even for the 1.10 branch (as Gilles 
mentioned)?
Do you have any clue about the bug in oob back then?

Regards,
Shiqing


From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, April 20, 2017 3:49 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module

Hi Shiqing!

Been a long time - hope you are doing well.

I see no way to bring the oob module back now that the BTLs are in the OPAL 
layer - this is why it was removed as the oob is in ORTE, and thus not 
accessible from OPAL.
Ralph

On Apr 20, 2017, at 6:02 AM, Shiqing Fan wrote:

Dear all,

I noticed that the openib oob module was removed a long time ago, because it 
was no longer working and nobody seemed to need it.
But for some special operating systems, where the rdmacm, udcm or ibcm kernel 
support is missing, oob may still be necessary.

I’m curious whether it’s possible to bring this module back. How difficult would 
it be to fix the bug in order to make it work again in the 1.10 branch or later? 
Thanks a lot.

Best Regards,
Shiqing

Re: [OMPI devel] openib oob module

2017-04-21 Thread Gilles Gouaillardet

Folks,


fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for 
me on a mlx4 cluster (Mellanox QDR)



Cheers,


Gilles

