Folks,
FWIW, I made https://github.com/open-mpi/ompi/pull/3393 and it works for
me on an mlx4 cluster (Mellanox QDR).
Cheers,
Gilles
On 4/21/2017 1:31 AM, r...@open-mpi.org wrote:
I’m not seeing any problem inside the OOB - the problem appears to be
in the info being given to it:
[host1:16244] 1 more process has sent help message
help-mpi-btl-openib.txt / default subnet prefix
[host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
[[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0
I’ve been searching, and I don’t see that help message anywhere in
your output - not sure what happened to it. I do see this in your
output - don’t know what it means:
[host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb]
!!!!!!!!!!!!!!!!!!!!!!!!!
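For anyone reading the log excerpt above: the handle_wc line packs the peer
address, the completion status and the queue-pair index into one wrapped
message. Below is a minimal Python sketch for pulling those fields out so they
can be compared across runs. It assumes the line always follows the layout
shown in this thread; the pattern and the parse_wc_error helper are
hypothetical and not part of Open MPI. As far as I know, status number 12
corresponds to IBV_WC_RETRY_EXC_ERR in the libibverbs ibv_wc_status enum.

```python
import re

# Assumed layout of the one-line work-completion error printed by the
# openib BTL, taken from the log excerpt in this thread.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>\w+) CQ "
    r"with status (?P<status>.+?) status number (?P<code>\d+) "
    r"for wr_id (?P<wr_id>[0-9a-fA-F]+) opcode (?P<opcode>\d+) "
    r"vendor error (?P<vendor>\d+) qp_idx (?P<qp_idx>\d+)"
)


def parse_wc_error(text):
    """Return the fields of a handle_wc error line as a dict, or None."""
    # The message is wrapped over several lines, so collapse whitespace first.
    match = WC_ERROR.search(" ".join(text.split()))
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the decimal fields; wr_id is left as a hex string.
    for key in ("code", "opcode", "vendor", "qp_idx"):
        fields[key] = int(fields[key])
    return fields


sample = """[[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0"""

info = parse_wc_error(sample)
print(info["dst"], info["status"], info["code"])
```

Run against the excerpt, this reports the peer 192.168.2.22 with status
RETRY EXCEEDED ERROR (12), which is the connection the OOB exchange was
supposed to set up.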
On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
I forgot to enable OOB verbosity in my last test. Here is the updated
output file.
Thanks,
Shiqing
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
Sent: Thursday, April 20, 2017 4:29 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module
Yeah, I forgot that the 1.10 series still had the BTLs in OMPI.
Should be able to restore it. I honestly don’t recall the bug, though :-(
If you want to try reviving it, you can add some debug in there (plus
turn on the OOB verbosity) and I’m happy to help you figure it out.
Ralph
On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
Hi Ralph,
Yes, it’s been a long time. Hope you all are doing well (I
believe so :-)).
I’m working on a virtualization project, and need to run Open MPI
on a unikernel OS (most of OFED is missing/unsupported).
Actually I’m only focusing on 1.10.2, which still has oob in
ompi. Might it be possible to make oob work there? Or
even for the 1.10 branch (as Gilles mentioned)?
Do you have any clue about the bug in oob back then?
Regards,
Shiqing
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of r...@open-mpi.org
Sent: Thursday, April 20, 2017 3:49 PM
To: OpenMPI Devel
Subject: Re: [OMPI devel] openib oob module
Hi Shiqing!
Been a long time - hope you are doing well.
I see no way to bring the oob module back now that the BTLs are
in the OPAL layer. That is why it was removed: the oob is in
ORTE, and thus not accessible from OPAL.
Ralph
On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
Dear all,
I noticed that the openib oob module was removed a long time
ago, because it wasn’t working anymore and nobody
seemed to need it.
But for some special operating systems, where the rdmacm, udcm
or ibcm kernel support is missing, oob may still be necessary.
Is it possible to bring this module back? How
difficult would it be to fix the bug in order to make it work
again in the 1.10 branch or later? Thanks a lot.
Best Regards,
Shiqing
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel