I’m not familiar with the openib code, but this looks to me like it may be caused by a change in the openib code itself. Have you looked to see what the diff might be between the two versions?
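For example, a minimal sketch of how to pull that diff (assuming a git clone of the open-mpi/ompi repository and that the v1.6.5 and v1.10.2 release tags are the two versions being compared; in both series the BTL still lives under ompi/mca/btl/openib, per the notes further down this thread):

    # Diff the whole openib BTL, including its connect/ modules where the
    # oob connection code lives (or lived), between the two release tags:
    git clone https://github.com/open-mpi/ompi.git
    cd ompi
    git diff v1.6.5 v1.10.2 -- ompi/mca/btl/openib/ > openib-1.6.5-vs-1.10.2.diff

A second sketch, tied to the retry-count/timeout MCA parameters in the help text quoted below, follows at the end of the quoted thread.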
> On Apr 21, 2017, at 6:45 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>
> I've tried this out, and got the same problem as I sent before.
>
> With the same configuration and command line, 1.6.5 works for me, 1.10 series
> seem not.
>
> Could it also be IB configuration issue? (ib_write/read_bw/lat work fine
> across the two nodes)
>
> Error output below:
>
> [[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to:
> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 2318d80 opcode 32767 vendor error 129 qp_idx 0
>
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 20). The actual timeout value used is calculated as:
>
>     4.096 microseconds * (2^btl_openib_ib_timeout)
>
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
>   Local host:   host1
>   Local device: mlx4_0
>   Peer host:    192.168.2.22
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
>
> -----Original Message-----
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
> Sent: Friday, April 21, 2017 9:41 AM
> To: devel@lists.open-mpi.org
> Subject: Re: [OMPI devel] openib oob module
>
> Folks,
>
> fwiw, i made https://github.com/open-mpi/ompi/pull/3393 and it works for me
> on a mlx4 cluster (Mellanox QDR)
>
> Cheers,
>
> Gilles
>
> On 4/21/2017 1:31 AM, r...@open-mpi.org wrote:
>> I’m not seeing any problem inside the OOB - the problem appears to be
>> in the info being given to it:
>>
>> [host1:16244] 1 more process has sent help message
>> help-mpi-btl-openib.txt / default subnet prefix
>> [host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> [[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
>> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
>> status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0
>>
>> I’ve been searching, and I don’t see that help message anywhere in
>> your output - not sure what happened to it. I do see this in your
>> output - don’t know what it means:
>>
>> [host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb]
>> !!!!!!!!!!!!!!!!!!!!!!!!!
>>
>>> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>>
>>> Forgot to enable oob verbose in my last test. Here is the updated
>>> output file.
>>> Thanks,
>>> Shiqing
>>>
>>> *From:* devel [mailto:devel-boun...@lists.open-mpi.org] *On Behalf Of* r...@open-mpi.org
>>> *Sent:* Thursday, April 20, 2017 4:29 PM
>>> *To:* OpenMPI Devel
>>> *Subject:* Re: [OMPI devel] openib oob module
>>>
>>> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI.
>>> Should be able to restore it. I honestly don’t recall the bug, though :-(
>>> If you want to try reviving it, you can add some debug in there (plus
>>> turn on the OOB verbosity) and I’m happy to help you figure it out.
>>> Ralph
>>>
>>> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>>
>>> Hi Ralph,
>>>
>>> Yes, it’s been a long time. Hope you all are doing well (I believe so :-)).
>>>
>>> I’m working on a virtualization project, and need to run Open MPI
>>> on a unikernel OS (most of OFED is missing/unsupported).
>>>
>>> Actually I’m only focusing on 1.10.2, which still has oob in
>>> ompi. Probably it might be possible to make oob work there? Or
>>> even for the 1.10 branch (as Gilles mentioned)?
>>>
>>> Do you have any clue about the bug in oob back then?
>>>
>>> Regards,
>>> Shiqing
>>>
>>> *From:* devel [mailto:devel-boun...@lists.open-mpi.org] *On Behalf Of* r...@open-mpi.org
>>> *Sent:* Thursday, April 20, 2017 3:49 PM
>>> *To:* OpenMPI Devel
>>> *Subject:* Re: [OMPI devel] openib oob module
>>>
>>> Hi Shiqing!
>>>
>>> Been a long time - hope you are doing well.
>>>
>>> I see no way to bring the oob module back now that the BTLs are
>>> in the OPAL layer - this is why it was removed: the oob is in
>>> ORTE, and thus not accessible from OPAL.
>>> Ralph
>>>
>>> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <shiqing....@huawei.com> wrote:
>>>
>>> Dear all,
>>>
>>> I noticed that the openib oob module was removed a long time ago,
>>> because it wasn’t working anymore and nobody seemed to need it.
>>>
>>> But for some special operating systems, where the rdmacm, udcm
>>> or ibcm kernel support is missing, oob may still be necessary.
>>>
>>> I’m curious if it’s possible to bring this module back? How
>>> difficult would it be to fix the bug in order to make it work
>>> again in the 1.10 branch or later? Thanks a lot.
>>>
>>> Best Regards,
>>> Shiqing
>>>
>>> <output.txt>
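On the retry-count help text and the OOB-verbosity suggestion quoted above, here is a minimal command-line sketch. The MCA parameter names in the first command come straight from the quoted messages; the hostnames, the application name ./your_app, and the chosen values are only placeholders, and the second command assumes a build in which the oob connect module is actually present (e.g. with Gilles' PR applied):

    # Raise the ACK timeout from the default 20 (4.096 us * 2^20, about 4.3 s)
    # to 24 (4.096 us * 2^24, about 68.7 s), keep the retry count at its
    # maximum of 7, and disable help-message aggregation so every error shows:
    mpirun -np 2 -host host1,192.168.2.22 \
           --mca btl_openib_ib_retry_count 7 \
           --mca btl_openib_ib_timeout 24 \
           --mca orte_base_help_aggregate 0 \
           ./your_app

    # For debugging the oob connect path, force that connect module and turn
    # up BTL/OOB verbosity (the verbosity levels here are arbitrary):
    mpirun -np 2 -host host1,192.168.2.22 \
           --mca btl_openib_cpc_include oob \
           --mca btl_base_verbose 100 \
           --mca oob_base_verbose 100 \
           ./your_app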