I've tried this out and got the same problem as the one I reported before.
With the same configuration and command line, 1.6.5 works for me, but the 1.10
series does not.
Could it also be an IB configuration issue? (ib_write/read_bw/lat work fine across
the two nodes.)
Error output below:
[[39776,1],0][btl_openib_component.c:3502:handle_wc] from vrdma-host1 to:
192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR status number
12 for wr_id 2318d80 opcode 32767 vendor error 129 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 20). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
Local host: host1
Local device: mlx4_0
Peer host: 192.168.2.22
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
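For reference, the ACK timeout formula quoted in the help text above can be evaluated for a few values of btl_openib_ib_timeout. This is only a minimal sketch: the function name is hypothetical, and the only thing taken from the error message is the formula 4.096 microseconds * (2^btl_openib_ib_timeout).

```python
# Sketch: evaluate the local ACK timeout formula from the Open MPI help text.
# The function name is illustrative and not part of Open MPI.

BASE_MICROSECONDS = 4.096  # constant quoted in the help message


def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return 4.096 us * 2**ib_timeout, converted to seconds."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be non-negative")
    return BASE_MICROSECONDS * 1e-6 * (2 ** ib_timeout)


if __name__ == "__main__":
    # 20 is the default mentioned in the help text; the others are for comparison.
    for value in (18, 20, 22):
        print(f"btl_openib_ib_timeout={value}: {ack_timeout_seconds(value):.3f} s")
```

With the default of 20 this comes to about 4.3 seconds, and each increment of the parameter doubles it.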
-----Original Message-----
From: devel [mailto:[email protected]] On Behalf Of Gilles
Gouaillardet
Sent: Friday, April 21, 2017 9:41 AM
To: [email protected]
Subject: Re: [OMPI devel] openib oob module
Folks,
FWIW, I made https://github.com/open-mpi/ompi/pull/3393 and it works for me on
an mlx4 cluster (Mellanox QDR).
Cheers,
Gilles
On 4/21/2017 1:31 AM, [email protected] wrote:
> I’m not seeing any problem inside the OOB - the problem appears to be
> in the info being given to it:
>
> [host1:16244] 1 more process has sent help message
> help-mpi-btl-openib.txt / default subnet prefix
> [host1:16244] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [[46697,1],0][btl_openib_component.c:3501:handle_wc] from host1 to:
> 192.168.2.22 error polling LP CQ with status RETRY EXCEEDED ERROR
> status number 12 for wr_id 112db80 opcode 32767 vendor error 129 qp_idx 0
>
> I’ve been searching, and I don’t see that help message anywhere in
> your output - not sure what happened to it. I do see this in your
> output - don’t know what it means:
>
> [host1][[46697,1],0][connect/btl_openib_connect_oob.c:935:rml_recv_cb]
> !!!!!!!!!!!!!!!!!!!!!!!!!
>
>
>> On Apr 20, 2017, at 8:36 AM, Shiqing Fan <[email protected]> wrote:
>>
>> Forgot to enable oob verbose in my last test. Here is the updated
>> output file.
>> Thanks,
>> Shiqing
>> From: devel [mailto:[email protected]] On Behalf
>> Of [email protected]
>> Sent: Thursday, April 20, 2017 4:29 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] openib oob module
>> Yeah, I forgot that the 1.10 series still had the BTLs in OMPI.
>> Should be able to restore it. I honestly don’t recall the bug, though :-(
>> If you want to try reviving it, you can add some debug in there (plus
>> turn on the OOB verbosity) and I’m happy to help you figure it out.
>> Ralph
>>
>> On Apr 20, 2017, at 7:13 AM, Shiqing Fan <[email protected]> wrote:
>> Hi Ralph,
>> Yes, it’s been a long time. Hope you all are doing well (I
>> believe so :-)).
>> I’m working on a virtualization project, and need to run Open MPI
>> on a unikernel OS (most of OFED is missing/unsupported).
>> Actually I’m only focusing on 1.10.2, which still has oob in
>> ompi. Perhaps it is possible to make oob work there? Or
>> even for the 1.10 branch (as Gilles mentioned)?
>> Do you have any clue about the bug in oob back then?
>> Regards,
>> Shiqing
>> From: devel [mailto:[email protected]] On Behalf
>> Of [email protected]
>> Sent: Thursday, April 20, 2017 3:49 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] openib oob module
>> Hi Shiqing!
>> Been a long time - hope you are doing well.
>> I see no way to bring the oob module back now that the BTLs are
>> in the OPAL layer - this is why it was removed as the oob is in
>> ORTE, and thus not accessible from OPAL.
>> Ralph
>>
>> On Apr 20, 2017, at 6:02 AM, Shiqing Fan <[email protected]> wrote:
>> Dear all,
>> I noticed that the openib oob module was removed a long time
>> ago, because it wasn’t working anymore and nobody seemed to
>> need it.
>> But on some special operating systems, where rdmacm, udcm
>> or ibcm kernel support is missing, oob may still be necessary.
>> I’m curious whether it’s possible to bring this module back. How
>> difficult would it be to fix the bug in order to make it work
>> again in the 1.10 branch or later? Thanks a lot.
>> Best Regards,
>> Shiqing
>> _______________________________________________
>> devel mailing list
>> [email protected] <mailto:[email protected]>
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>> <output.txt>
>
>
>