Hi all,

We've been trying to track down an IB issue here where a
user's code (Gromacs, run with OMPI 1.4.5) was dying with:

[[18115,1],2][btl_openib_component.c:3224:handle_wc] from bruce030 to: bruce130
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12
for wr_id 7406080 opcode 0 vendor error 129 qp_idx 2

The odd thing I've spotted, though, is that the error message goes on to say:

* btl_openib_ib_retry_count - The number of times the sender will attempt to
  retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10).

Those don't match the values compiled into OMPI 1.4.5:

ompi_info -a | egrep 'btl_openib_ib_min_rnr_timer|btl_openib_ib_timeout'
 MCA btl: parameter "btl_openib_ib_min_rnr_timer" (current value: "25", data source: default value)
 MCA btl: parameter "btl_openib_ib_timeout" (current value: "20", data source: default value)

It looks like the file:

 ompi/mca/btl/openib/help-mpi-btl-openib.txt

needs to be updated with the correct values.
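
In case anyone wants to check their own install, here is a rough sketch of
pulling a parameter's current value out of ompi_info-style output so it can be
compared by eye against the help text. The function name mca_default is my own
invention (it is not part of OMPI), the sample input is just the two output
lines quoted above, and it assumes each parameter sits on a single unwrapped
line.

```shell
# Hypothetical helper (not part of OMPI): print the "current value" of one
# MCA parameter from ompi_info-style text read on stdin.
# Assumes each parameter appears on a single, unwrapped line.
mca_default() {
    sed -n "s/.*parameter \"$1\" (current value: \"\([^\"]*\)\".*/\1/p"
}

# Sample input: the two ompi_info lines quoted in this message, unwrapped.
sample='MCA btl: parameter "btl_openib_ib_min_rnr_timer" (current value: "25", data source: default value)
MCA btl: parameter "btl_openib_ib_timeout" (current value: "20", data source: default value)'

# Print the value ompi_info reports for the local ACK timeout.
printf '%s\n' "$sample" | mca_default btl_openib_ib_timeout
```

On a real system the sample would be replaced by piping in "ompi_info -a",
provided its output is not line-wrapped the way my mail client wrapped it.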

We're stuck on 1.4 for the foreseeable future (too many apps to
recompile), so I don't know if 1.5+ has the same issue.

It's been there since at least 2009.. :-)

http://www.open-mpi.org/community/lists/users/2009/03/8600.php

cheers!
Chris
-- 
   Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
         http://www.vlsci.unimelb.edu.au/
