Hi all,

We've been trying to track down an IB issue here where a user's code (Gromacs, run with OMPI 1.4.5) was dying with:

[[18115,1],2][btl_openib_component.c:3224:handle_wc] from bruce030 to: bruce130 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 7406080 opcode 0 vendor error 129 qp_idx 2

The odd thing I've spotted, though, is that the error text says:

* btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10).

Those don't match the values compiled into OMPI 1.4.5:

ompi_info -a | egrep 'btl_openib_ib_min_rnr_timer|btl_openib_ib_timeout'
MCA btl: parameter "btl_openib_ib_min_rnr_timer" (current value: "25", data source: default value)
MCA btl: parameter "btl_openib_ib_timeout" (current value: "20", data source: default value)

It looks like the file:

ompi/mca/btl/openib/help-mpi-btl-openib.txt

needs to be updated with the correct values.

We're stuck on 1.4 for the foreseeable future (too many apps to recompile), so I don't know whether 1.5+ has the same issue.

It's been there since at least 2009.. :-)

http://www.open-mpi.org/community/lists/users/2009/03/8600.php

cheers!
Chris
-- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/