Hi,

recently I discovered a strange bug that occurs when you try to communicate within mca_coll_*_comm_query() or mca_coll_*_module_init(). Interestingly, it only fails for larger communicators. Until now I wasn't sure whether this was a problem in my own collective component or a bug in Open MPI itself, but since I've found a case where it fails even without my component, I'm convinced that I shouldn't hunt it alone. ;-)

$ mpiexec -np 8 ... --mca coll_hierarch_priority 50 any_app
# runs ok
$ mpiexec -np 50 ... --mca coll_hierarch_priority 50 any_app
[0,1,0][../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:622:mca_btl_tcp_component_recv_handler] errno=11
mpiexec: killing job...

Kind regards,
  Christian
