Hi,
I recently discovered a strange bug that occurs when you try to
communicate within mca_coll_*_comm_query() or mca_coll_*_module_init().
Interestingly, it only fails for larger communicators: the run below
works with 8 processes but fails with 50.
Until now, I wasn't sure whether this was a problem in my own
collective component or a bug in Open MPI. Since I've found a case
where it fails even without my component, I'm convinced that I
shouldn't hunt it alone. ;-) Here is that case:
$ mpiexec -np 8 ... --mca coll_hierarch_priority 50 any_app
# runs ok
$ mpiexec -np 50 ... --mca coll_hierarch_priority 50 any_app
[0,1,0][../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:622:mca_btl_tcp_component_recv_handler]
errno=11
mpiexec: killing job...
Kind regards,
Christian