Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast).
The code rapidly slows to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with SIGBUS or SIGTERM.
I see this with openmpi-1.2.3 or openmpi-1.2.4 on the following setup:

  ofed 1.2
  linux 2.6.19 + patches
  gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
  4 socket, dual core opterons

run as

  mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now uneducated eye it looks as if the root process is rushing ahead and not progressing earlier bcasts.
Is anyone else seeing something similar? Any ideas for workarounds? As a point of reference, mvapich2 0.9.8 works fine.

Thanks,
David
Attachment: bcast-hang.c