Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast).

The code rapidly slows to a crawl (usually interpreted as a hang in real applications) and is sometimes killed with SIGBUS or SIGTERM.

I see this with

  openmpi-1.2.3 or openmpi-1.2.4
  ofed 1.2
  linux 2.6.19 + patches
  gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
  4-socket, dual-core Opterons

run as

  mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now uneducated eye, it looks as if the root process is rushing ahead and not progressing earlier bcasts.

Anyone else seeing similar?  Any ideas for workarounds?

As a point of reference, mvapich2 0.9.8 works fine.

Thanks, David



Attachment: bcast-hang.c
