David --

Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib BTL because it doesn't seem to happen on other networks, but who knows...).

Gleb is investigating.



On Oct 5, 2007, at 12:59 AM, David Daniel wrote:

Hi Folks,

I have been seeing some nasty behaviour in collectives, particularly bcast and reduce. Attached is a reproducer (for bcast).

The code will rapidly slow to a crawl (usually interpreted as a hang in real applications) and sometimes gets killed with sigbus or sigterm.

I see this with

  openmpi-1.2.3 or openmpi-1.2.4
  ofed 1.2
  linux 2.6.19 + patches
  gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
  4 socket, dual core opterons

run as

  mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang

To my now-uneducated eye it looks as if the root process is rushing ahead and not progressing earlier bcasts.
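
For reference, a minimal sketch of that pattern (not the attached bcast-hang.c itself, just an unsynchronized MPI_Bcast loop of the kind that lets the root post many broadcasts before the slower ranks have progressed earlier ones):

  /* sketch only: buffer size and iteration count are made up */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NITER 100000
  #define COUNT 1024

  int main(int argc, char **argv)
  {
      int rank, i;
      int *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      buf = malloc(COUNT * sizeof(int));
      for (i = 0; i < COUNT; ++i)
          buf[i] = i;

      /* repeated bcasts with no other synchronization: the root can run
         ahead while non-root ranks fall behind on earlier broadcasts */
      for (i = 0; i < NITER; ++i) {
          MPI_Bcast(buf, COUNT, MPI_INT, 0, MPI_COMM_WORLD);
          if (rank == 0 && i % 1000 == 0)
              printf("iteration %d\n", i);
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }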

Anyone else seeing similar?  Any ideas for workarounds?

As a point of reference, mvapich2 0.9.8 works fine.

Thanks, David


<bcast-hang.c>


--
Jeff Squyres
Cisco Systems
