On Tue, 2009-06-16 at 13:39 -0600, Bryan Lally wrote:
> Ashley Pittman wrote:
> >
> > Whilst the fact that it appears to only happen on your machine implies
> > it's not a general problem with OpenMPI, the fact that it happens in
> > the same location/rep count every time does swing the blame back the
> > other way.
>
> This sounds a _lot_ like the problem I was seeing; my initial message is
> appended here.  If it's the same thing, then it's not only on the big
> machines here that Ralph was talking about, but on very vanilla Fedora 7
> and 9 boxes.
>
> I was able to hang Ralph's reproducer on an 8 core Dell, Fedora 9,
> kernel 2.6.27(.4-78.2.53.fc9.x86_64).
>
> I don't think it's just the one machine and its configuration.
Interesting.  In Ralph's case the hangs I've seen are where the
application calls Bcast but the MPI library calls Barrier underneath it
(apparently it does this every 1000 collectives).  It could be that any
call to Barrier at this point would hang, or it could be something
special about the subverted call that is causing the problem.

Do you have a stack trace of your hung application to hand?  In
particular, when you say "All processes have made the same call to
MPI_Allreduce.  The processes are all in opal_progress, called (with
intervening calls) by MPI_Allreduce.", do the intervening calls include
mca_coll_sync_bcast, ompi_coll_tuned_barrier_intra_dec_fixed and
ompi_coll_tuned_barrier_intra_recursivedoubling?

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
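P.S. For anyone who wants to poke at this locally, below is a minimal
sketch of the kind of collective loop that should cross the internal
sync barrier described above.  The 1000-collective interval is taken
from the discussion in this thread; the iteration count, file name and
everything else in the sketch are illustrative assumptions, not Ralph's
actual reproducer.

/*
 * bcast_loop.c - sketch of a reproducer for the hang discussed above.
 *
 * Loops over MPI_Bcast often enough to cross the point at which Open
 * MPI's sync collective component reportedly inserts its own internal
 * barrier (every 1000 collectives, per the discussion above).  On an
 * affected machine the job should hang partway through rather than
 * reaching MPI_Finalize.
 *
 * Build:  mpicc -o bcast_loop bcast_loop.c
 * Run:    mpirun -np 8 ./bcast_loop
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    int buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 10000 iterations is an arbitrary choice that crosses the
     * reported 1000-collective interval many times over. */
    for (i = 0; i < 10000; i++) {
        buf = i;
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0 && i % 1000 == 0)
            printf("completed %d broadcasts\n", i);
    }

    MPI_Finalize();
    return 0;
}

If it does hang, stack traces from every rank (which padb, above, can
collect) would show whether the same barrier frames appear under the
application's collective call.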