Hi Ashley,

Thanks! I would definitely be interested and will take a look at the tool. In the meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you could take a glance at that and offer some thoughts?
https://svn.open-mpi.org/trac/ompi/ticket/1944

I will be back after I look at the tool. Thanks again,
Ralph

On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman <[email protected]> wrote:

> Ralph,
>
> If I may say, this is exactly the type of problem the tool I have been
> working on recently aims to help with, and I'd be happy to help you
> through it.
>
> Firstly, of the three collectives you mention, MPI_Allgather exhibits
> a many-to-many communication pattern, MPI_Reduce a many-to-one, and
> MPI_Bcast a one-to-many. The scenario of a root process falling behind
> and getting swamped in comms is a plausible one for MPI_Reduce only,
> but doesn't hold water for the other two. You also don't mention
> whether the loop is over a single collective or whether each iteration
> calls a number of different collectives.
>
> padb, the tool I've been working on, can look at parallel jobs and
> report on the state of collective comms, and should help you narrow
> down which processes are erroneous and which are simply blocked
> waiting for comms. I'd recommend using it to look at four or five
> instances where the application has hung and checking for common
> features between them.
>
> Let me know if you are willing to try this route and we'll talk. The
> code is downloadable from http://padb.pittman.org.uk, and if you want
> the full collective functionality you'll need to patch Open MPI with
> the patch from http://padb.pittman.org.uk/extensions.html
>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
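[Editor's note: for concreteness, here is a minimal sketch of the kind of
loop under discussion, with rank 0 as the root of MPI_Reduce and MPI_Bcast.
This is hypothetical illustration code, not Ralph's actual application; the
usleep() stands in for whatever extra work makes the root fall behind.]

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int in = rank, sum = 0;
        int *all = malloc(size * sizeof(int));

        for (int iter = 0; iter < 1000; iter++) {
            if (rank == 0)
                usleep(10000);  /* simulate the root falling behind */

            MPI_Allgather(&in, 1, MPI_INT, all, 1, MPI_INT,
                          MPI_COMM_WORLD);                   /* many-to-many */
            MPI_Reduce(&in, &sum, 1, MPI_INT, MPI_SUM, 0,
                       MPI_COMM_WORLD);                      /* many-to-one  */
            MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* one-to-many  */
        }

        if (rank == 0)
            printf("final sum = %d\n", sum);
        free(all);
        MPI_Finalize();
        return 0;
    }

With an artificial delay on the root, only the MPI_Reduce step builds up a
backlog of incoming messages at rank 0: in MPI_Bcast the root is the sender,
and in MPI_Allgather every rank both sends and receives, which is Ashley's
point about which hang scenario is plausible for which collective.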
