On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
>
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944
>
> Will be back after I look at the tool.
Have you made any progress? Whilst the fact that it appears to happen only on your machine implies it's not a general problem with OpenMPI, the fact that it hangs at the same location/rep count every time does swing the blame back the other way. Perhaps it's some special configure or runtime option you are setting?

One thing that springs to mind is that the numa-maps could be exposing some timing problem with the shared-memory calls; however, that doesn't sit well with it always failing on the same iteration.

Can you provide stack traces from when it's hung, and, crucially, are they the same for every hang? If you change the coll_sync_barrier_before value to make it hang on a different repetition, does this change the stack trace at all? Likewise, once you have applied the collectives patch, is the collective state the same for every hang, and how does it differ when you change the coll_sync_barrier_before variable?

It would be useful to see stack traces and collective state for each of the three collectives you report as causing problems (MPI_Bcast, MPI_Reduce and MPI_Allgather) because, as I said before, these three collectives have radically different communication patterns.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
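P.S. In case it's useful, below is the kind of trivial loop I had in mind when asking about the rep count; it's only a sketch of my own, not the test case from the ticket, and the rep count, message sizes and executable name are just placeholders. It cycles the three collectives you mention and prints the iteration number on rank 0 so the repetition at which it hangs is obvious.

/* Hypothetical reproducer sketch (not the ticket #1944 test case).
 * Loops MPI_Bcast, MPI_Reduce and MPI_Allgather and reports the
 * iteration number so a hang at a fixed rep count stands out. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int reps = (argc > 1) ? atoi(argv[1]) : 10000;  /* placeholder count */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int val = rank;
    int sum = 0;
    int *gathered = malloc(size * sizeof(int));

    for (i = 0; i < reps; i++) {
        if (rank == 0 && i % 100 == 0)
            printf("iteration %d\n", i);

        MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Allgather(&rank, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);
    }

    free(gathered);
    MPI_Finalize();
    return 0;
}

Run with the sync barrier interval you want to test, e.g. something along the lines of "mpirun -np 4 --mca coll_sync_barrier_before 50 ./collhang 10000" (the parameter name is the one you mention above; the process count and values are arbitrary), and compare where it stops as you vary the interval.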