On May 9, 2010, at 8:47 AM, Jed Brown wrote:

> On Sat, 8 May 2010 17:42:33 -0500, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> Now that a comm is passed into the error handler, how we use it is
>> still preliminary and a work in progress. Likely it will evolve as
>> we figure out what to do.
>>
>> The reason I don't have non-root call MPI_Abort() (even after waiting)
>> is that MPI_Abort() will trigger all the other processes to abort? If
>> the root is "late" getting to the error then it will receive an abort
>> from the non-root MPI_Abort() and never execute the traceback hence no
>> error message; bad news. At least I think this might happen.
>
> Alternatively this happens:
>
>   mpirun has exited due to process rank 1 with PID 4388 on
>   node kunyang exiting without calling "finalize". This may
>   have caused other processes in the application to be
>   terminated by signals sent by mpirun (as reported here).
>
> But there is a crucial behavioral change. The user used to be able to
> catch the error at any point in the chain and decide not to make it
> fatal. This is no longer possible with the traceback error handler
> (which admittedly isn't the best handler for this handling mechanism).
> I realize that MPI (and thus PETSc) make no guarantees about the state
> after an error occurs, but they might be trying to write some checkpoint
> or release some resources, in which case abort() from the other ranks is
> not desirable.

Alternatives:

* Have all the other processes return silently up the stack so they
  can be "recovered". Note, I have been tempted to rip out the current
  "exception handling" stuff I put in earlier. It is ugly and probably
  fragile.
* One can provide more than one traceback error handler, for example
  one that is just like the traditional PETSc one.

>
>> An alternative to what I have done is to have non-root wait a while
>> and then return with the usual traceback. Thus under normal
>> circumstances it will receive the abort() from root before printing
>> the traceback so we will get one nice traceback from root. (will it?)
>> Under strange circumstances where root for some reason doesn't get to
>> the error we will get the current behavior where everyone else prints
>> the traceback and so we do get a useful error message (not perfect
>> cause there are several error messages but much better than no
>> messages).
>
> I think this would be better.

We can try this as the default and see how it works in practice.

Barry

>
> Jed
