[petsc-dev] errors galore related to Barry's change to PetscError

Barry Smith Sun, 9 May 2010 11:39:11 -0500

On May 9, 2010, at 8:47 AM, Jed Brown wrote:

> On Sat, 8 May 2010 17:42:33 -0500, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>    Now that a comm is passed into the error handler, how we use it is
>>    still preliminary and a work in progress. Likely it will evolve as
>>    we figure out what to do.
>> 
>> The reason I don't have non-root call MPI_Abort() (even after waiting)
>> is that MPI_Abort() will trigger all the other processes to abort. If
>> the root is "late" getting to the error then it will receive an abort
>> from the non-root MPI_Abort() and never execute the traceback hence no
>> error message; bad news. At least I think this might happen.
> 
> Alternatively this happens:
> 
>  mpirun has exited due to process rank 1 with PID 4388 on
>  node kunyang exiting without calling "finalize". This may
>  have caused other processes in the application to be
>  terminated by signals sent by mpirun (as reported here).
> 
> But there is a crucial behavioral change.  The user used to be able to
> catch the error at any point in the chain and decide not to make it
> fatal.  This is no longer possible with the traceback error handler
> (which admittedly isn't the best handler for this handling mechanism).
> I realize that MPI (and thus PETSc) make no guarantees about the state
> after an error occurs, but the user might be trying to write a checkpoint
> or release some resources, in which case abort() from the other ranks is
> not desirable.


   Alternatives:

* Have all the other processes return silently up the stack so they can be
"recovered". Note, I have been tempted to rip out the current "exception
handling" stuff I put in earlier. It is ugly and probably fragile.

* One can provide more than one traceback error handler, for example one that is
just like the traditional PETSc one.
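
A minimal sketch of these two alternatives in plain C, with no MPI or PETSc calls. All names here (handler_fn, traditional_handler, root_only_handler, active_handler, CHECK, run_with_recovery) are hypothetical and are not the PETSc API. Both handlers return the error code up the stack instead of aborting, so a caller anywhere in the chain can still decide the error is not fatal; they differ only in which ranks print.

```c
#include <stdio.h>

/* Hypothetical handler signature for illustration; not the PETSc one. */
typedef int (*handler_fn)(int rank, const char *func, int line, int code);

/* Counts printed traceback lines so the behavior can be inspected. */
int traceback_lines = 0;

/* "Traditional" style: every rank prints a line and returns the code. */
int traditional_handler(int rank, const char *func, int line, int code)
{
    fprintf(stderr, "[%d] error %d in %s() line %d\n", rank, code, func, line);
    traceback_lines++;
    return code;
}

/* Root-only style: rank 0 prints, other ranks return silently up the stack. */
int root_only_handler(int rank, const char *func, int line, int code)
{
    if (rank == 0) {
        fprintf(stderr, "[%d] error %d in %s() line %d\n", rank, code, func, line);
        traceback_lines++;
    }
    return code;
}

/* The handler currently in use; swap it to change the policy. */
handler_fn active_handler = traditional_handler;

/* On a nonzero code, call the active handler and propagate its result. */
#define CHECK(rank, code) \
    do { if ((code) != 0) return active_handler((rank), __func__, __LINE__, (code)); } while (0)

int inner(int rank)
{
    int err = 77; /* pretend something failed here */
    CHECK(rank, err);
    return 0;
}

int middle(int rank)
{
    int err = inner(rank);
    CHECK(rank, err);
    return 0;
}

/* The caller catches the error and chooses not to make it fatal.
   Returns 1 if an error was caught and handled, 0 if there was none. */
int run_with_recovery(int rank)
{
    int err = middle(rank);
    if (err != 0) {
        /* e.g. write a checkpoint or release resources here */
        return 1;
    }
    return 0;
}
```

With traditional_handler installed, every rank prints one line per stack frame; with root_only_handler, only rank 0 prints and the other ranks unwind quietly. In both cases run_with_recovery() gets the error code back and can act on it.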



> 
>> An alternative to what I have done is to have non-root wait a while
>> and then return with the usual traceback. Thus under normal
>> circumstances it will receive the abort() from root before printing
>> the traceback so we will get one nice traceback from root. (will it?)
>> Under strange circumstances where root for some reason doesn't get to
>> the error we will get the current behavior where everyone else prints
>> the traceback and so we do get a useful error message (not perfect,
>> because there are several error messages, but much better than no
>> messages).
> 
> I think this would be better.

   We can try this as the default and see how it works in practice.
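
A toy model of that reasoning, in plain C with no MPI calls. It assumes (hypothetically) that root prints its traceback and aborts as soon as it reaches the error, that the abort reaches every other rank instantly, and that a non-root rank prints only if it is still alive after its wait. The name count_tracebacks and the time values are made up for illustration; real abort delivery is not instantaneous, so this only shows the intended behavior, not what an MPI implementation guarantees.

```c
#include <stddef.h>

/* Count how many tracebacks get printed across all ranks.
 *
 * t_err[r] : time at which rank r reaches the error, or a negative
 *            value if rank r never reaches it.
 * Rank 0 is root: it prints immediately, then aborts everyone
 * (modeled as instantaneous, which is an assumption).
 * Non-root ranks wait `wait` time units after hitting the error and
 * print only if root's abort has not arrived by then.
 */
int count_tracebacks(const double *t_err, size_t nranks, double wait)
{
    int printed = 0;
    int root_errs = (t_err[0] >= 0.0);

    if (root_errs) printed++;              /* root's single traceback */

    for (size_t r = 1; r < nranks; r++) {
        if (t_err[r] < 0.0) continue;      /* this rank never hit the error */
        double print_time = t_err[r] + wait;
        if (root_errs && t_err[0] <= print_time)
            continue;                      /* killed by root's abort first */
        printed++;                         /* root too late or absent */
    }
    return printed;
}
```

In this model a wait longer than the spread between ranks gives one traceback when root participates, and one traceback per failing rank when root never reaches the error.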


   Barry

> 
> Jed
