This is an attempt to merge, in a simple and consistent way, the handling of 
user domain errors that Dmitry brought up (for example, in Jacobian-free 
matrix-vector products) with the handling of failures such as a zero pivot in 
an LU factorization, which Ed brought up.

  I have a new branch, barry/propagate-pcsetup-failures, that attempts to 
propagate errors up from lower levels of KSP solves, PC setup, PC applies, etc. 
using NaN/Inf. (Note, Dmitry: I am not doing the matrix-free domain error 
setting that you proposed; that can be added at any time.)

   1) This simplifies the needed code, since we won't need to put checks on 
return values all over the place to detect failure, nor do we need to worry 
about propagating errors from one process to another (since the NaN/Inf gets 
carried to every process by the MPI_Allreduce()).

    2) I am propagating the KSPSetErrorIfNotConverged() flag down to all the 
inner solvers, so if the user DOES want an immediate error stop they can get it 
by simply setting the flag at the highest level of KSP.

    3) Eventually we would like to propagate up not only the fact that an error 
happened but also information about the type of error. This, I think, we can do 
orthogonally to propagating up the FACT that we have an error with the NaN/Inf. 
In other words, if an error is detected by a NaN/Inf norm or inner product, then 
eventually the code would be able to query where the problem started, for 
example a zero pivot inside the coarse grid solve inside a multigrid inside a 
fieldsplit, etc.
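
   To make points 1) and 2) concrete, here is a minimal standalone sketch in 
plain C. It is not PETSc code and not what the branch contains: ToySolver, 
ToyReason, and the toy_* functions are hypothetical names invented for 
illustration, and toy_allreduce_sum() is only a serial stand-in for the sum 
that MPI_Allreduce() would perform inside a norm or inner product. It shows 
that a NaN in one rank's contribution contaminates the reduced value that 
every rank would see, and how a flag pushed down to the inner solvers could 
choose between a hard error and just recording a diverged reason.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not PETSc types. */
typedef enum { TOY_ITERATING = 0, TOY_DIVERGED_NANORINF = -1 } ToyReason;

typedef struct ToySolver {
  int               error_if_not_converged; /* plays the role of the KSPSetErrorIfNotConverged() flag */
  ToyReason         reason;                 /* why the solve stopped, if it did */
  struct ToySolver *inner;                  /* nested (inner) solver, or NULL */
} ToySolver;

/* Serial stand-in for a sum reduction over ranks: partial[i] is the
   contribution of rank i. One NaN or Inf entry makes the total non-finite. */
static double toy_allreduce_sum(const double *partial, size_t nranks)
{
  double sum = 0.0;
  assert(nranks > 0);
  for (size_t i = 0; i < nranks; i++) sum += partial[i];
  return sum;
}

/* Push the flag from the outermost solver down the whole chain of inner solvers. */
static void toy_set_error_if_not_converged(ToySolver *s, int flag)
{
  for (; s != NULL; s = s->inner) s->error_if_not_converged = flag;
}

/* Inspect a reduced norm or inner product. A non-finite value always records
   the diverged reason; it returns nonzero (a hard error) only if the flag is
   set. A finite value leaves the solver state untouched and returns 0. */
static int toy_check_reduction(ToySolver *s, double value)
{
  if (isnan(value) || isinf(value)) {
    s->reason = TOY_DIVERGED_NANORINF;
    return s->error_if_not_converged ? 1 : 0;
  }
  return 0;
}
```

   With the flag off, a caller that reduces a set of contributions containing a 
NaN and passes the result to toy_check_reduction() gets a normal return with 
reason set to TOY_DIVERGED_NANORINF, so an outer solver can react to it. After 
toy_set_error_if_not_converged(&outer, 1), the same check on an inner solver 
returns nonzero immediately.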

   Thoughts?

   Barry





> On Apr 29, 2015, at 10:11 PM, Barry Smith <[email protected]> wrote:
> 
> 
>  Indeed you proposed the exact thing. I would be happy if you tried to make a 
> branch of master that used this approach.
> 
>  Barry
> 
>> On Apr 29, 2015, at 9:28 PM, Dmitry Karpeyev <[email protected]> wrote:
>> 
>> Barry,
>> Sorry, I must have missed this -- I really ought to make a better filter for 
>> catching email like this.
>> I think using NaNs is an excellent solution, in fact, I was proposing it a 
>> few months ago here :-)
>> http://lists.mcs.anl.gov/pipermail/petsc-dev/2015-February/016958.html
>> It ensures that the error is collective (the norm reduction will ensure 
>> every rank gets a NaN), 
>> the "error condition" is cleared automatically on the next MatMult, etc.
>> I'm all for it.
>> Should I put it in?
>> 
>> Dmitry.
>> 
>> On Wed, Apr 29, 2015 at 8:26 PM Barry Smith <[email protected]> wrote:
>> 
>>  Dmitry,
>> 
>>    I haven't heard back from you on this. Any thoughts?
>> 
>>  Barry
>> 
>>> On Apr 20, 2015, at 6:23 PM, Barry Smith <[email protected]> wrote:
>>> 
>>> 
>>> Dmitry,
>>> 
>>>  Rather than introducing another whole complexity of flags for indicating 
>>> domain errors in user functions just do the following.
>>> 
>>>  1) just stick a NaN into the function's result
>>>  2) remove the VecValidValues() at the END of routines like MatMult()
>>>  3) when NaN or Inf pop up in Krylov methods (which will happen within 
>>> VecNorm() or VecDot(), and thus we get free collective knowledge of the 
>>> problem even if it happened on only one node), generate the appropriate 
>>> KSP_DIVERGED_NANORINF. This is already handled sometimes (most of the 
>>> time?); for example, KSPSolve_CG contains this code:
>>> ierr = VecXDot(Z,R,&beta);CHKERRQ(ierr);         /*  beta <- z'*r       */
>>>   if (PetscIsInfOrNanScalar(beta)) {
>>>     if (ksp->errorifnotconverged) SETERRQ(PetscObjectComm((PetscObject)ksp),PETSC_ERR_NOT_CONVERGED,"KSPSolve has not converged due to Nan or Inf inner product");
>>>     else {
>>>       ksp->reason = KSP_DIVERGED_NANORINF;
>>>       PetscFunctionReturn(0);
>>>     }
>>>   }
>>> 
>>>  4) SNES already handles a KSP that failed to converge, and
>>>  5) TS already handles a SNES that failed to converge, for example by 
>>> cutting the timestep.
>>> 
>>> Barry
>>> 
>>> 
>> 
> 
