No problem on our end with MatSetFailure() being collective... we can guarantee that easily.
Thanks for working on this! This has been a high priority feature request for a while. Derek On Thu, Feb 19, 2015 at 3:41 PM Barry Smith <[email protected]> wrote: > > > On Feb 19, 2015, at 2:06 PM, Dmitry Karpeyev <[email protected]> > wrote: > > > > > > > > On Thu Feb 19 2015 at 12:59:12 PM Barry Smith <[email protected]> > wrote: > > > > > On Feb 19, 2015, at 1:56 PM, Dmitry Karpeyev <[email protected]> > wrote: > > > > > > > > > > > > On Thu Feb 19 2015 at 12:41:59 PM Barry Smith <[email protected]> > wrote: > > > > > > Yeah, that sounds like a good fix, except for this: we have to make > sure all ranks diverge with this failure so that the user can retry the > solve, if necessary. > > > > This is the business of the person calling MatSetFailure(). We can > require that this routine be a collective. > > Yes, but this way any "resilient" code is required to carry out an extra > reduction every KSP iteration, and we are not giving them an opportunity to > coalesce it with anything else. > > For each Krylov method the "point" in the code where the coalescence > can take place is different and with methods like pipelined GMRES the point > where the coalescence can be done is AFTER the point where the results of > the (failed) multiply are used so coding can be tricky. For now I won't > worry about coalescence and would just require MatSetFailure() to > collective. We need more experience before we try to be fancy. > > If you do any coding use my branch barry/use-ksp_matmult_pcapply- > in-ksp-methods > > Barry > > > > > In any event, I can put this MatSetFailure()/MatSetFailureCollective() > in. > > Dmitry. > > > > Barry > > > > > That would require an extra reduction every KSP iteration. > > > With Inf or NaN we could piggyback on the norm computation. > > > > > > Dmitry. > > > > > > Barry > > > > > > > > > > > > > > > > On Feb 19, 2015, at 9:33 AM, Dmitry Karpeyev <[email protected]> > wrote: > > > > > > > > I wanted to revive this thread and move it to petsc-dev. This > problem seems to be harder than I realized. > > > > > > > > Suppose MatMult inside KSPSolve() inside SNESSolve() cannot compute > a valid output vector. > > > > For example, it's a MatMFFD and as part of its function evaluation > it has to evaluate an implicitly-defined > > > > constitutive model (e.g., solve an equation of state) and this inner > solve diverges > > > > (e.g., the time step is too big). I want to be able to abort the > linear > > > > solve and the nonlinear solve, return a suitable "converged" reason > and let the user retry, maybe with a > > > > different timestep size. This is for a hand-rolled time stepper, > but TS would face similar issues. > > > > > > > > Based on the previous thread here http://lists.mcs.anl.gov/ > pipermail/petsc-users/2014-August/022597.html > > > > I tried marking the result of MatMult as "invalid" and let it > propagate up to KSPSolve() where it can be handled. > > > > This quickly gets out of control, since the invalid Vec isn't > returned to the KSP immediately. It could be a work > > > > vector, which is fed into PCApply() along various code paths, > depending on the side of the preconditioner, whether it's a > > > > transpose solve, etc. Each of these transformations (e.g., > PCApply()) would then have to check the validity of > > > > the input argument, clear its error condition and set it on the > output argument, etc. Very error-prone and fragile. > > > > Not to mention the large amount of code to sift through. > > > > > > > > This is a general problem of exception handling -- we want to > "unwind" the stack to the point where the problem should > > > > be handled, but there doesn't seem to a good way to do it. We also > want to be able to clear all of the error conditions > > > > on the way up (e.g., mark vectors as valid again, but not too > early), otherwise we leave the solver in an invalid state. > > > > > > > > > > > > Instead of passing an exception condition up the stack I could try > storing that condition in one of the more globally-visible > > > > objects (e.g., the Mat), but if the error occurs inside the > evaluation of the residual that's being differenced, it doesn't really > > > > have access to the Mat. This probably raises various thread safety > issues as well. > > > > > > > > Using SNESSetFunctionDomainError() doesn't seem to be a solution: a > MatMFFD created with MatCreateSNESMF() > > > > has a pointer to SNES, but the function evaluation code actually has > no clue about that. More generally, I don't > > > > know whether we want to wait for the linear solve to complete before > handling this exception: it is unnecessary, > > > > it might be an expensive linear solve and the result of such a > KSPSolve() is probably undefined and might blow up in > > > > unexpected ways. I suppose if there is a way to get a hold of SNES, > each subsequent MatMult_MFFD has to check > > > > whether the domain error is set and return early in that case? We > would still have to wait for the linear solve to grind > > > > through the rest of its iterations. I don't know, however, if > there is a good way to guarantee that linear solver will get > > > > through this quickly and without unintended consequences. Should > MatMFFD also get a hold of the KSP and set a flag > > > > there to abort? I still don't know what the intervening code (e.g., > the various PCApply()) will do before the KSP has a > > > > chance to deal with this. > > > > > > > > I'm now thinking that setting some vector entries to NaN might be a > good solution: I hope this NaN will propagate all the > > > > way up through the subsequent arithmetic operations (does the IEEE > floating-point arithmetic guarantees?), this "error > > > > condition" gets automatically cleared the next time the vector is > recomputed, since its values are reset. Finally, I want > > > > this exception to be detected globally but without incurring an > extra reduction every time the residual is evaluated, > > > > and NaN will be show up in the norm that (most) KSPs would compute > anyway. That way KSP could diverge with a > > > > KSP_DIVERGED_NAN or a similar reason and the user would have an > option to retry. The problem with this approach > > > > is that VecValidEntries() in MatMult() and PCApply() will throw an > error before this can work, so I'm trying to think about > > > > good ways of turning it off. Any ideas about how to do this? > > > > > > > > Incidentally, I realized that I don't understand how > SNESFunctionDomainError can be handled gracefully in the current > > > > set up: it's not set or checked collectively, so there isn't a good > way to abort and retry across the whole comm, is there? > > > > > > > > Dmitry. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun Aug 31 2014 at 10:12:53 PM Jed Brown <[email protected]> > wrote: > > > > Dmitry Karpeyev <[email protected]> writes: > > > > > > > > > Handling this at the KSP level (I actually think the Mat level is > more > > > > > appropriate, since the operator, not the solver, knows its domain), > > > > > > > > We are dynamically discovering the domain, but I don't think it's > > > > appropriate for Mat to refuse to evaluate any more matrix actions > until > > > > some action is taken at the MatMFFD/SNESMF level. Marking the Vec > > > > invalid is fine, but some action needs to be taken and if Derek wants > > > > the SNES to skip further evaluations, we need to propagate the > > > > information up the stack somehow. > > > > > > >
