On Mon, May 4, 2015 at 11:18 PM Barry Smith <[email protected]> wrote:
> > On May 4, 2015, at 10:54 PM, Dmitry Karpeyev <[email protected]> wrote:
> >
> > On Mon, May 4, 2015 at 6:20 PM Barry Smith <[email protected]> wrote:
> >
> > My first reaction to this was "man that is ugly and cumbersome, I can
> > do it much cleaner than that"; turns out it isn't as simple as I
> > thought but with a couple of macros I think I've incorporated much of
> > what is needed in
> >
> > https://bitbucket.org/petsc/petsc/pull-request/315/propagating-solver-errors-instead-of/diff
> >
> > some work needs to be done on getting the most appropriate SNES
> > converged reason set. In fact one could argue that trying to pass the
> > converged reason up as a single enum type may not be the best model
> > since there may be more information that one wishes to convey such as
> > function domain error that happened while differencing the function
> > with coloring to compute the Jacobian.
> >
> > Are you arguing for a more full-fledged exception handling?
>
> No. Actually the more full-fledged exception handling has to handle the
> parallel collective issues which is tough.

Yes, we'd have to ensure that every rank raises the exception. That's why
NaN/Inf norm/dot is so attractive. Maybe if we allowed only reductions to
raise exceptions?

> > Note that you are essentially having to insert various custom
> > "exception condition" checks (e.g., SNESCheckKSPSolve(),
> > if(ksp->reason) break; KSPCheckDot(), etc) on the whole call path,
> > along which an exception might be propagating. This strikes me as
> > brittle and error-prone, not to mention threatening to get rather
> > complex if the number of these exceptions and their combinations
> > starts to grow.
>
> Propose something better

> > Anyways in particular look at the test example ex69.c
> >
> > Looks pretty good. Thanks!
> >
> > Barry
> >
> > > On May 1, 2015, at 10:52 PM, Dmitry Karpeyev <[email protected]> wrote:
> > >
> > > Here's the first crack at it:
> > > https://bitbucket.org/petsc/petsc/branch/karpeev/ksp-diverged-on-matmult-nanorinf
> > > Messier than I had expected (GMRES only for now).
> > >
> > > On Fri, May 1, 2015 at 8:06 PM Dmitry Karpeyev <[email protected]> wrote:
> > > On Fri, May 1, 2015 at 7:32 PM Barry Smith <[email protected]> wrote:
> > >
> > > > On May 1, 2015, at 6:43 PM, Jed Brown <[email protected]> wrote:
> > > >
> > > > Barry Smith <[email protected]> writes:
> > > >> 1) This simplifies the needed code since we won't need to put
> > > >> checks all over the place on returns about failure nor do we need
> > > >> to worry about propagating errors from one process to another
> > > >> (since the Nan/Inf get moved by the MPI_Allreduce()).
> > > >
> > > > My concern is that -fp_trap will become a lot less useful.
> > >
> > > I agree there is a tradeoff; but under "normal" circumstances where
> > > there are no Nan or Inf around (which I think is most of the time)
> > > -fp_trap will be just as useful as now. For the other cases the user
> > > will have to have some idea where (and when) in the code to turn on
> > > the trapping to catch the "true" problems.
> > >
> > > Barry
> > >
> > > The only other way I see to do it is carry a validity flag around
> > > with each vector and reduce that flag in all the vector reductions;
> > > but this alone is not enough we would also have to have some
> > > propagation code for things like zero pivot, for example setting a
> > > validity flag in the Mat factor (saying the factor is not valid) and
> > > propagating up those flags. We get all these things "for free" with
> > > the Inf Nan approach.
> > >
> > > There is an additional benefit: the validity flag would have to be
> > > cleared by the caller to avoid "false positives" on subsequent calls.
> > > That's an opportunity for bugs. With NaN the "error condition" (i.e.,
> > > the NaN entry) gets cleared automatically by a subsequent successful
> > > vector operation.
> > >
> > > What exactly caused the NaN would have to be signaled "out-of-band"
> > > as the saying goes. One way to "signal" it is by the code path that
> > > led to the error condition: that's why calling through KSP_MatMult()
> > > is useful. It's not ideal, but covers the cases of immediate
> > > interest.
> > >
> > > Dmitry.
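
For illustration only (this is not code from the branch or the pull
request): a minimal sketch in plain C, with no MPI and no PETSc, of why a
NaN in a reduction is attractive as an error signal. The names
simulated_allreduce_sum(), local_dot() and reduction_failed() are
hypothetical; the first one merely stands in for an MPI_Allreduce() with a
sum, on the assumption that every rank receives the same reduced value.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for a sum reduction across ranks: adds up the
   partial result that each simulated "rank" contributes. */
double simulated_allreduce_sum(const double *partial, size_t nranks)
{
  double sum = 0.0;
  size_t r;

  assert(nranks > 0);
  for (r = 0; r < nranks; ++r) sum += partial[r];
  return sum;
}

/* Local contribution of one rank to a dot product. */
double local_dot(const double *x, const double *y, size_t n)
{
  double s = 0.0;
  size_t i;

  for (i = 0; i < n; ++i) s += x[i] * y[i];
  return s;
}

/* Returns 1 if the reduced value is NaN or Inf, 0 otherwise. Since the
   reduced value is assumed identical on all ranks, all ranks would take
   the same branch without any extra communication. */
int reduction_failed(double reduced)
{
  return isnan(reduced) || isinf(reduced);
}
```

If one simulated rank holds a NaN entry, its local_dot() is NaN, the sum is
NaN, and reduction_failed() is true for the reduced value. Once that rank
computes with valid data again, the next reduction is finite; there is no
flag that the caller has to remember to clear.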
