Josh's pull request <https://github.com/apache/spark/pull/12433> on rpc
exception handling got me thinking ...

In my experience, a few things related to exceptions have created a lot of
trouble for us in production debugging:

1. Some exception is thrown, but is caught by a try/catch that does no
logging and does not rethrow.
2. Some exception is thrown, and is caught by a try/catch that does no
logging but does rethrow; however, the original exception is now masked.
3. Multiple exceptions are logged at different places close to each other,
but we don't know whether they are caused by the same problem or not.


To mitigate some of the above, here's an idea ...

(1) Create a common root class for all the exceptions used in Spark (e.g.
call it SparkException). We should make sure that every time we catch an
exception from a 3rd-party library, we rethrow it as a SparkException (a
lot of places already do that). In SparkException's constructor, log the
exception and the stack trace.
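
To make (1) concrete, here is a rough sketch, assuming SLF4J for logging.
The class name matches Spark's existing SparkException, but the
constructor-side logging and the readConfigFile call site are just
illustrations of the idea, not existing code:

import org.slf4j.LoggerFactory

// Root exception type: logging happens in the constructor, so the error is
// recorded even if a caller later swallows it in a silent try/catch.
class SparkException(message: String, cause: Throwable = null)
  extends Exception(message, cause) {
  SparkException.log.error(message, this)
}

object SparkException {
  private val log = LoggerFactory.getLogger(classOf[SparkException])
}

// Wrapping a 3rd-party exception: the original is kept as the cause
// instead of being masked.
object ConfigReader {
  def readConfigFile(path: String): String = {
    try {
      scala.io.Source.fromFile(path).mkString
    } catch {
      case e: java.io.IOException =>
        throw new SparkException(s"Failed to read config file $path", e)
    }
  }
}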

(2) SparkException has a monotonically increasing ID, and this ID appears
in the exception error message (say at the end).
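
And a rough sketch of (2) on top of the above, using a per-JVM AtomicLong
(so the driver and each executor would number their exceptions
independently):

import java.util.concurrent.atomic.AtomicLong
import org.slf4j.LoggerFactory

// Each instance grabs the next ID and appends it to the message, so log
// lines from different places can be correlated back to the same failure.
class SparkException(message: String, cause: Throwable = null)
  extends Exception(
    s"$message (exception id: ${SparkException.nextId.incrementAndGet()})",
    cause) {
  // Proposal (1): log the exception and stack trace at construction time.
  SparkException.log.error(getMessage, this)
}

object SparkException {
  private val log = LoggerFactory.getLogger(classOf[SparkException])
  private val nextId = new AtomicLong(0)
}

If we ever wanted the ID to be unique across the cluster rather than per
JVM, we could prefix it with the executor ID, but a per-JVM counter is
probably enough for correlating log lines.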


I think (1) will eliminate most of the cases in which an exception gets
swallowed. The main downside I can think of is that we might log the same
exception multiple times. However, I'd argue exceptions should be rare, and
it is not that big of a deal to log them twice or three times. The unique
ID in (2) can help us correlate exceptions if they appear in the logs
multiple times.

Thoughts?
