Better error reporting in Mesos

Michael Hopcroft Fri, 06 Nov 2015 00:30:01 -0800

Error reporting in Mesos is based on Stout’s Error class and a variety of
other classes, such as ErrnoError and WindowsError, that derive from Error.
One goal of these classes is to simplify the construction and storage of
error message strings. As an example, the ErrnoError constructor
automatically incorporates std::string(strerror(errno)) into its error
message.



Stout catches exceptions and converts them into Try<T>  and Result<T>
objects which are used as polymorphic return values that can express either
an error condition or a proper return value. One consequence of catching
exceptions is that call stacks from throw sites and error sites are lost.
This makes it harder to debug production environment crashes post hoc.


One way to improve debugability would be to add more information about the
error site to the error message. This could be done by convention, but it
seems that the vast majority of the error messages in Mesos currently don’t
include much information about the site that originally generated an error.


Because much of the code in Mesos uses Error and ErrnoError, we have an
opportunity to add diagnostic information automatically with only a small
amount of churn. What we’d need to do is provide Error creation macros that
would pass the source file name and line number to the constructors of the
various Error classes.


Concretely this would involve


   - class Error
      - Rename declaration and definition to ErrorImpl (or ErrorBase). This
      change is not error prone.
      - Rename references to class’ type. These changes are mostly limited
      to result.hpp and try.hpp.
      - Don’t rename calls to constructor. These will now just invoke the
      macro.
      - Modify the constructor to take a line number and source filename in
      addition to a message.
      - Define a macro called Error(message) that has the same prototype as
      the original Error constructor. This macro would invoke
      ErrorImpl::ErrorImpl, passing the current line number and source filename
      along with the user-provided message. The macro would make use
of __FILE__
      and __LINE__, which are predefined in gcc and Visual Studio.
   - class ErrnoError, WindowsError.
      - Similar pattern as above.

*Pros.*


   - Codebase is more debuggable in production environment because the vast
   majority of error sites will document the file name and line number in the
   error message.
   - Most error sites don’t require a code change.

*Cons.*


   - Introduces a macro. In general, we’d like to move away from macros
   wherever possible.
   - Discourages subclassing of Error because each subclass would need its
   own special macro to call its constructor. A brief search of the code base
   seems to indicate that Mesos programmers don’t tend to declare subclasses
   of Error. Right now, it seems that subclasses exist mainly to provide
   different construction semantics. The big question is whether one would
   want to use subclasses down the road to distinguish between equivalence
   classes of errors, say those that should bring the process down vs those
   that are recoverable.
   - Some amount of code churn, mainly in error.hpp, result.hpp, and
   try.hpp.

*Limitations.*


   - This approach by itself won’t address error sites that invoke
   convenience functions such as Try<T>::error() and Result<T>::error(). These
   could be addressed with a similar macro, but it would need to somehow deal
   with the template parameter as well.

*Risks and Potential Complications.*


   - The rename process might fail to correctly distinguish between error
   sites (i.e. calls to Error::Error()) and other uses of the term “Error”.
   - There may be important compilers that don’t support __FILE__ and
   __LINE__. These predefined macros have been in the ANSI/ISO standard for a
   while, but it would still be wise to check each compiler we care about.
   - General hazards associated with macro expansion.
   - Users of Mesos may have written diagnostic tools that depend on the
   current values of the error strings.
   - Unit tests may depend on exact values of error strings. If this were
   the case, we'd have to provide a convenience function that strips off the
   file names and line numbers so that unit tests don't break when error sites
   move around due to edits elsewhere in the file.

Let me know what you think of this idea. If the approach seems reasonable
and a worthwhile, I would be willing to prototype it to see if it has some
fatal flaw. Thanks for your input!



-Mike

Better error reporting in Mesos

Reply via email to