If someone on the team has a little bit of spare time (Brock? Johansen?),
could they take a look at the BE error observability document Evan sent out
on caiman-discuss?  This'll likely affect us (or we'll want to take
advantage of it), so we ought to throw an opinion in.  I'll look at it if I
can, but I want to make sure that someone from the team does.

    http://mail.opensolaris.org/pipermail/caiman-discuss/2009-August/013735.html
    http://mail.opensolaris.org/pipermail/caiman-discuss/attachments/20090818/b2377951/attachment.txt

Evan's requested comments by COB Friday, though I imagine we can ask for a
bit more time if necessary.

Thanks,
Danek
--- Begin Message ---

Here is the functional/design spec for the BE error and observability project.

It's a first version and does include some design elements for what will eventually become the error handling and logging service for the Caiman Unified Design project. However, this project is not intended to provide that service, only the beginnings of it and the ability to mesh easily with it.

I would like to get any comments by the end of the day on Friday (8/21).

Thanks!
-evan

Problem statement:
Currently in libbe, when there is an error during a call into the
library, only error codes are returned. While these error codes
provide some information on why an operation failed, they do not
provide enough context to tell the user what actually caused the
problem or what they may need to do to solve it.

To get more context the user can turn on extra error and debug
output through the use of the BE_PRINT_ERR environment variable.
However this also does not always provide enough information and
can cause the user some confusion. This is also a problem since
it requires the ability to print out these messages directly from
the library and requires the user to rerun the failing command to
retrieve the needed error or debug output.

Scope:
- This project will provide the ability to return a set of strings
  describing a failure and its context from calls into libbe.
- It is expected that this will not replace the use of be_print_err
  throughout the library at this time.
    - As we move forward and are able to provide all of the needed
      error information for all error conditions, all instances of
      be_print_err will be removed. However, its removal is not
      planned for this release.
- We will not provide the overall library for handling errors and
  logging as described in the Caiman Unified Design (CUD) documents.
    - However the design here is such that the code will be generic
      enough that moving into something that will fit with CUD
      will be easier. 

Requirements:
- Calls into the library need to include enough information to
  determine the cause of a failure and possible solutions to that
  failure.
    - The error information should include:
        - The operation being performed (the entry point into the
          library such as activating a BE).
        - What was being performed when the error occurred (for
          example running installgrub).
        - What the failure was (for example what was the error
          string returned from installgrub or a zfs_promote call).
        - What steps can be taken to correct the problem or if this
          is not available a link to more information on possible
          issues to check.
- As stated above the design will be generic enough that moving
  into something that will fit with CUD will be easier. This will
  be done by keeping the calls in a separate file and header file
  that can at a later time be removed. Also the functionality itself
  will be kept generic enough that it can be moved easily outside
  of libbe.


Requirements on other projects
- This project will require changes to any consumer of libbe so that
  they can make use of this new error information.

Error String definitions:
    The entry points into the library would be changed to pass back
    a structure containing strings with information about the failure.
    The idea here is that the caller will then be able to build their
    own error message based on this available information instead of
    being locked into the error message that we construct.
        - for example:

          typedef struct err_info_str {
              char *cmd_str;        /* who called into the library (beadm, pkg(5)) */
              char *op_str;         /* operation such as BE activate */
              char *failed_at;      /* where we failed */
              char *fail_msg_str;   /* error string from failure */
              char *fail_fixit_str; /* Context for error, instructions on
                                       what to check or link to html content
                                       if instructions can't be determined. */
          } err_info_str_t;

        char *cmd_str - This is the name of the command or function that has
                        called into the library. It is expected that the
                        caller will provide this information before calling
                        into the library if they wish this to be part of the
                        error strings. At some later point it is expected that
                        this would be used for logging purposes.
        char *op_str - This is the higher level operation being performed, such
                       as activating a BE.
        char *failed_at - This is the functional area or call that failed. This
                           would be things like zfs_mount or installgrub.
        char *fail_msg_str - This is any error string returned from the failed
                         call. This would be things like the contents of the
                         string returned by libzfs_error_description() or the
                         captured error string from installgrub.
        char *fail_fixit_str - This string will be generated by the code calling
                           into the code that will fill in the structure. This
                           generated string will contain information on what
                           the user can do to attempt to fix the cause of the
                           failure. If this can not be determined then instead
                           of this information a link to more information with
                           possible fixes for errors in this area will be used
                           for this string.
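
To make the "caller builds their own error message" idea concrete, here is a
minimal sketch of what a consumer might do with the strings. It is only an
illustration under assumptions: format_err_msg() and or_unknown() are
hypothetical names that do not appear in the spec, the message layout is
arbitrary, and the fields are read directly for brevity even though the spec
intends the structure to be encapsulated behind accessor functions.

```c
#include <stdio.h>
#include <stddef.h>

/* Structure as given in the spec above. */
typedef struct err_info_str {
	char *cmd_str;		/* who called into the library */
	char *op_str;		/* operation such as BE activate */
	char *failed_at;	/* where we failed */
	char *fail_msg_str;	/* error string from failure */
	char *fail_fixit_str;	/* what to check, or a link */
} err_info_str_t;

/* Hypothetical helper: substitute a placeholder for an unset field. */
static const char *
or_unknown(const char *s)
{
	return (s != NULL ? s : "(unknown)");
}

/*
 * Hypothetical consumer-side formatter (not part of the spec).
 * Builds a one-line message in buf from whatever fields are set and
 * returns the value returned by snprintf().
 */
int
format_err_msg(const err_info_str_t *e, char *buf, size_t len)
{
	return (snprintf(buf, len, "%s: %s failed at %s: %s. %s",
	    or_unknown(e->cmd_str), or_unknown(e->op_str),
	    or_unknown(e->failed_at), or_unknown(e->fail_msg_str),
	    e->fail_fixit_str != NULL ? e->fail_fixit_str : ""));
}
```

A different consumer (pkg(5), say) could ignore cmd_str entirely or present
the fix-it text separately. That flexibility is the reason for handing back
the individual strings instead of one preformatted message.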

Accessor Functions:
These functions are used to access the fields in the data structure as the 
structure itself will be encapsulated.
    - err_set_cmd_str(err_info_str_t *err_info, char *cmd_str)
        - This is used by the caller before calling into the library to set the
          name of the command or function calling into the library.
    - err_set_op_str(err_info_str_t *err_info, char *op_str)
        - This function adds the operation to the error string structure.
    - err_get_op_str(err_info_str_t *err_info)
        - This function will return the contents of the operation string
          from the err_info_str_t structure.
    - err_set_fail_strs(err_info_str_t *err_info, char *failed_at,
      char *fail_msg_str, char *fail_fixit_str)
        - This function fills in the failed_at, fail_msg_str and
          fail_fixit_str fields of the err_info_str_t structure.
        - This function can be used to fill in any or all of these fields in
          the structure.
    - err_get_failed_at_str(err_info_str_t *err_info)
        - retrieves the failed_at string from the structure
    - err_get_fail_msg_str(err_info_str_t *err_info)
        - retrieves the fail_msg_str string from the structure
    - err_get_fixit_str(err_info_str_t *err_info)
        - retrieves the fail_fixit_str string from the structure
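
The accessors listed above could look roughly like the following sketch. The
spec gives only names and parameters, so several things here are assumptions:
the return types (int for setters, char * for getters), the convention that a
NULL argument leaves a field unchanged (one way to read "any or all of these
fields"), the choice to keep private copies of the strings, and the
set_field() helper, which is hypothetical. The structure is assumed to start
out zeroed.

```c
#include <stdlib.h>
#include <string.h>

typedef struct err_info_str {
	char *cmd_str;
	char *op_str;
	char *failed_at;
	char *fail_msg_str;
	char *fail_fixit_str;
} err_info_str_t;

/*
 * Hypothetical helper: replace *field with a private copy of val.
 * A NULL val leaves the field untouched. Returns 0 on success and
 * -1 if memory could not be allocated.
 */
static int
set_field(char **field, const char *val)
{
	char *copy;
	size_t len;

	if (val == NULL)
		return (0);
	len = strlen(val) + 1;
	if ((copy = malloc(len)) == NULL)
		return (-1);
	(void) memcpy(copy, val, len);
	free(*field);
	*field = copy;
	return (0);
}

int
err_set_cmd_str(err_info_str_t *err_info, char *cmd_str)
{
	return (set_field(&err_info->cmd_str, cmd_str));
}

int
err_set_op_str(err_info_str_t *err_info, char *op_str)
{
	return (set_field(&err_info->op_str, op_str));
}

/* Sets any or all of the three failure strings; NULL means "skip". */
int
err_set_fail_strs(err_info_str_t *err_info, char *failed_at,
    char *fail_msg_str, char *fail_fixit_str)
{
	if (set_field(&err_info->failed_at, failed_at) != 0 ||
	    set_field(&err_info->fail_msg_str, fail_msg_str) != 0 ||
	    set_field(&err_info->fail_fixit_str, fail_fixit_str) != 0)
		return (-1);
	return (0);
}

char *
err_get_op_str(err_info_str_t *err_info)
{
	return (err_info->op_str);
}

char *
err_get_failed_at_str(err_info_str_t *err_info)
{
	return (err_info->failed_at);
}

char *
err_get_fail_msg_str(err_info_str_t *err_info)
{
	return (err_info->fail_msg_str);
}

char *
err_get_fixit_str(err_info_str_t *err_info)
{
	return (err_info->fail_fixit_str);
}
```

Copying the strings means the caller's buffers (for example the one returned
by libzfs_error_description()) do not have to outlive the call, at the cost
of needing a matching free routine, which the spec does not yet describe.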

Deliverables:
    - In addition to library changes to support the components mentioned
      above we will need to add the following:
        - Addition of code that will try to determine what the user may need
          to do to correct the problem.
        - Addition of the error string or structure to the nvlist passed
          back to the caller of libbe. This includes changing be_list and
          be_free_list to pass nvlists.
        - Additional documentation on the existence of this error
          information
        - Addition of the html content that describes possible solutions
          to various errors. This will at first be minimal with more
          information added as more errors are found where a solution can't
          be determined from the available information.
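
As a rough illustration of the first and last deliverables together, the
fix-it string could be chosen from a small table keyed on the failed_at
value, falling back to a link when no specific advice is known. This is only
a sketch under assumptions: err_lookup_fixit() and the table are hypothetical,
the advice text is invented for the example, and the URL is a placeholder
since the location of the html content is not given in the spec.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a failing call to advice for the user. */
typedef struct fixit_entry {
	const char *failed_at;
	const char *fixit;
} fixit_entry_t;

/* Example entries only; the advice text is invented for illustration. */
static const fixit_entry_t fixit_table[] = {
	{ "installgrub",
	    "Verify that the boot device is online and writable." },
	{ "zfs_mount",
	    "Check whether the mount point is busy or not empty." },
};

/* Placeholder URL; the real location of the html content is unknown. */
#define	FIXIT_FALLBACK \
	"See http://example.org/libbe-errors.html for possible causes."

/*
 * Hypothetical lookup: returns specific advice for failed_at if the
 * table has any, otherwise the generic link.
 */
const char *
err_lookup_fixit(const char *failed_at)
{
	size_t i;

	if (failed_at != NULL) {
		for (i = 0;
		    i < sizeof (fixit_table) / sizeof (fixit_table[0]);
		    i++) {
			if (strcmp(fixit_table[i].failed_at, failed_at) == 0)
				return (fixit_table[i].fixit);
		}
	}
	return (FIXIT_FALLBACK);
}
```

A static table is the simplest possible approach. Real diagnosis would likely
need to look at the error code and system state as well, which is why the
spec expects the html content to grow as more undiagnosable errors are found.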
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss

--- End Message ---
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
