I have updated the functional/design spec for the BE error and
observability project based on all of the feedback received so far.
I believe I have covered all of the issues and concerns raised.
Please provide your comments by Thursday 8/27.
Thanks,
-evan
Problem statement:
Currently in libbe, when there is an error during a call into the
library, only error codes are returned. While these error codes
provide some information on why an operation failed, they do not
provide enough context to tell the user what actually caused the
problem or what they may need to do to solve it.
To get more context the user can turn on extra error and debug
output through the use of the BE_PRINT_ERR environment variable.
However this also does not always provide enough information and
can cause the user some confusion. This is also a problem since
it requires the ability to print out these messages directly from
the library and requires the user to rerun the failing command to
retrieve the needed error or debug output.
Scope:
- This project will provide the ability to return an nvlist of
information describing a failure and its context from calls into libbe.
- It is expected that this will not replace the use of be_print_err
throughout the library at this time.
- As we move forward and are able to provide all of the needed
error information for all error conditions, all instances of
be_print_err will be removed. However, its removal is not
planned for this release.
- We will not provide the overall library for handling errors and
logging as described in the Caiman Unified Design (CUD) documents.
- However the design here is such that the code will be generic
enough that moving into something that will fit with CUD
will be easier.
Requirements:
- Calls into the library need to include enough information to
determine the cause of a failure and possible solutions to that
failure.
- The error information should include:
- The operation being performed (the entry point into the
library such as activating a BE).
- What was being performed when the error occurred (for
example running installgrub).
- What the failure was (for example what was the error
string returned from installgrub or a zfs_promote call).
- What steps can be taken to correct the problem or if this
is not available a link to more information on possible
issues to check.
- As stated above the design will be generic enough that moving
into something that will fit with CUD will be easier. This will
be done by keeping the calls in a separate file and header file
that can at a later time be removed. Also the functionality itself
will be kept generic enough that it can be moved easily outside
of libbe.
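To make the four required pieces of error information concrete, here is a
minimal sketch in plain C. It does not use libnvpair, and every name in it
(err_report_t, err_report_format and the er_* fields) is hypothetical and
not part of libbe. It only shows one way the pieces could be combined into
a message, falling back to the link when no corrective steps are known.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record; these field names are illustrative, not libbe's. */
typedef struct err_report {
	const char *er_operation;	/* entry point, e.g. activating a BE */
	const char *er_failed_at;	/* step in progress, e.g. installgrub */
	const char *er_failure;		/* error string from the failing call */
	const char *er_fixit;		/* corrective steps, or NULL if unknown */
	const char *er_info_url;	/* fallback link to more information */
} err_report_t;

/*
 * Format the report into buf. When no corrective steps are known the
 * link to more information is used instead. Returns what snprintf()
 * returns: the length the full message needs, excluding the NUL.
 */
int
err_report_format(const err_report_t *er, char *buf, size_t len)
{
	assert(er != NULL && buf != NULL);

	if (er->er_fixit != NULL) {
		return (snprintf(buf, len,
		    "%s failed while running %s: %s. Suggested fix: %s",
		    er->er_operation, er->er_failed_at, er->er_failure,
		    er->er_fixit));
	}
	return (snprintf(buf, len,
	    "%s failed while running %s: %s. See %s",
	    er->er_operation, er->er_failed_at, er->er_failure,
	    er->er_info_url));
}
```

A consumer would fill in the record after a failed call and pass a buffer
to err_report_format(); in the real design the same fields would travel in
the nvlist rather than in a C struct.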
Requirements on other projects:
- This project will require changes to any consumer of libbe so that
they can make use of this new error information.
Errors Corrected Internally:
- For errors we can fix internal to the library we will use a linked list
in the library handle which will allow us to relay any informational
data that may need to be reported back to the consumer on the
corrective action taken. This linked list will be made up of the same
data structures shown below and will use the same interfaces to retrieve
this information.
For example if we find that the grub menu is missing we attempt to
create a new menu.lst file. When this is done the corrective action
would be added to the linked list of fixed error data. For these
the error type will always be "no error" since the error was corrected.
- When logging is available this information can also be logged from
within the library and separately from this linked list.
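The list of internally corrected errors could be maintained along the
following lines. This is only a sketch under assumptions: fixed_err_t,
fixed_err_append and fixed_err_free are hypothetical stand-ins for the
real structures, and FIX_NO_ERR simply mirrors the "no error" type
described above.

```c
#include <assert.h>
#include <stdlib.h>

enum { FIX_NO_ERR = 0 };	/* stand-in for the "no error" type */

/* Hypothetical list node for one error that was fixed internally. */
typedef struct fixed_err {
	int fe_err_type;		/* always FIX_NO_ERR here */
	const char *fe_action;		/* corrective action (not copied) */
	struct fixed_err *fe_next;
} fixed_err_t;

/*
 * Append a corrective action to the end of the list so that actions
 * are reported in the order they were taken.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
int
fixed_err_append(fixed_err_t **head, const char *action)
{
	fixed_err_t *node, **tail;

	assert(head != NULL && action != NULL);

	node = calloc(1, sizeof (*node));
	if (node == NULL)
		return (-1);
	node->fe_err_type = FIX_NO_ERR;
	node->fe_action = action;

	/* Walk to the last next-pointer and hook the new node in. */
	for (tail = head; *tail != NULL; tail = &(*tail)->fe_next)
		;
	*tail = node;
	return (0);
}

/* Free every node; the action strings are owned by the caller. */
void
fixed_err_free(fixed_err_t *head)
{
	while (head != NULL) {
		fixed_err_t *next = head->fe_next;
		free(head);
		head = next;
	}
}
```

In the missing grub menu example, the library would call something like
fixed_err_append(&list, "created new menu.lst") after recreating the file,
and the list would be freed when the handle is closed.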
Logging:
- The logging side of things is outside the scope of this project and will
be done as part of the Caiman Unified Design project. That being said we
can see the possibility for two types or levels of logging that may be
needed. The first is logging that the consumer of the library will do.
This will be based on the information returned through the library's handle.
There is also a need for some debugging form of logging, which will be
done inside the library.
Library Handle:
- The library interfaces will be changed to pass back a handle which will
contain primarily the error information. This handle will be allocated by
the library and returned to the consumer. When the consumer has retrieved
the information they are interested in, the handle must be closed,
which will free the memory and perform any other cleanup that may be needed.
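The handle life cycle described here (allocated by the library, read by
the consumer, then closed) might look roughly like the sketch below. The
names demo_handle_t, demo_operation and demo_close_handle are hypothetical
and only illustrate the ownership rules; they are not the proposed libbe
interfaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle; the real be_handle_t carries more than this. */
typedef struct demo_handle {
	int dh_err_num;		/* 0 on success */
	char *dh_failed_str;	/* owned by the handle, may be NULL */
} demo_handle_t;

/* Portable string copy (strdup() is POSIX, not ISO C). */
static char *
copy_str(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p != NULL)
		memcpy(p, s, n);
	return (p);
}

/*
 * Stand-in for a library entry point. The library allocates the
 * handle and returns it through hdp. Returns 0 on success, 1 if the
 * BE name is missing, -1 if the handle could not be allocated.
 */
int
demo_operation(const char *be_name, demo_handle_t **hdp)
{
	demo_handle_t *hd;

	assert(hdp != NULL);

	hd = calloc(1, sizeof (*hd));
	if (hd == NULL)
		return (-1);
	if (be_name == NULL || be_name[0] == '\0') {
		hd->dh_err_num = 1;
		hd->dh_failed_str = copy_str("no BE name supplied");
	}
	*hdp = hd;
	return (hd->dh_err_num);
}

/* Free the handle and everything attached to it. */
int
demo_close_handle(demo_handle_t *hd)
{
	if (hd == NULL)
		return (-1);
	free(hd->dh_failed_str);
	free(hd);
	return (0);
}
```

The point of the sketch is that the consumer never frees individual
fields. It reads what it needs and then makes a single close call, which
leaves the library free to change what hangs off the handle later.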
Structure definitions:
internal to the library:
struct err_info {
	union {
		int ei_err_num;		/* this is a be_errno */
		int ei_op_num;		/* enum of libbe operations */
		int ei_fixit_str_num;	/* enum of fixit strings or URLs */
		int ei_failed_at;	/* enum of function calls */
		char ei_failed_str[MAXLEN]; /* error string returned from failure */
	} ei_info;
	int ei_err_type;	/* The type of failure */
};
enum {
	EI_NO_ERR = 0,
	EI_BE_ERR = 5000,	/* libbe errors */
	EI_BE_CLEANUP		/* libbe cleanup errors */
} err_type;
Public definitions:
typedef struct err_info err_info_t;	/* opaque outside the library */
typedef struct err_info_list {
	err_info_t *el_err_info;
	struct err_info_list *next;	/* next entry in the list */
} err_info_list_t;
typedef struct be_handle {
	err_info_t *be_err_info;	/* information for the actual failure */
	err_info_t *be_cleanup_info;	/* information on any needed cleanup */
	err_info_list_t *be_fixed_err_info; /* list of errors fixed internally */
	....
} be_handle_t;
Public Functions:
These functions are used to access the fields in the data structure as
the err_info structure itself will be encapsulated within the library.
/* retrieves error information into the supplied nvlist */
int be_get_err_info(err_info_t *be_err_info, nvlist_t *be_err_nvl);
/* retrieves any cleanup information needed due to error */
int be_get_cleanup_info(err_info_t *be_cleanup_info, nvlist_t *be_cleanup_nvl);
/* closes the library handle and frees up the error and clean-up information. */
int be_close_handle(be_handle_t *be_hd);
The information from these nvlists is then pulled into specific dictionaries
for these types of errors within the libbe Python module and then returned to
consumers of the module. The information can then be used as the consumer
chooses.
Deliverables:
- In addition to library changes to support the components mentioned
above we will need to add the following:
- Addition of code that will try to determine what the user may need
to do to correct the problem.
- Addition of a handle that will be passed back from library calls.
This handle will contain the error, cleanup and corrected error
information. Accessor functions will be used to retrieve the error
information out of the error structures attached to this handle.
- Additional documentation on the existence of this error
information
- Addition of the html content that describes possible solutions
to various errors. This will at first be minimal with more
information added as more errors are found where a solution can't
be determined from the available information.