Hi Karen,

Thanks for writing this up - some comments in-line... :)

On 06/18/10 12:34 AM, Karen Tung wrote:
> Here's a summary of discussions we had this morning
> on the Install Execution Engine design.
>
> 1) Application's usage of InstallEngine.execute_checkpoints() and threading:
>
> - If an application calls execute_checkpoints() with a callback function,
>   execute_checkpoints() will return after all checkpoints are instantiated.
>   When the thread executing checkpoints is completed, the callback function
>   provided by the application will be called.

This sounds good, but I'd like to be sure that when designing the callback
mechanism, the method being called will be provided with suitable status
information - nothing much, just enough to know whether it was a success,
failure, or partial failure - so that it can do the correct thing.

I see two ways that this callback could be called:

1) From the Checkpoint-Execution-Thread - i.e. the thread handling the running
   of checkpoints in sequence would, on completion/failure, call the callback.

2) The Engine calls the callback after it has been notified, through some
   "internal" mechanism, that the Checkpoint-Execution-Thread has completed.

I don't have a preference for which is best, but whichever is chosen, I think
it will be important to document whether the callback will be called from
another thread (as in case 1), or from the same thread that the Engine was
called from (case 2).

>
> - If an application calls execute_checkpoints() function without
>   providing a callback function, execute_checkpoints() will not
>   return until all checkpoints are executed.

Agreed.

>
> 2) Canceling a checkpoint
>
> - It's the application's responsibility to setup a signal handler to process
>   signals such as control-c.

Agreed.

>
> - When the engine receives a cancel_checkpoints() request, it will
>   call the cancel() function of the checkpoint that's executing.
>
> - The default implementation of the AbstractCheckpoint.cancel() function
>   will be to set a threading.Event variable.  Checkpoints that do not
>   override the default cancel() implementation should check the value of
>   this variable using the is_set() function, and perform the necessary
>   cleanup and exit.

I think it's important to mention here that it's understood that not all
checkpoints can be cleanly cancelled; as such, checking this variable is
really only meant for checkpoints with potentially lengthy execution times,
and with some acceptable "breaks" in that execution at which they can cleanly
cease any further action.
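
As a rough Python sketch of the default behaviour (the method bodies here
are illustrative, not the actual engine code):

    import threading
    import time

    class AbstractCheckpoint(object):
        def __init__(self, name):
            self.name = name
            self._cancel_requested = threading.Event()

        def cancel(self):
            # Default implementation: just flag the request; execute()
            # is expected to poll is_set() at convenient break points.
            self._cancel_requested.set()

    class LengthyCheckpoint(AbstractCheckpoint):
        def execute(self):
            for step in range(100):
                # An acceptable "break" in a lengthy operation: check
                # between units of work, clean up, and stop if asked.
                if self._cancel_requested.is_set():
                    return
                time.sleep(0.1)   # stand-in for one unit of real work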

>
> - Checkpoints that do not want to use the default cancel()
>   implementation can override it with their own
>   implementation when they subclass the AbstractCheckpoint object.

I would hope that this wouldn't happen too much, but it's good to have it
available.
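
For example (hypothetical - building on the AbstractCheckpoint sketch
above), a checkpoint that drives an external command might override
cancel() to terminate the child process rather than rely on polling:

    import subprocess

    class CommandCheckpoint(AbstractCheckpoint):
        def execute(self):
            self._proc = subprocess.Popen(["some", "long", "command"])
            self._proc.wait()

        def cancel(self):
            AbstractCheckpoint.cancel(self)   # still set the Event
            proc = getattr(self, "_proc", None)
            if proc is not None and proc.poll() is None:
                proc.terminate()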

>
> 3) stop-on-error
>
> - If stop-on-error is false, the engine will continue executing all
>   checkpoints despite exceptions from one or more of the checkpoints.

Agreed.
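
Something along these lines, I imagine (illustrative names only):

    def run_all(checkpoints, stop_on_error=True):
        errors = []
        for cp in checkpoints:
            try:
                cp.execute()
            except Exception as err:
                errors.append((cp.name, err))
                if stop_on_error:
                    break
            # snapshot-taking after each checkpoint would go here
        return errors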

>
> - DOC and/or ZFS snapshots will be taken after each checkpoint is
>   executed, despite the exception(s).  If the application wants to
>   resume at a previously failed checkpoint and the stop-on-error flag is
>   false, the application is allowed to resume at that checkpoint if other
>   resume requirements are met.

Agreed.

>
> 4) AbstractCheckpoint.get_progress_estimate()
>
> - This function will return the number of seconds it takes to execute the
>   checkpoint, as measured by the wall-clock, on a standardized
>   machine.

Agreed.
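
For what it's worth, I picture a typical implementation as nothing more
than returning a constant (the class and value below are made up):

    class TransferCheckpoint(AbstractCheckpoint):
        def get_progress_estimate(self):
            # Wall-clock seconds on the standardized machine
            return 180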

>
> - Developers who do not have access to the standardized machine, or for
>   whom the standardized machine has become obsolete, can run one of the
>   existing checkpoints that performs a similar operation to their checkpoint
>   on any available machine, and use that as guidance to figure out the
>   approximate number of seconds it takes to run the newly developed
>   checkpoint.

I think this is where we might benefit from having a utility program that
provides these measurements - or at minimum a documented process for
exactly how to perform this on the standardized machine, and on a
non-standardized machine (i.e. a future machine).
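
Even something as small as this (hypothetical helper) would do, provided
the process around it is written down:

    import time

    def measure_checkpoint(checkpoint):
        # Wall-clock time of a single execute() run, suitable as the
        # value for get_progress_estimate() on this machine.
        start = time.time()
        checkpoint.execute()
        return time.time() - start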

>
> 5) Keith's question about using Error Service module (errsvc) for storing
>    exceptions raised by the checkpoints, instead of storing the exceptions
>    as a list.
>
> - the Error Service module is suitable and can be used with some
>   modifications.
>
> - The ErrorInfo object can be used to store the exception.
>   The mod_id in the ErrorInfo object can be used for storing the name of the
>   checkpoint that raised the exception.
>
> - As currently implemented, the ErrorInfo object only accepts "integer" and
>   "string" as the error data type.  It needs to be modified to accept an
>   "object", which will be used for storing the exception raised by the
>   checkpoint.

Is it really necessary to store the object? Would a str(object) not suffice?
I'm mainly wondering if it's worth the effort to add support for objects to
the API, when I really don't think anyone needs that much information.
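
For instance (illustrative only - not the errsvc API), capturing the
string form plus the traceback at the point of failure preserves most of
what an application would want to report:

    import traceback

    try:
        checkpoint.execute()
    except Exception as err:
        summary = str(err)                 # short form for the ErrorInfo
        details = traceback.format_exc()   # full stack trace as a string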

>
> 6) After a checkpoint completes successfully, the engine will always send
>    a progress update to the logger on the overall percentage complete.  This
>    allows accurate progress to be reported even if a checkpoint does not
>    report intermediate progress.

Agreed.
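
Presumably something along these lines, using the checkpoints' estimates
as weights (sketch only):

    def report_overall_progress(logger, completed_estimates, total_estimate):
        percent = 100.0 * sum(completed_estimates) / total_estimate
        logger.info("Overall progress: %.0f%% complete", percent)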

Thanks,

Darren.