Hi Cecilia,

On Apr 18, 2012, at 9:23 AM, Cheng, Cecilia S (388K) wrote:

> Hi Cynthia,
> 
> I think the most important point about shutting down the components
> gracefully is so that tasks / jobs aren't lost. There are ways to achieve
> that even though the 'stop' commands execute a brute 'kill'.

Agreed -- one way is simply to maintain state, which the RM and WM in
the current Apache OODT trunk already do: the RM in its JobRepository
(interface), and the WM in its WorkflowInstanceRepository (interface).
Both are updated periodically at different stages of execution.

Then the goal is to take that state and have some simple commands that
read it on startup and decide what to do. This is what Brian did really
well in his wengine-branch, and what we are working to do in the trunk
right now.
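
To make the "maintain state, then read it on startup" idea concrete,
here is a minimal sketch in Python. It is not OODT code: the class
name FileJobStateStore, the JSON file layout, and the status strings
are all assumptions made up for illustration. It only shows that state
written at each stage survives a brute kill and can be read back by
the next process.

```python
import json
import os


class FileJobStateStore:
    """Hypothetical stand-in for a job / workflow instance repository."""

    # Assumed terminal statuses; a real repository defines its own.
    TERMINAL = ("FINISHED", "FAILED")

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        # On startup, reload whatever state the previous process left behind.
        if os.path.exists(path):
            with open(path) as f:
                self.jobs = json.load(f)

    def update(self, job_id, status):
        """Record a status change and persist it immediately."""
        self.jobs[job_id] = status
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.jobs, f)
        # Replace the old file in one step so a kill never leaves a
        # half-written state file.
        os.replace(tmp, self.path)

    def unfinished(self):
        """Return the jobs that had not reached a terminal status."""
        return {j: s for j, s in self.jobs.items() if s not in self.TERMINAL}
```

A new FileJobStateStore pointed at the same path after a kill sees the
last status written for every job, which is the input the startup
commands need.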

> 
> For example, you can pause the RM, so that no more jobs will be sent to
> the batch stubs, then wait until all those running jobs are done before
> you shut down the RM. Upon restart of the RM, the RM will rebuild its Q
> from the state before the shutdown. Please note that these capabilities
> are implemented in the branched RM.

For the community: when Cecilia says "branched RM", she is talking about
the work that the ACOS project is doing at JPL. I'm encouraging them to
work with the Apache community here on list to get those patches vetted
by the PMC, and into the next version of the trunk (hopefully 0.5 once
0.4 is released -- yes I know we are behind -- /flails self ;) ).

> ACOS has tested the pause capability,
> but not the rebuild capability.

All that's needed in trunk workflow is a command to read the current
JobRepository history and make some decisions as to what to do with
Jobs that aren't finished, based on the information captured about
them. Folks are welcome to file a JIRA issue and work towards a
solution for that. I'd be happy to help shepherd it in.
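
As a rough illustration of what such a command might decide, here is a
small Python sketch. Everything in it is hypothetical: JobRecord,
Action, the status strings, and the policy itself are assumptions for
discussion, not the trunk API or the ACOS implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SKIP = "skip"          # job already reached a terminal state
    REQUEUE = "requeue"    # job never started; put it back on the queue
    RESUBMIT = "resubmit"  # job was running at the kill; outcome unknown


@dataclass
class JobRecord:
    """Hypothetical view of one entry in the job history."""
    job_id: str
    status: str  # assumed values: QUEUED, EXECUTING, FINISHED, FAILED


def plan_recovery(records):
    """Map each job id to the action to take on restart."""
    plan = {}
    for rec in records:
        if rec.status in ("FINISHED", "FAILED"):
            plan[rec.job_id] = Action.SKIP
        elif rec.status == "QUEUED":
            plan[rec.job_id] = Action.REQUEUE
        else:
            # Anything else was in flight when the process died.
            plan[rec.job_id] = Action.RESUBMIT
    return plan
```

Whether an in-flight job should be resubmitted automatically or held
for an operator is a policy choice; the sketch picks resubmission only
to keep the example short.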

> 
> You can do something similar to that in the WEngine as well.

Agreed. This is what Brian Foster and I already suggested doing.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
