Hi Cecilia,

On Apr 18, 2012, at 9:23 AM, Cheng, Cecilia S (388K) wrote:
> Hi Cynthia,
>
> I think the most important point about shutting down the components
> gracefully is so that tasks / jobs aren't lost. There are ways to achieve
> that even though the 'stop' commands execute a brute 'kill'.

Agreed -- one way is to simply maintain state, which the current Apache OODT
trunk RM and WM do, for the RM in its Job Repository (interface), and in the
WM via the WorkflowInstanceRepository (interface). These are updated
periodically at different stages of execution. Then, the goal is to take that
state, and have some simple commands to read that state on startup, and decide
what to do. This was what Brian really did great in his wengine-branch and
what we are working to do in the trunk right now.

>
> For example, you can pause the RM, so that no more jobs will be sent to
> the batch stubs, then wait until all those running jobs are done before
> you shut down the RM. Upon restart of the RM, the RM will rebuild its Q
> from the state before the shutdown. Please note that these capabilities
> are implemented in the branched RM.

For the community when Cecilia says "branched RM", she is talking about the
work that the ACOS project is doing at JPL. I'm encouraging them to work with
the Apache community here on list to get those patches vetted by the PMC, and
into the next version of the trunk (hopefully 0.5 once 0.4 is released -- yes
I know we are behind -- /flails self ;) ).

> ACOS has tested the pause capability,
> but not the rebuild capability.

All that's needed in trunk workflow is a command to read the current
JobRepository history and make some decisions as to what to do with Jobs that
aren't finished, based on the information captured about them. Folks are
welcome to file a JIRA issue and work towards a solution for that. I'd be
happy to help shepherd it in.

>
> You can do something similar to that in the WEngine as well.

Agreed. This is what Brian Foster and I already suggested doing.

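To make the pause / drain / rebuild idea above concrete, here is a minimal
sketch in Python. It is not OODT code and does not use the OODT API:
ToyJobRepository, ToyResourceManager, the JSON file, and the three status
strings are all made up for illustration. It also assumes that any job not
marked DONE is safe to requeue on restart, which a real deployment would have
to decide per job type.

```python
import json
import os
import tempfile
from collections import deque

# Hypothetical job states; not OODT's actual status constants.
QUEUED, RUNNING, DONE = "QUEUED", "RUNNING", "DONE"


class ToyJobRepository:
    """File-backed job state store (illustrative stand-in only)."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        if os.path.exists(path):
            with open(path) as f:
                self.jobs = json.load(f)

    def update(self, job_id, status):
        # Persist on every state change so a brute 'kill' loses no state.
        self.jobs[job_id] = status
        with open(self.path, "w") as f:
            json.dump(self.jobs, f)


class ToyResourceManager:
    def __init__(self, repo):
        self.repo = repo
        self.paused = False
        # Rebuild the queue on startup: anything not DONE is requeued
        # (assumed policy; jobs must be safe to rerun).
        self.queue = deque(
            job_id for job_id, status in repo.jobs.items() if status != DONE
        )
        for job_id in self.queue:
            repo.update(job_id, QUEUED)

    def submit(self, job_id):
        self.queue.append(job_id)
        self.repo.update(job_id, QUEUED)

    def pause(self):
        self.paused = True

    def dispatch_next(self):
        # While paused, nothing new is handed out to the batch stubs.
        if self.paused or not self.queue:
            return None
        job_id = self.queue.popleft()
        self.repo.update(job_id, RUNNING)
        return job_id

    def complete(self, job_id):
        self.repo.update(job_id, DONE)

    def running(self):
        return [j for j, s in self.repo.jobs.items() if s == RUNNING]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "jobs.json")
    rm = ToyResourceManager(ToyJobRepository(path))
    for job_id in ("job-1", "job-2", "job-3"):
        rm.submit(job_id)
    started = rm.dispatch_next()   # job-1 starts running
    rm.pause()                     # no further dispatches
    rm.complete(started)           # drain: wait for running jobs to finish
    # "Restart": a fresh manager reads the persisted state.
    restarted = ToyResourceManager(ToyJobRepository(path))
    return list(restarted.queue)
```

Calling demo() pauses after the first dispatch, lets that job finish, and then
builds a second manager from the same file, so the two jobs that never ran come
back in their original order. The same startup code also covers the brute
'kill' case: a job left in RUNNING is simply requeued.
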
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
