Cool, sounds great.

I think all 3 models can co-exist (since each serves a good purpose); it'd be 
interesting to see how the POC 'engine' can become a taskflow 'engine' (aka the 
lazy_engine).

As to scalability, I agree the lazy_engine would be nicer, but how much more 
scalable it would be is tough to quantify (the openstack systems that have active 
conductors, aka model #2, seem to scale pretty well).

Of course there are some interesting questions laziness brings up; it'd be 
interesting to see how the POC addressed them.

Some questions I can think of (currently); maybe you can address them in the 
other thread (which is fine too).

What does the watchdog do? Is it activated periodically to 'reap' jobs that 
have timed out (or have gone past some time limit)? How does the watchdog know 
that it is reaping jobs that are not actively being worked on (a timeout likely 
isn't sufficient for jobs that just take a very long time)? Is there a 
connection into zookeeper (or some similar system) to do this kind of 
'liveness' verification instead? What does the watchdog do when reaping tasks 
(revert them, retry them, other..?)
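
To make the timeout-vs-liveness distinction concrete, something like the 
following purely hypothetical sketch (none of these names exist in taskflow or 
mistral) is what I'd expect a reaper to need: a stale timestamp alone isn't 
enough, so some liveness signal (say, an ephemeral zookeeper lock held by the 
job's owner) gets consulted too:

    import time

    STALE_AFTER = 60 * 60  # an assumed inactivity threshold (seconds)

    def find_reapable(jobs, owner_is_alive):
        """Yield jobs that are past the inactivity threshold *and* whose
        owner no longer appears alive (its liveness lock/node is gone)."""
        now = time.time()
        for job in jobs:
            timed_out = (now - job['last_heartbeat']) > STALE_AFTER
            abandoned = not owner_is_alive(job['owner'])
            if timed_out and abandoned:
                yield job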

I'm not quite sure how taskflow would use mistral as a client for this 
watchdog, since the watchdog process is pretty key to the lazy_engine's 
execution model, and it seems like it would be a bad idea to split that logic 
from the actual execution model itself (seeing that the watchdog is involved in 
the execution process, and really isn't external to it). To me the concept of 
the lazy_engine is similar to the case where an engine 'crashes' while running; 
in a way the lazy_engine 'crashes on purpose' after asking a set of workers to 
do some action (and hands over the resumption of 'itself' to this watchdog 
process). The watchdog then watches over the workers, and on a response from some 
worker the watchdog resumes the engine and then lets the engine 'crash on 
purpose' again (and repeat). So the watchdog <-> lazy_engine execution model 
seems to be pretty interconnected.
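
To illustrate the 'crash on purpose' cycle (a hypothetical sketch only; 
lazy_engine_step, watchdog_on_response and the storage/workers objects are 
invented names, not taskflow APIs):

    def lazy_engine_step(storage, workers):
        """Dispatch whatever is ready, persist the state, then go away."""
        for task in storage.ready_tasks():
            workers.request(task)               # ask some worker to run it
            storage.set_state(task, 'RUNNING')  # persist before 'crashing'
        # nothing stays alive here for the duration of the task itself

    def watchdog_on_response(storage, workers, response):
        """Called by the always-running watchdog when a worker responds."""
        storage.save_result(response.task, response.result)
        storage.set_state(response.task, 'SUCCESS')
        # resume the engine so it can dispatch the next batch of tasks,
        # then let it 'crash on purpose' again
        lazy_engine_step(storage, workers)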

-Josh

From: Dmitri Zimine <d...@stackstorm.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org>
Date: Wednesday, March 26, 2014 at 2:12 PM
To: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org<mailto:openstack-dev@lists.openstack.org>>
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

=== Long-running delegate [1] actions ===

Yes, the third model of lazy / passive engine is needed.

Obviously workflows contain a mix of different tasks, so this 3rd model should 
handle both normal tasks (run on a worker and return) and long running 
delegates. The "active mechanism which is alive during the process", currently 
done by the TaskFlow engine, may be moved from the TaskFlow library to a client 
(Mistral) which implements the watchdog. This may require a lower-level API to 
TaskFlow.

The benefit of model 2 is 'ease of use' for some clients (create tasks, 
define a flow, instantiate an engine, engine.run(), that's it!). But I agree that 
model 2 - the worker-based TaskFlow engine - won't scale to WFaaS requirements, 
even though the engine is not doing much.
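
For reference, that 'ease of use' is roughly the following (a minimal local 
sketch against the public TaskFlow API; the CreateVolume/AttachVolume tasks are 
made-up placeholders, and the worker-based variant of model 2 wires the same 
flow to remote workers via engines.load() options instead of running it 
locally):

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class CreateVolume(task.Task):
        def execute(self):
            print("creating volume")

    class AttachVolume(task.Task):
        def execute(self):
            print("attaching volume")

    # create tasks, define a flow, run it -- the engine (and the process
    # holding it) stays alive until the whole flow completes
    flow = linear_flow.Flow('provision').add(CreateVolume(), AttachVolume())
    engines.run(flow)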

Mistral POC implements a passive, lazy workflow model: a service moving the 
states of multiple parallel executions. I'll detail how Mistral handles 
long running tasks in a separate thread (maybe here 
http://tinyurl.com/n3v9lt8) and we can look at how TaskFlow may change to fit.

DZ>

PS. Thanks for clarifications on the target use cases for the execution models!


[1] Calling them 'delegate actions' to distinguish between long running 
computations on workers, and actions that delegate to 3rd party systems (hadoop 
job, human input gateway, etc).


On Mar 24, 2014, at 11:51 AM, Joshua Harlow 
<harlo...@yahoo-inc.com> wrote:

So getting back to this thread.

I'd like to split it up into a few sections to address the HA and 
long-running-actions cases, which I believe are 2 separate (but connected) 
questions.

=== Long-running actions ===

First, let me describe a little bit about what I believe are the execution 
models that taskflow currently targets (but is not limited to just targeting in 
general).

The first execution model I would call the local execution model. This model 
involves forming tasks and flows and then executing them inside an application; 
that application is running for the duration of the workflow (although if it 
crashes it can re-establish the tasks and flows that it was doing and attempt to 
resume them). This could also be what openstack projects would call the 
'conductor' approach, where nova, ironic, trove have a conductor which manages 
these long-running actions (the conductor is alive/running throughout the 
duration of these workflows, although it may be restarted while running). The 
restarting + resuming part is something that openstack hasn't handled very 
gracefully so far, typically requiring some type of cleanup at restart (or by 
operations); with taskflow using this model, the resumption part makes it 
possible to resume from the last saved state (this connects into the 
persistence model that taskflow uses, the state transitions, how execution 
occurs itself...).
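
As a rough sketch of how that resumption hooks into persistence (helper names 
like temporary_flow_detail are from taskflow as I recall them and have moved 
around between releases, and the sqlite connection string is just an example, 
so treat this as illustrative rather than exact):

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow
    from taskflow.persistence import backends as persistence_backends
    from taskflow.utils import persistence_utils

    class LongStep(task.Task):
        def execute(self):
            print("doing the long step")

    flow = linear_flow.Flow('resumable').add(LongStep())

    # every state transition and intermediate result gets written here,
    # which is what makes resumption after a crash possible
    backend = persistence_backends.fetch({'connection': 'sqlite:///flows.db'})
    book, flow_detail = persistence_utils.temporary_flow_detail(backend)

    engine = engines.load(flow, backend=backend, book=book,
                          flow_detail=flow_detail)
    engine.run()  # a restarted conductor loading the same flow_detail resumes
                  # from the last saved state instead of redoing finished work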

The second execution model is an extension of the first, whereby there is still 
a type of 'conductor' that is managing the life-time of the workflow, but 
instead of locally executing tasks in the conductor itself, tasks are now 
executed on remote workers (see http://tinyurl.com/lf3yqe4). The engine is 
currently still 'alive' for the life-time of the execution, although the work 
that it is doing is relatively minimal (since it's not actually executing any 
task code, but proxying those requests to other workers). The engine while 
running does the conducting of the remote workers (saving persistence details, 
doing state-transitions, getting results, sending requests to workers...).
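
Roughly, the wiring for this looks like the following (option names such as 
exchange/topics/url are approximate and have shifted between taskflow releases, 
and the broker URL, topic and 'myapp.tasks' module are made up). The conductor 
side:

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class CreateVolume(task.Task):
        def execute(self):
            print("creating volume")

    flow = linear_flow.Flow('provision').add(CreateVolume())

    # the engine stays alive, but only proxies task requests to remote
    # workers over the transport and records states/results as they arrive
    engine = engines.load(flow, engine='worker-based',
                          url='amqp://guest:guest@localhost:5672//',
                          exchange='taskflow',
                          topics=['worker-1'])
    engine.run()

And the worker side (a separate process, usually on another host):

    from taskflow.engines.worker_based import worker

    # pulls task requests off the transport, runs them locally and sends
    # the results back to whichever engine asked
    w = worker.Worker(url='amqp://guest:guest@localhost:5672//',
                      exchange='taskflow',
                      topic='worker-1',
                      tasks=['myapp.tasks'])  # module exposing CreateVolume
    w.run()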

As you have already stated, if a task is going to run for 5+ days (some really 
long hadoop job for example) then these 2 execution models may not be suited 
to this type of usage, due to the current requirement that the engine 
overseeing the work must be kept alive (since something needs to receive 
responses and deal with state transitions and persistence). If the desire is to 
have a third execution model, one that can handle extremely long-running 
tasks without needing an active mechanism that is 'alive' during this process, 
then I believe that would call for the creation of a new engine type in 
taskflow (https://github.com/openstack/taskflow/tree/master/taskflow/engines) 
that deals with this use-case. I don't believe it would be hard to create this 
engine type, although it would involve more complexity than what exists, 
especially since there needs to be some 'endpoint' that receives responses when 
the 5+ day job actually finishes (so in this manner some type of code must be 
'always' running to deal with these responses anyway). So there would likely 
need to be a 'watchdog' process that would always be running and would itself 
do the state-transitions and result persistence (and so on); in a way this 
would be a 'lazy' version of the above first/second execution models.

=== HA ===

So this is an interesting question, and to me is strongly connected to how your 
engines are executing (and the persistence and state-transitions that they go 
through while running). Without persistence of state and transitions there is 
no good way (a bad way of course can be created, by just redoing all the work, 
but that's not always feasible or the best option) to accomplish resuming in a 
sane manner, and there is also imho no way to accomplish any type of automated 
HA of workflows. Since taskflow was conceived to manage the states and 
transitions of tasks and flows, it gains the ability to do this resuming, and it 
also gains the ability to automatically provide execution HA to its users.

Let me describe:

Say you save the states of a workflow, and any intermediate results of that 
workflow, to some database (for example). The engine being used (see above 
models, for example the conductor type) lives in an application that may be 
prone to crashes (or to just being powered off due to software upgrades...). 
Since taskflow's key primitives were made to allow for resuming when a crash 
occurs, it is relatively simple to allow another application (also running a 
conductor) to resume whatever that prior application was doing when it crashed. 
Now most users of taskflow don't want to have to do this resumption manually 
(although they can if they want), so it would be expected that the other 
instances of that application that are running would automatically 'know' how 
to 'take over' the work of the failed application. This is where the concept of 
taskflow's 'jobboard' (http://tinyurl.com/klg358j) comes into play, where a 
jobboard can be backed by something like zookeeper (which provides 
notifications of lock loss/release to others automatically). The jobboard is 
the place where the other applications would be looking to 'take over' the 
failed application's work (by using zookeeper 'atomic' primitives designed for 
this type of usage), and they would also release the work back for others to 
'take over' when their own zookeeper connection is lost (zookeeper handles this 
natively).
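
Roughly, the take-over loop every standby application runs looks something like 
this (helper names per the taskflow jobboard docs as best I recall; 
'conductor-1' and the zookeeper address are made up):

    from taskflow import exceptions as exc
    from taskflow.jobs import backends as job_backends

    board = job_backends.fetch('my-board',
                               {'board': 'zookeeper',
                                'hosts': ['localhost:2181']})
    board.connect()

    for job in board.iterjobs(only_unclaimed=True, ensure_fresh=True):
        try:
            board.claim(job, 'conductor-1')   # atomic: only one claimant wins
        except exc.UnclaimableJob:
            continue                          # someone else grabbed it first
        # ... load the logbook/flow_detail the job points at, resume the
        #     engine from its last saved state ...
        board.consume(job, 'conductor-1')     # finished: remove it from board

If 'conductor-1' dies mid-way, its ephemeral zookeeper lock goes away and the 
job shows up as unclaimed again for the next application that looks.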

--

Now as for how much of mistral would change from the above, I don't know, but 
that's why it's a POC.

-Josh

From: Joshua Harlow <harlo...@yahoo-inc.com>
Date: Friday, March 21, 2014 at 1:14 PM
To: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org<mailto:openstack-dev@lists.openstack.org>>
Cc: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org<mailto:openstack-dev@lists.openstack.org>>
Subject: Re: [openstack-dev] [Mistral][TaskFlow] Long running actions

Will advise soon; out sick with a not-so-fun case of poison oak, will reply next 
week (hopefully) when I'm less incapacitated...

Sent from my really tiny device...

On Mar 21, 2014, at 3:24 AM, "Renat Akhmerov" 
<rakhme...@mirantis.com> wrote:

Valid concerns. It would be great to get Joshua involved in this discussion. If 
it’s possible to do in TaskFlow he could advise on how exactly.

Renat Akhmerov
@ Mirantis Inc.



On 21 Mar 2014, at 16:23, Stan Lagun 
<sla...@mirantis.com> wrote:

Don't forget HA issues. Mistral can be restarted at any moment and needs to be 
able to proceed, on another instance, from the place where it was interrupted. In 
theory this can be addressed by TaskFlow, but I'm not sure it can be done without 
a complete redesign of it.


On Fri, Mar 21, 2014 at 8:33 AM, W Chan 
<m4d.co...@gmail.com> wrote:
Can the long running task be handled by putting the target task in the workflow 
in a persisted state until either an event triggers it or a timeout occurs?  An 
event (human approval or trigger from an external system) sent to the transport 
will rejuvenate the task.  The timeout is configurable by the end user up to a 
certain time limit set by the mistral admin.

Based on the TaskFlow examples, it seems like the engine instance managing the 
workflow will be in memory until the flow is completed.  Unless there are other 
options to schedule tasks in TaskFlow, if we have too many of these workflows 
with long running tasks, it seems like it'll become a memory issue for mistral...


On Thu, Mar 20, 2014 at 3:07 PM, Dmitri Zimine 
<d...@stackstorm.com> wrote:

For the 'asynchronous manner' discussion see http://tinyurl.com/n3v9lt8; I'm 
still not sure why you would want to make is_sync/is_async a primitive concept in 
a workflow system; shouldn't this be only up to the entity running the workflow 
to decide? Why is a task allowed to be sync/async? That has major side-effects 
for state-persistence, resumption (and to me is an incorrect abstraction to 
provide) and general workflow execution control, so I'd be very careful with this 
(which is why I am hesitant to add it without much, much more discussion).

Let's remove the confusion caused by "async". All tasks [may] run async from 
the engine standpoint, agreed.

"Long running tasks" - that's it.

Examples: wait_5_days, run_hadoop_job, take_human_input.
The task doesn't do the job: it delegates to an external system. The flow 
execution needs to wait (5 days passed, hadoop job finished with data x, user 
inputs y), and then continue with the received results.

The requirement is to survive a restart of any WF component without losing the 
state of the long running operation.

Does TaskFlow already have a way to do it? Or ongoing ideas, considerations? If 
yes let's review. Else let's brainstorm together.

I agree,
that has major side-effects for state-persistence, resumption (and to me is an 
incorrect abstraction to provide) and general workflow execution control, I'd 
be very careful with this
But these requirements come from customers' use cases: wait_5_days - lifecycle 
management workflows; long running external systems - Murano requirements; user 
input - workflows for operations automation with control gate checks, 
provisioning which requires 'approval' steps, etc.

DZ>


--
Sincerely yours
Stanislav (Stan) Lagun
Senior Developer
Mirantis
35b/3, Vorontsovskaya St.
Moscow, Russia
Skype: stanlagun
www.mirantis.com
sla...@mirantis.com