Quick summary of interesting discussions yesterday at the summit that relate to 
things we will face in Solum wrt async flows.

The two nova sessions on async work [1] and the task API [2] had a lot of good 
back and forth.  The problem space is how to model and convey long running 
tasks in the nova API, and then how to start moving long running tasks into a 
consistent place in the nova code base.  There appeared to be broad consensus 
on the rough shape of an API, and that this move should and would happen in 
Icehouse for a few important tasks (such as snapshot).  There are still a lot 
of open questions about how best to handle the hard problems: flow state 
persistence, read/write access patterns into a persistent store, and how to 
make tasks idempotent across retries and in the face of partitions and 
distributed transactions.
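
To make "idempotent across retries" concrete, here is a rough sketch in plain 
Python.  None of this is nova or taskflow code: InMemoryStore, run_once and 
retry are names I made up for illustration, and the dict stands in for whatever 
persistent store would really be used.

```python
class InMemoryStore:
    """Hypothetical stand-in for a persistent store of step results."""

    def __init__(self):
        self.results = {}


def run_once(store, task_id, step, func):
    """Run func only if this (task, step) has no recorded result yet."""
    key = (task_id, step)
    if key in store.results:
        return store.results[key]
    result = func()
    store.results[key] = result
    return result


def retry(func, attempts=3):
    """Call func until it succeeds or the attempts are used up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except RuntimeError as exc:
            last_error = exc
    raise last_error


calls = {"snapshot": 0, "upload": 0}


def snapshot():
    calls["snapshot"] += 1
    return "snap-1"


def upload():
    calls["upload"] += 1
    if calls["upload"] == 1:
        # Simulate a transient failure on the first attempt.
        raise RuntimeError("simulated partition")
    return "uploaded"


def snapshot_task(store, task_id):
    snap = run_once(store, task_id, "snapshot", snapshot)
    status = run_once(store, task_id, "upload", upload)
    return status, snap


store = InMemoryStore()
result = retry(lambda: snapshot_task(store, "task-42"))
```

The retried task skips the snapshot step because its result was already 
recorded.  A crash between running a step and recording its result would still 
repeat the side effect, which is exactly the kind of hard problem that remains 
open.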

A highlight for me was that it almost exactly (down to a very low level) 
matched a set of discussions we've been having in OpenShift.  The problem space 
is the same: you have a virtual resource (application) that manifests as a 
distributed set of servers that must be coordinated.  You want to create (but 
create can be long running and can fail very late in the flow), you can start 
and restart these resources (usually in parallel), delete needs to be able to 
cut across a deep queue of work, and you need to allow multiple operations to 
execute in parallel (this isn't yet a nova problem, but it will be a Heat/Solum 
problem).  These are all application life cycle problems that Heat and Solum 
will have to deal with, with Solum potentially providing a thin layer on top 
of the Heat calls (or no layer).

The other session was glance and taskflow [3] - they had general consensus to 
move ahead with their task API on top of a task flow implementation for a few 
of their existing long running tasks.  Someone from cinder talked about their 
experience - some of the known gaps in task flow include restart of a job at a 
previous checkpoint (there are other domain problems on top of that of course) 
as well as the distributed execution engine for task flow (that would allow 
work to be more easily distributed across a cluster).  Follow-up discussion 
covered the need for general collaboration across the teams on demonstrating 
patterns of use around the harder problems (restart of flows, different types 
of distributed retry and failure recovery, idempotent calls).
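
As an illustration of "restart of a job at a previous checkpoint", a minimal 
sketch in plain Python.  This is not the taskflow API: run_flow, the step 
names and the state dict (standing in for persisted flow state) are all 
assumptions of mine.

```python
def run_flow(steps, state):
    """Run steps in order, skipping any already recorded in state."""
    done = state.setdefault("done", [])
    for name, func in steps:
        if name in done:
            continue  # completed in an earlier run
        func()
        done.append(name)  # checkpoint after each completed step
    return done


log = []
attempts = {"import": 0}


def download():
    log.append("download")


def import_image():
    attempts["import"] += 1
    if attempts["import"] == 1:
        # Simulate the worker dying part way through the flow.
        raise RuntimeError("simulated crash")
    log.append("import")


def activate():
    log.append("activate")


steps = [
    ("download", download),
    ("import", import_image),
    ("activate", activate),
]

state = {}
try:
    run_flow(steps, state)
except RuntimeError:
    pass
after_first_run = list(state["done"])

# A second run resumes after the last checkpoint; it does not start over.
run_flow(steps, state)
```

The second run picks up at the import step and never repeats the download.  
The domain problems on top of that (is the half-finished import safe to 
repeat?) are not addressed here.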

For Solum, I think we need to seriously prototype a few relevant long running 
tasks (create, build, deploy) using task flow and get familiar with the model.  
Likewise, we need to follow the task API work in nova and glance closely, and 
work with heat and others to track it.
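
A starting point for that kind of prototype could look like the sketch below.  
It is loosely modelled on the execute/revert pairing that taskflow tasks use, 
but it is plain Python and not the taskflow API; Step and run_linear are 
hypothetical names, and the deploy failure is simulated.

```python
class Step:
    """Hypothetical unit of work with an undo action."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def execute(self, log):
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        log.append(f"execute {self.name}")

    def revert(self, log):
        log.append(f"revert {self.name}")


def run_linear(steps, log):
    """Run steps in order; on failure, revert completed ones in reverse."""
    completed = []
    try:
        for step in steps:
            step.execute(log)
            completed.append(step)
    except RuntimeError:
        for step in reversed(completed):
            step.revert(log)
        return False
    return True


log = []
ok = run_linear(
    [Step("create"), Step("build"), Step("deploy", fail=True)],
    log,
)
```

Here deploy fails late in the flow, so build and then create are reverted in 
that order.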

[1] https://etherpad.openstack.org/p/IcehouseConductorTasksNextSteps
[2] https://etherpad.openstack.org/p/IcehouseTaskAPI
[3] https://etherpad.openstack.org/p/icehouse-summit-taskflow-and-glance
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
