Quick summary of interesting discussions yesterday at the summit that relate to
things we will face in Solum wrt async flows.
The two nova sessions on async work [1] and the task API [2] had a lot of good
back and forth. The problem space is how to model and convey long running
tasks in the nova API, and then how to start moving long running tasks into a
consistent place in the nova code base. There appeared to be broad consensus
on the rough shape of an API, and that this move should and would happen in
Icehouse for a few important tasks (e.g. snapshot), but there are still a lot
of open questions about how best to handle the hard problems (flow state persistence,
read/write access patterns into a persistent store, how to make tasks
idempotent across retries and in the face of partitions and distributed
transactions).
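
To make the idempotency point concrete, here is a minimal sketch of a retried
task keyed by a request id. Everything in it (ResultStore, run_idempotent,
flaky_snapshot) is a hypothetical name for illustration, not nova or taskflow
code, and an in-memory dict stands in for the persistent store. It does not
address the genuinely hard case, where the side effect completes but recording
the result fails across a partition.

```python
class ResultStore:
    """Hypothetical stand-in for a persistent store, keyed by request id."""

    def __init__(self):
        self._results = {}

    def get(self, key):
        return self._results.get(key)

    def put(self, key, value):
        self._results[key] = value


def run_idempotent(store, request_id, action, max_attempts=3):
    """Run action once per request_id, retrying on transient IOError."""
    cached = store.get(request_id)
    if cached is not None:
        # A previous attempt already finished: return its result, do no work.
        return cached
    last_error = None
    for _ in range(max_attempts):
        try:
            result = action()
        except IOError as exc:
            last_error = exc
            continue
        store.put(request_id, result)
        return result
    raise last_error


calls = []


def flaky_snapshot():
    """Illustrative task: fails on the first call, succeeds afterwards."""
    calls.append(1)
    if len(calls) < 2:
        raise IOError("transient failure")
    return "snapshot-1"


store = ResultStore()
first = run_idempotent(store, "req-1", flaky_snapshot)
second = run_idempotent(store, "req-1", flaky_snapshot)  # served from store
```

The second call with the same request id returns the stored result without
running the task again, which is the property a retrying client relies on.
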
A highlight for me was that it almost exactly (down to a very low level)
matched a set of discussions we've been having in OpenShift. The problem space
is the same - you have a virtual resource (application) that manifests as a
distributed set of servers that must be coordinated. You want to create (but
create can be long running and can fail very late in the flow), you can restart
and start these resources (usually in parallel), delete needs to be able to cut
across a deep queue of work, and (although this isn't yet a nova problem,
it will be a heat/Solum problem) you need to allow multiple operations to
execute in parallel. These are all application life cycle problems that Heat
and Solum will have to deal with - with Solum potentially providing a thin
layer on top of the Heat calls (or no layer).
The other session was glance and taskflow [3] - they had general consensus to
move ahead with their task API on top of a taskflow implementation for a few
of their existing long running tasks. Someone from cinder talked about their
experience - some of the known gaps in taskflow include restart of a job at a
previous checkpoint (there are other domain problems on top of that of course)
as well as the distributed execution engine for taskflow (that would allow
work to be more easily distributed across a cluster). Some follow up
discussion included the need for there to be general collaboration across the
teams on demonstrating patterns of use around the harder problems (restart of
flows, different types of distributed retry and failure recovery, idempotent
calls).
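
As a rough illustration of what "restart of a flow at a previous checkpoint"
means, here is a minimal sketch in plain Python. It does not use the taskflow
API; run_flow, the checkpoint list and the step functions are all hypothetical,
and a real implementation would persist the checkpoint outside the process.

```python
def run_flow(steps, checkpoint):
    """Run (name, func) steps in order, skipping names already checkpointed."""
    for name, func in steps:
        if name in checkpoint:
            continue  # finished in an earlier run, do not repeat it
        func()
        # Record completion only after the step succeeded.
        checkpoint.append(name)
    return checkpoint


executed = []
fail_once = {"deploy": True}


def create():
    executed.append("create")


def build():
    executed.append("build")


def deploy():
    # Fail the first time to simulate a failure late in the flow.
    if fail_once.pop("deploy", False):
        raise RuntimeError("deploy failed late in the flow")
    executed.append("deploy")


steps = [("create", create), ("build", build), ("deploy", deploy)]
checkpoint = []

try:
    run_flow(steps, checkpoint)  # first run stops at deploy
except RuntimeError:
    pass

run_flow(steps, checkpoint)  # resume: only deploy runs again
```

On the second run create and build are skipped, so each step executes exactly
once overall. This only works if the steps themselves are safe to retry, which
is where the idempotency questions come back in.
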
For Solum, I think we need to be seriously prototyping a few relevant long
running tasks (create, build, deploy) using taskflow and getting familiar with
the model. Likewise, we need to be following the task API work in nova and
glance closely, and working with heat and others to track this work.
[1] https://etherpad.openstack.org/p/IcehouseConductorTasksNextSteps
[2] https://etherpad.openstack.org/p/IcehouseTaskAPI
[3] https://etherpad.openstack.org/p/icehouse-summit-taskflow-and-glance