On 30-Sep-15 18:13, Ryan Brown wrote:
> On 09/30/2015 03:10 AM, Anant Patil wrote:
>> Hi,
>>
>> One of the remaining items in convergence is detecting and handling
>> engine (the engine worker) failures, and here are my thoughts.
>>
>> Background: Since the work is distributed among heat engines, heat
>> needs some means to detect a failure, pick up the tasks from the
>> failed engine, and re-distribute them or run them again.
>>
>> One simple way is to poll the DB to detect liveness by checking the
>> table populated by heat-manage. Each engine records its presence
>> periodically by updating the current timestamp. All the engines will
>> have a periodic task that checks the DB for the liveness of other
>> engines. Each engine will check the timestamps updated by other
>> engines, and if it finds one that is older than the timestamp update
>> interval, it detects a failure. When this happens, the remaining
>> engines, as and when they detect the failure, will try to acquire the
>> lock for in-progress resources that were handled by the engine which
>> died. They will then run the tasks to completion.
>
> Implementing our own locking system, even a "simple" one, sounds like a
> recipe for major bugs to me. I agree with your assessment that tooz is a
> better long-run decision.
>
>> Another option is to use a coordination library like the community-owned
>> tooz (http://docs.openstack.org/developer/tooz/), which supports
>> distributed locking and leader election. We use it to elect a leader
>> among the heat engines, and the leader will be responsible for running
>> periodic tasks that check the state of each engine and for distributing
>> the tasks to other engines when one fails. The advantage, IMHO, will be
>> simplified heat code. Also, we can move the timeout task to the leader,
>> which will run the timeout for all the stacks and send a signal to
>> abort the operation when a timeout happens. The downside: an external
>> resource like ZooKeeper/memcached etc. is needed for leader election.
>
> That's not necessarily true. For single-node installations (devstack,
> TripleO underclouds, etc) tooz offers file and IPC backends that don't
> need an extra service. Tooz's MySQL/PostgreSQL backends only provide
> distributed locking functionality, so we may need to depend on the
> memcached/redis/zookeeper backends for multi-node installs.
>

Definitely, for single-node installations one can rely on IPC as the
backend. By convention, defaulting to the IPC backend on a single node
would help when running heat in devstack or a development environment.
Looking at the bigger picture, I was referring to an external resource
because most deployments are multi-node with active-active HA.

> Even if tooz doesn't provide everything we need, I'm sure patches
> would be welcome.
>

I am sure that when we dive in, we will find use cases for tooz as well.

>> In the long run, IMO, using a library like tooz will be useful for heat.
>> A lot of the boilerplate needed for locking and running centralized
>> tasks (such as timeout) will not be needed in heat. Given that we are
>> moving towards distribution of tasks, and horizontal scaling is
>> preferred, it will be advantageous to use such libraries.
>>
>> Please share your thoughts.
>>
>> - Anant
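
To make the DB-polling option from the original proposal a bit more
concrete, here is a rough sketch of the idea. It is not heat code: the
EngineRegistry class, its method names, and the grace_factor parameter
are all hypothetical, and an in-memory dict stands in for the DB table.
The original text treats a timestamp as stale once it is older than the
update interval; the sketch assumes a small multiple of that interval
to tolerate a late heartbeat.

```python
import time
from dataclasses import dataclass, field


@dataclass
class EngineRegistry:
    """In-memory stand-in for the DB table of engine heartbeats (hypothetical)."""
    heartbeat_interval: float = 10.0   # seconds between timestamp updates
    grace_factor: float = 2.0          # assumption: tolerate one late heartbeat
    last_seen: dict = field(default_factory=dict)       # engine_id -> timestamp
    resource_owner: dict = field(default_factory=dict)  # resource_id -> engine_id

    def heartbeat(self, engine_id, now=None):
        """Record that engine_id is alive at time `now`."""
        self.last_seen[engine_id] = time.time() if now is None else now

    def dead_engines(self, now=None):
        """Return engines whose last heartbeat is older than the allowed limit."""
        now = time.time() if now is None else now
        limit = self.heartbeat_interval * self.grace_factor
        return sorted(e for e, ts in self.last_seen.items() if now - ts > limit)

    def try_acquire(self, resource_id, expected_owner, new_owner):
        """Compare-and-swap on the owner, so only one surviving engine wins."""
        if self.resource_owner.get(resource_id) != expected_owner:
            return False
        self.resource_owner[resource_id] = new_owner
        return True

    def recover(self, me, now=None):
        """Take over in-progress resources held by engines that look dead."""
        dead = set(self.dead_engines(now))
        dead.discard(me)  # an engine never declares itself dead
        taken = []
        for resource_id, owner in list(self.resource_owner.items()):
            if owner in dead and self.try_acquire(resource_id, owner, me):
                taken.append(resource_id)
        return sorted(taken)


registry = EngineRegistry(heartbeat_interval=10.0)
registry.heartbeat("engine-a", now=0.0)
registry.heartbeat("engine-b", now=0.0)
registry.resource_owner["server-1"] = "engine-a"

# engine-a stops reporting; engine-b keeps updating its timestamp.
registry.heartbeat("engine-b", now=30.0)
print(registry.dead_engines(now=30.0))
print(registry.recover("engine-b", now=30.0))
```

In a real deployment the compare-and-swap would have to be an atomic
DB update, which is exactly the kind of hand-rolled locking the thread
is wary of.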