Hi,

One of the remaining items in convergence is detecting and handling failures of an engine (the engine worker). Here are my thoughts.

Background: Since the work is distributed among heat engines, heat needs some way to detect an engine failure, pick up the tasks from the failed engine, and either redistribute them or run them again.

One simple way is to poll the DB and detect liveness by checking the table populated by heat-manage. Each engine records its presence periodically by updating its current timestamp. Every engine also has a periodic task that checks the DB for the liveness of the other engines: it looks at the timestamps updated by the other engines, and if it finds one that is older than the update period, it treats that engine as failed. When this happens, the remaining engines, as and when they detect the failure, try to acquire the locks for the in-progress resources that were being handled by the dead engine. They then run those tasks to completion.

Another option is to use a coordination library like the community-owned tooz (http://docs.openstack.org/developer/tooz/), which supports distributed locking and leader election. We would use it to elect a leader among the heat engines, and the leader would be responsible for running the periodic tasks that check the state of each engine and for distributing tasks to other engines when one fails. The advantage, IMHO, would be simpler heat code. We could also move the timeout task to the leader, which would run the timeout for all stacks and send a signal to abort the operation when a timeout happens. The downside: an external resource such as ZooKeeper or memcached is needed for leader election.

In the long run, IMO, using a library like tooz will be useful for heat. A lot of the boilerplate needed for locking and for running centralized tasks (such as timeouts) would no longer be needed in heat. Given that we are moving towards distributing tasks and that horizontal scaling is preferred, it would be advantageous to use such a library.

Please share your thoughts.
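
To make the DB-polling option concrete, here is a rough sketch of the liveness check and the lock takeover. It is only an illustration, not existing heat code: the function names, the 10-second interval, and the grace factor are all assumptions of mine, and plain dicts stand in for the DB tables.

```python
import time

# Hypothetical values; nothing in heat defines these today.
HEARTBEAT_INTERVAL = 10.0  # seconds between timestamp updates
GRACE_FACTOR = 2           # assumed slack: tolerate one late update


def record_heartbeat(table, engine_id, now=None):
    """Record the engine's presence (stands in for a DB UPDATE)."""
    table[engine_id] = time.time() if now is None else now


def find_dead_engines(table, self_id, now=None):
    """Return ids of the other engines whose last timestamp is too old."""
    now = time.time() if now is None else now
    cutoff = now - HEARTBEAT_INTERVAL * GRACE_FACTOR
    return sorted(
        engine_id
        for engine_id, stamp in table.items()
        if engine_id != self_id and stamp < cutoff
    )


def try_take_over(locks, resource_id, dead_id, new_id):
    """Take the lock only if the dead engine still owns it.

    In a real DB this would have to be one atomic statement
    (UPDATE ... WHERE engine_id = dead_id) so that only one of the
    surviving engines wins; a dict gives no such guarantee.
    """
    if locks.get(resource_id) == dead_id:
        locks[resource_id] = new_id
        return True
    return False
```

Each engine would call record_heartbeat() and find_dead_engines() from its periodic task, then call try_take_over() for every in-progress resource owned by an engine it found dead. The compare-then-set on the lock owner is what keeps two survivors from both picking up the same resource.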

- Anant

OpenStack Development Mailing List (not for usage questions)