Hi all,

Convergence-POC distributes stack operations by sending resource actions over
RPC for any heat-engine to execute. The entire stack lifecycle will be
controlled by worker/observer notifications. This distributed model has its own
advantages and disadvantages.

Any stack operation has a timeout, and a single engine is responsible for it.
If that engine goes down, the timeout is lost along with it. The traditional
fix is for other engines to recreate the timeout from scratch. Also, a missed
resource action notification will be detected only when the stack operation
times out.

To overcome this, we will need the following capabilities:

1. Resource timeout (can be used for retry)

2. Recover from engine failure (loss of stack timeout, resource action
   notification)
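
To make the first capability concrete, here is a minimal sketch of per-resource
timeouts with a retry budget. It is not Heat or Convergence-POC code: the class
name ResourceTimeouts, its methods, and the injectable clock are all
hypothetical, chosen only to illustrate the idea that an action whose
notification never arrives can be retried well before the stack timeout.

```python
import time


class ResourceTimeouts:
    """Tracks a deadline and a retry count per in-flight resource action.

    Hypothetical helper for illustration; not part of Heat.
    """

    def __init__(self, timeout_secs, max_retries, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.max_retries = max_retries
        self.clock = clock
        self._pending = {}  # resource_id -> (deadline, retries_used)

    def started(self, resource_id):
        """Record that an action was dispatched for this resource."""
        self._pending[resource_id] = (self.clock() + self.timeout_secs, 0)

    def notified(self, resource_id):
        """The worker/observer notification arrived, so stop tracking."""
        self._pending.pop(resource_id, None)

    def check(self):
        """Return (to_retry, failed) for actions whose deadline has passed."""
        now = self.clock()
        to_retry, failed = [], []
        for rid, (deadline, retries) in list(self._pending.items()):
            if now < deadline:
                continue
            if retries < self.max_retries:
                # Give the action a fresh deadline and ask the caller to resend it.
                self._pending[rid] = (now + self.timeout_secs, retries + 1)
                to_retry.append(rid)
            else:
                # Retry budget exhausted: report the resource as failed.
                del self._pending[rid]
                failed.append(rid)
        return to_retry, failed
```

A caller would invoke check() periodically, resend the actions in to_retry and
mark the resources in failed. Where that periodic check runs is exactly the
open question of this mail.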


Possible approaches:

1. Use a task queue like Celery to host timeouts for both stack and resource.

2. Poll the database for engine failures and restart timers / retrigger
   resource retries (IMHO this would be the traditional approach and is
   heavyweight).

3. Migrate Heat to use TaskFlow (too many code changes).
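
For comparison, the database-polling approach could look roughly like the
sketch below. Everything in it is an assumption made for illustration: the
engines and stack_ops tables, their columns, the heartbeat grace period, and
the use of an in-memory SQLite database are not Heat's schema. It also assumes
the absolute deadline is stored, so that a surviving engine could restart the
timer with the remaining time instead of from scratch.

```python
import sqlite3


def orphaned_stack_ops(conn, now, heartbeat_grace):
    """Return (stack_id, remaining_secs) for operations whose engine looks dead.

    An engine is treated as dead when its last heartbeat is older than
    `heartbeat_grace` seconds. Table and column names are hypothetical.
    """
    rows = conn.execute(
        """
        SELECT s.stack_id, s.deadline
        FROM stack_ops AS s
        JOIN engines AS e ON e.engine_id = s.engine_id
        WHERE e.last_heartbeat < ?
        ORDER BY s.stack_id
        """,
        (now - heartbeat_grace,),
    ).fetchall()
    # Remaining time is clamped at zero for operations that are already overdue.
    return [(stack_id, max(0, deadline - now)) for stack_id, deadline in rows]


# Tiny in-memory example: engine-1 stopped sending heartbeats long ago.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE engines (engine_id TEXT PRIMARY KEY, last_heartbeat INTEGER);
    CREATE TABLE stack_ops (stack_id TEXT PRIMARY KEY, engine_id TEXT,
                            deadline INTEGER);
    """
)
conn.executemany(
    "INSERT INTO engines VALUES (?, ?)",
    [("engine-1", 100), ("engine-2", 995)],
)
conn.executemany(
    "INSERT INTO stack_ops VALUES (?, ?, ?)",
    [
        ("stack-a", "engine-1", 1600),
        ("stack-b", "engine-2", 1200),
        ("stack-c", "engine-1", 900),
    ],
)

orphans = orphaned_stack_ops(conn, now=1000, heartbeat_grace=30)
```

Each engine would have to run such a query on a timer and then agree on who
takes over each orphaned operation, which is part of why this approach weighs
heavy.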

I am not suggesting we use TaskFlow. Using Celery would need only minimal code
changes (decorating the appropriate functions).

Your thoughts?

IRC: ckmvishnu
OpenStack-dev mailing list
