prateekm opened a new pull request #1570:
URL: https://github.com/apache/samza/pull/1570


   Symptom: Container can deadlock during state restore on startup.
    
   Cause: ContainerStorageManager kicks off TaskRestoreManager#restore on the
restoreExecutor and blocks there until the restore completes. TaskRestoreManagers
can also restore state asynchronously using the same restoreExecutor. If a
TaskRestoreManager schedules additional asynchronous tasks on the
restoreExecutor and blocks on them (Future#get or CompletableFuture#join), the
container can deadlock when num restore executor threads <= num tasks. In that
case every thread in restoreExecutor is occupied by a TaskRestoreCallable that
is waiting for its restore to finish, so the queued asynchronous work is never
executed. The workaround of keeping num threads > num tasks can be inefficient
for containers with a large number of tasks.
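
   The starvation pattern described above can be reproduced outside Samza with a
plain fixed-size thread pool. The following is only a minimal sketch, not the
actual ContainerStorageManager code: the class RestoreDeadlockSketch and the
method restoreCompletes are hypothetical names, and the timeout exists only so
the demonstration can detect the deadlock and terminate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RestoreDeadlockSketch {

  // Hypothetical helper: returns true if every simulated "restore" finishes
  // within timeoutMs, false if the pool appears to be deadlocked.
  static boolean restoreCompletes(int numThreads, int numTasks, long timeoutMs)
      throws Exception {
    ExecutorService restoreExecutor = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> outer = new ArrayList<>();
      for (int i = 0; i < numTasks; i++) {
        // Stand-in for a per-task restore callable running on the executor.
        outer.add(restoreExecutor.submit(() -> {
          // The "restore" schedules more work on the SAME executor ...
          Future<?> inner = restoreExecutor.submit(() -> { });
          // ... and blocks a pool thread until that work completes.
          inner.get();
          return null;
        }));
      }
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      for (Future<?> f : outer) {
        long remaining = Math.max(deadline - System.nanoTime(), 0);
        f.get(remaining, TimeUnit.NANOSECONDS);
      }
      return true;
    } catch (TimeoutException e) {
      // All pool threads are blocked in inner.get(); the inner work is stuck
      // in the queue behind them.
      return false;
    } finally {
      // Interrupts the blocked pool threads so the JVM can exit.
      restoreExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("threads > tasks completes:  " + restoreCompletes(3, 2, 2000));
    System.out.println("threads == tasks completes: " + restoreCompletes(2, 2, 500));
  }
}
```

   In this sketch, two threads with two tasks should time out because both pool
threads are parked in inner.get(), while three threads with two tasks should
finish because one thread stays free to run the inner work.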
    
   Changes: 
   a) Made TaskRestoreManager#restore return a future instead of blocking. Note
that restore managers must still take care not to block on the completion of
futures scheduled on the restore executor. This PR only ensures that the
interface no longer forces them to.
   b) Made ContainerStorageManager block on completion of the restore future on
the main thread instead of on the restoreExecutor.
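
   The shape of changes (a) and (b) can be sketched as follows. This is an
illustration under assumptions, not the code in this PR: the RestoreManager
interface, the AsyncRestoreSketch class, and the restoreAll method are
hypothetical stand-ins, and the two empty "steps" stand for whatever
asynchronous work a real restore manager would schedule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncRestoreSketch {

  // Hypothetical stand-in for a restore manager whose restore() returns a
  // future instead of blocking the calling thread.
  interface RestoreManager {
    CompletableFuture<Void> restore();
  }

  // Returns true if all restores finish within timeoutMs.
  static boolean restoreAll(int numThreads, int numTasks, long timeoutMs)
      throws Exception {
    ExecutorService restoreExecutor = Executors.newFixedThreadPool(numThreads);
    try {
      List<CompletableFuture<Void>> restores = new ArrayList<>();
      for (int i = 0; i < numTasks; i++) {
        // Follow-up work is chained with thenRunAsync rather than waited on
        // with get()/join(), so no executor thread ever blocks.
        RestoreManager manager = () ->
            CompletableFuture
                .runAsync(() -> { /* first restore step */ }, restoreExecutor)
                .thenRunAsync(() -> { /* follow-up step */ }, restoreExecutor);
        restores.add(manager.restore());
      }
      // The caller (here, the main thread) is the only place that blocks.
      CompletableFuture.allOf(restores.toArray(new CompletableFuture[0]))
          .get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } finally {
      restoreExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Fewer threads than tasks no longer starves the pool in this sketch.
    System.out.println("1 thread, 4 tasks completes: " + restoreAll(1, 4, 2000));
  }
}
```

   Because the only blocking wait happens outside the pool, the thread count
can be chosen independently of the task count, which is the property the
changes above aim for.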
    
   Tests: Verified manually that ContainerStorageManager no longer blocks on the
restore executor and that there is no deadlock when num threads <= num tasks.



