prateekm opened a new pull request #1570:
URL: https://github.com/apache/samza/pull/1570
Symptom: Container can deadlock during state restore on startup.
Cause: ContainerStorageManager kicks off TaskRestoreManager#restore on the
restoreExecutor and blocks there until the restore completes. TaskRestoreManagers
can also restore state asynchronously using the same restoreExecutor. If a
TaskRestoreManager schedules additional asynchronous work on the restoreExecutor
and blocks (Future#get or CompletableFuture#join) until it completes, the
container can deadlock when num restore executor threads <= num tasks: every
thread in the restoreExecutor is occupied by a TaskRestoreCallable waiting for
its restore to finish, so the asynchronous work is never executed. The
workaround of keeping num threads > num tasks can be inefficient for containers
with a large number of tasks.
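The starvation pattern described above can be reproduced outside Samza with a
plain fixed-size thread pool. The following is a minimal sketch, not Samza
code: the class RestoreDeadlockSketch, the method restoreCompletes, and the
latch used to make the outcome deterministic are all hypothetical names
introduced for illustration. Each outer callable stands in for a
TaskRestoreCallable that schedules follow-up work on the same executor and
then blocks on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RestoreDeadlockSketch {

  /**
   * Returns true if every outer "restore" callable finishes within timeoutMs,
   * false if the pool is starved. All names here are illustrative assumptions.
   */
  public static boolean restoreCompletes(int numThreads, int numTasks, long timeoutMs)
      throws Exception {
    ExecutorService restoreExecutor = Executors.newFixedThreadPool(numThreads);
    // Holds the outer callables until all of them are submitted, so the
    // outcome does not depend on thread start-up timing.
    CountDownLatch allSubmitted = new CountDownLatch(1);
    List<Future<Void>> outer = new ArrayList<>();
    try {
      for (int i = 0; i < numTasks; i++) {
        Callable<Void> taskRestoreCallable = () -> {
          allSubmitted.await();
          // Schedule more work on the SAME executor, then block on it.
          Future<?> inner = restoreExecutor.submit(() -> { });
          inner.get();
          return null;
        };
        outer.add(restoreExecutor.submit(taskRestoreCallable));
      }
      allSubmitted.countDown();

      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      for (Future<Void> f : outer) {
        long remaining = Math.max(deadline - System.nanoTime(), 1);
        try {
          f.get(remaining, TimeUnit.NANOSECONDS);
        } catch (TimeoutException e) {
          return false; // every pool thread is blocked waiting on queued work
        }
      }
      return true;
    } finally {
      // Interrupts the blocked threads so the JVM can exit.
      restoreExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("2 threads, 2 tasks completes: " + restoreCompletes(2, 2, 500));
    System.out.println("3 threads, 2 tasks completes: " + restoreCompletes(3, 2, 500));
  }
}
```

With num threads <= num tasks the inner work sits in the queue behind callables
that are themselves waiting for it, so the timeout is hit. With one spare
thread the inner work can run and the callables finish.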
Changes:
a) Made TaskRestoreManager#restore return a future instead of blocking. Note
that restore managers must still take care not to block on futures scheduled
on the restore executor. This PR only ensures that the interface no longer
forces them to block.
b) Made ContainerStorageManager block on completion of the restore future on
the main thread instead of on the restoreExecutor.
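The shape of changes (a) and (b) can be sketched as follows. This is an
illustration under assumptions, not the actual Samza interface:
RestoreManagerSketch, restoreAll, and the use of CompletableFuture as the
returned future type are hypothetical stand-ins. The restore work is chained
with non-blocking stages on the executor, and only the calling (main) thread
waits for completion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncRestoreSketch {

  /** Hypothetical stand-in for a restore manager; not Samza's real interface. */
  interface RestoreManagerSketch {
    CompletableFuture<Void> restore(ExecutorService restoreExecutor);
  }

  /** Restores numTasks tasks on numThreads threads; returns how many finished. */
  public static int restoreAll(int numThreads, int numTasks, long timeoutMs)
      throws Exception {
    ExecutorService restoreExecutor = Executors.newFixedThreadPool(numThreads);
    AtomicInteger restored = new AtomicInteger();
    try {
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      for (int i = 0; i < numTasks; i++) {
        // Two chained stages on the same executor; neither stage blocks
        // a pool thread while waiting for the other.
        RestoreManagerSketch manager = executor ->
            CompletableFuture.runAsync(() -> { /* first restore step */ }, executor)
                .thenRunAsync(restored::incrementAndGet, executor);
        futures.add(manager.restore(restoreExecutor));
      }
      // Only the caller's thread waits; pool threads stay free to do the work.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
          .get(timeoutMs, TimeUnit.MILLISECONDS);
      return restored.get();
    } finally {
      restoreExecutor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("restored tasks: " + restoreAll(1, 4, 5000));
  }
}
```

Because no stage calls Future#get or CompletableFuture#join on a pool thread,
this sketch completes even with a single executor thread and several tasks.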
Tests: Verified manually that ContainerStorageManager does not block on
restore executor now and that there is no deadlock if num threads <= num tasks.