/cc dev@flink

On Tue, Apr 20, 2021 at 1:29 AM Sonam Mandal <soman...@linkedin.com> wrote:

> Hello,
>
> We've been experimenting with task-local recovery on Kubernetes. We
> have a way to mount the same disk across Task Manager
> restarts/deletions, so the local state survives when the pods get
> recreated. In this scenario, we noticed that task-local recovery does
> not kick in (which is expected based on the documentation).
>
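> For reference, the setup is roughly along these lines (the keys are
> real Flink options, but the paths, names, and manifest snippet are
> illustrative rather than our exact configs). In flink-conf.yaml:
>
>     state.backend.local-recovery: true
>     taskmanager.state.local.root-dirs: /local-state
>
> and the Task Manager pod spec mounts a volume that outlives the pod at
> that path:
>
>     volumes:
>       - name: local-state
>         persistentVolumeClaim:
>           claimName: tm-local-state   # illustrative claim name
>     containers:
>       - name: taskmanager
>         volumeMounts:
>           - name: local-state
>             mountPath: /local-state
>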
> We did try commenting out the code on the shutdown path that cleans up
> the task-local directories before the pod goes down or is restarted.
> Remote recovery still kicked in even though the task-local state was
> present. I noticed that the slot IDs changed, and I'm wondering whether
> that is the main reason the task-local state didn't get used in this
> scenario.
>
> Since we're using this shared disk to store the local state across pod
> failures, would it make sense to allow keeping the task-local state so
> that we can get faster recovery even in situations where the Task
> Manager itself dies? In some sense the storage here is disaggregated
> from the pods, so it could still benefit from task-local recovery. Is
> there any reason this is a bad idea in general?
>
> Is there a way to preserve the slot IDs across restarts? We set up the
> Task Manager to pin the resource-id, but that didn't seem to help. My
> understanding is that the slot ID needs to be reused for task-local
> recovery to kick in.
>
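> (For context, the resource-id pinning we tried is along these lines in
> flink-conf.yaml; the value shown is illustrative, e.g. derived from a
> StatefulSet pod name so that it stays stable across restarts:
>
>     taskmanager.resource-id: flink-taskmanager-0
>
> but even with a stable resource ID, the slot IDs still changed after
> the restart.)
>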
> Thanks,
> Sonam
