/cc dev@flink
On Tue, Apr 20, 2021 at 1:29 AM Sonam Mandal <soman...@linkedin.com> wrote:
> Hello,
>
> We've been experimenting with task-local recovery using Kubernetes. We
> have a way to mount the same disk across Task Manager restarts/deletions
> when the pods get recreated. In this scenario, we noticed that task-local
> recovery does not kick in (as expected based on the documentation).
>
> We did try commenting out the code on the shutdown path that cleaned up
> the task-local directories before the pod went down / was restarted. We
> noticed that remote recovery kicked in even though the task-local state
> was present. I noticed that the slot IDs changed, and was wondering if
> this is the main reason the task-local state didn't get used in this
> scenario?
>
> Since we're using this shared disk to store the local state across pod
> failures, would it make sense to allow keeping the task-local state so
> that we can get faster recovery even in situations where the Task Manager
> itself dies? In some sense, the storage here is disaggregated from the
> pods and can potentially benefit from task-local recovery. Any reason why
> this is a bad idea in general?
>
> Is there a way to preserve the slot IDs across restarts? We set up the
> Task Manager to pin the resource-id, but that didn't seem to help. My
> understanding is that the slot ID needs to be reused for task-local
> recovery to kick in.
>
> Thanks,
> Sonam
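For reference, the settings discussed above live in flink-conf.yaml. A minimal sketch of the setup being described (the directory path and resource-id value are placeholders for your own deployment; as noted above, pinning the resource-id alone did not appear to preserve the slot IDs):

```yaml
# Enable task-local recovery so checkpoint state is also kept on local disk.
state.backend.local-recovery: true

# Root directory for the local state copies; point this at the volume that
# survives pod restarts (path is a placeholder for your shared mount).
taskmanager.state.local.root-dirs: /mnt/shared/flink-local-state

# Pin the Task Manager's resource ID across restarts (value is a placeholder).
taskmanager.resource-id: tm-0
```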