[ https://issues.apache.org/jira/browse/SAMZA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shekhar Sharma reassigned SAMZA-2787:
-------------------------------------

    Assignee: Shekhar Sharma

> Add GetDeleted API to Blob Store backup and restore managers and recover from DeletedException
> ----------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-2787
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2787
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Shekhar Sharma
>            Assignee: Shekhar Sharma
>            Priority: Major
>
> Problem Statement:
> * Yarn can sometimes create orphaned containers. In our production systems,
> we noticed overlapping Samza containers running and committing at the same
> time.
> * If the stores are backed up to a blob store, the orphaned, overlapping
> container may delete a blob (which is common during delta state calculation
> in the commit lifecycle with the blob store backend). The other, non-orphaned
> container may expect this blob to be present.
> * This causes the container, and subsequently the job, to fail. The container
> fails with DeletedException, which is the blob store's response indicating
> that the blob was present but is now gone.
> Fix:
> * During commit, if a container fails with DeletedException, let the
> container fail and restart.
> * During the recovery phase of the restart, fetch the deleted blob with a
> get() call that sets the getDeleted flag. The flag indicates that the blob
> store should return the blob if it is marked for deletion but not yet
> compacted.
> * Recreate the blob by uploading it to the blob store afresh. Use the new
> blob id received to create a new checkpoint.
> * Write this new checkpoint to the checkpoint topic.
> * After this, and as long as the orphaned container is not cleaned up by Yarn,
> the container should be able to commit regularly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
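
The recovery steps in the Fix list can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, written in Python for brevity: `InMemoryBlobStore`, `recover_checkpoint`, the `get_deleted` parameter, and the dict-shaped checkpoint are all hypothetical names chosen for illustration, and none of them is Samza's actual Java API. The model assumes that a delete only marks a blob until compaction, which is the condition the issue relies on.

```python
import itertools


class DeletedException(Exception):
    """Hypothetical stand-in: the blob existed but is marked as deleted."""


class InMemoryBlobStore:
    """Toy blob store with soft deletes (assumption: no compaction has run)."""

    def __init__(self):
        self._blobs = {}  # blob_id -> [data, deleted_flag]
        self._ids = itertools.count(1)

    def put(self, data):
        blob_id = f"blob-{next(self._ids)}"
        self._blobs[blob_id] = [data, False]
        return blob_id

    def delete(self, blob_id):
        # Mark for deletion only; the bytes stay until compaction.
        self._blobs[blob_id][1] = True

    def get(self, blob_id, get_deleted=False):
        data, deleted = self._blobs[blob_id]
        if deleted and not get_deleted:
            raise DeletedException(blob_id)
        return data


def recover_checkpoint(store, checkpoint, checkpoint_topic):
    """Re-upload any deleted blobs and publish a repaired checkpoint.

    `checkpoint` maps a store name to a blob id; `checkpoint_topic` is a
    plain list standing in for the checkpoint topic.
    """
    recovered = {}
    for name, blob_id in checkpoint.items():
        try:
            store.get(blob_id)
            recovered[name] = blob_id  # blob is intact, keep its id
        except DeletedException:
            # Fetch the soft-deleted blob, then upload it afresh.
            data = store.get(blob_id, get_deleted=True)
            recovered[name] = store.put(data)
    if recovered != checkpoint:
        # Only write a new checkpoint when at least one blob id changed.
        checkpoint_topic.append(recovered)
    return recovered
```

In this model, a restarted container would call `recover_checkpoint` once with its last checkpoint before resuming commits. Blobs that are still intact keep their ids, so the new checkpoint differs from the old one only where a blob had to be recreated.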