[
https://issues.apache.org/jira/browse/SAMZA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shekhar Sharma reassigned SAMZA-2787:
-------------------------------------
Assignee: Shekhar Sharma
> Add GetDeleted API to Blob Store backup and restore managers and recover from
> DeletedException
> ----------------------------------------------------------------------------------------------
>
> Key: SAMZA-2787
> URL: https://issues.apache.org/jira/browse/SAMZA-2787
> Project: Samza
> Issue Type: Improvement
> Reporter: Shekhar Sharma
> Assignee: Shekhar Sharma
> Priority: Major
>
> Problem Statement:
> * Yarn can sometimes create orphaned containers. In our production systems,
> we noticed that there were overlapping Samza containers running/committing at
> the same time.
> * If the stores are backed up to a blob store, the orphaned, overlapping
> container may delete a blob (deletes are common during delta state
> calculation in the commit lifecycle with the blob store backend). The other,
> non-orphaned container may still expect this blob to be present.
> * This causes the container, and subsequently the job, to fail. The
> container fails with DeletedException, which is the blob store's response
> indicating that the blob was present but has since been deleted.
> Fix:
> * During commit, if a container fails with DeletedException, let the
> container fail/restart.
> * During the recovery phase of the restart, fetch the deleted blob using a
> get() call with the getDeleted flag set. The flag tells the blob store to
> return a blob that is marked for deletion but not yet compacted.
> * Recreate the blob by uploading it to the blob store afresh. Use the new
> blob id returned by the upload to create a new checkpoint.
> * Write this new checkpoint to the checkpoint topic.
> * After this, and as long as the orphaned container is not cleaned up by
> Yarn, the container should be able to commit regularly.
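
The recovery steps in the Fix list above can be sketched roughly as follows.
This is a minimal, self-contained illustration and not Samza's actual code:
InMemoryBlobStore, its put/delete/compact/get methods, recoverBlob, and the
local DeletedException class are all hypothetical stand-ins. The only things
taken from the issue are the idea of a get() call with a getDeleted flag and
the DeletedException failure mode. The sketch assumes the blob store
soft-deletes blobs and only drops them for good on compaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types are Samza's real classes.
final class GetDeletedRecoverySketch {

    /** Stand-in for the blob store's "blob was present but is gone now" response. */
    static final class DeletedException extends RuntimeException {
        DeletedException(String blobId) {
            super("Blob was deleted: " + blobId);
        }
    }

    /** In-memory store with soft deletes: deleted blobs stay readable until compact(). */
    static final class InMemoryBlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();
        private final Set<String> markedDeleted = new HashSet<>();
        private int nextId = 0;

        /** Uploads a blob and returns its newly assigned id. */
        String put(byte[] data) {
            String id = "blob-" + (nextId++);
            blobs.put(id, data.clone());
            return id;
        }

        /** Marks a blob for deletion without removing its bytes. */
        void delete(String id) {
            if (blobs.containsKey(id)) {
                markedDeleted.add(id);
            }
        }

        /** Permanently drops every blob that was marked for deletion. */
        void compact() {
            for (String id : markedDeleted) {
                blobs.remove(id);
            }
            markedDeleted.clear();
        }

        /** Reads a blob; a soft-deleted blob is returned only when getDeleted is true. */
        byte[] get(String id, boolean getDeleted) {
            byte[] data = blobs.get(id);
            if (data == null) {
                throw new IllegalStateException("Blob not found or already compacted: " + id);
            }
            if (markedDeleted.contains(id) && !getDeleted) {
                throw new DeletedException(id);
            }
            return data.clone();
        }
    }

    /**
     * Returns the blob id that the next checkpoint should reference. If the
     * checkpointed blob was soft-deleted, it is fetched with getDeleted=true
     * and uploaded again, and the id of the fresh copy is returned.
     */
    static String recoverBlob(InMemoryBlobStore store, String checkpointedBlobId) {
        try {
            store.get(checkpointedBlobId, false);
            return checkpointedBlobId; // blob is intact, keep the existing id
        } catch (DeletedException e) {
            byte[] data = store.get(checkpointedBlobId, true);
            return store.put(data); // fresh upload yields a new blob id
        }
    }

    public static void main(String[] args) {
        InMemoryBlobStore store = new InMemoryBlobStore();
        String oldId = store.put("state-file-contents".getBytes());
        store.delete(oldId); // simulates the orphaned container's delete
        String newId = recoverBlob(store, oldId);
        System.out.println("checkpoint should now reference " + newId + " instead of " + oldId);
    }
}
```

In this sketch the caller would place the returned id into a new checkpoint
and write that checkpoint to the checkpoint topic. If the deleted blob has
already been compacted, the second get() fails and recovery is not possible
through this path.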
--
This message was sent by Atlassian Jira
(v8.20.10#820010)