Shekhar Sharma created SAMZA-2787:
-------------------------------------
Summary: Add GetDeleted API to Blob Store backup and restore
managers and recover from DeletedException
Key: SAMZA-2787
URL: https://issues.apache.org/jira/browse/SAMZA-2787
Project: Samza
Issue Type: Improvement
Reporter: Shekhar Sharma
Problem Statement:
* YARN can sometimes create orphaned containers. In our production systems, we
noticed overlapping Samza containers running and committing at the same time.
* If the stores are backed up to a blob store, the orphaned, overlapping
container may delete a blob (which is common during delta state calculation in
the commit lifecycle with the blob store backend) that the other, non-orphaned
container still expects to be present.
* This causes the container, and subsequently the job, to fail. The container
fails with DeletedException, which is the blob store's response indicating that
the blob was present earlier but has since been deleted.
Fix:
* During commit, if a container fails with DeletedException, let the container
fail/restart.
* During the recovery phase of the restart, fetch the deleted blob with a get()
call that sets the getDeleted flag. The flag tells the blob store to return a
blob that is marked for deletion but not yet compacted.
* Recreate the blob by uploading it to the blob store afresh. Use the new blob
id returned by the upload to create a new checkpoint.
* Write this new checkpoint to the checkpoint topic.
* After this, and as long as the orphaned container is not cleaned up by YARN,
the container should be able to commit regularly.
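
The recovery steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions: InMemoryBlobStore, its get(id,
getDeleted) signature, the local DeletedException class, and the list standing
in for the checkpoint topic are all hypothetical and are not Samza's or any
blob store's actual API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; none of these types are Samza classes.
final class BlobRecoverySketch {

    /** Stand-in for the blob store's "was present, deleted now" response. */
    static final class DeletedException extends RuntimeException {
        DeletedException(String blobId) {
            super("Blob was deleted: " + blobId);
        }
    }

    /** Toy blob store with soft deletes (marked for deletion, never compacted). */
    static final class InMemoryBlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();
        private final Set<String> markedDeleted = new HashSet<>();
        private int nextId = 0;

        String put(byte[] data) {
            String id = "blob-" + nextId++;
            blobs.put(id, data.clone());
            return id;
        }

        void delete(String id) {
            if (blobs.containsKey(id)) {
                markedDeleted.add(id);
            }
        }

        /** Returns a soft-deleted blob only when getDeleted is true. */
        byte[] get(String id, boolean getDeleted) {
            byte[] data = blobs.get(id);
            if (data == null) {
                throw new IllegalArgumentException("Unknown blob: " + id);
            }
            if (markedDeleted.contains(id) && !getDeleted) {
                throw new DeletedException(id);
            }
            return data.clone();
        }
    }

    /**
     * Recovery after a restart: if the checkpointed blob was deleted, read it
     * with getDeleted = true, upload it afresh, and record the new blob id as
     * a new checkpoint. Returns the blob id the container should use.
     */
    static String recover(InMemoryBlobStore store,
                          List<String> checkpointTopic,
                          String checkpointedBlobId) {
        try {
            store.get(checkpointedBlobId, false);
            return checkpointedBlobId; // blob is still live, nothing to repair
        } catch (DeletedException e) {
            byte[] data = store.get(checkpointedBlobId, true);
            String newBlobId = store.put(data);
            checkpointTopic.add(newBlobId); // stand-in for the checkpoint write
            return newBlobId;
        }
    }
}
```

In this sketch the recreated blob gets a fresh id, so a later delete of the old
id by the orphaned container no longer affects the restarted container's
checkpoint.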