[jira] [Assigned] (SAMZA-2787) Add GetDeleted API to Blob Store backup and restore managers and recover from DeletedException

Shekhar Sharma (Jira) Wed, 26 Jul 2023 14:12:07 -0700


     [ 
https://issues.apache.org/jira/browse/SAMZA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shekhar Sharma reassigned SAMZA-2787:
-------------------------------------

    Assignee: Shekhar Sharma

> Add GetDeleted API to Blob Store backup and restore managers and recover from 
> DeletedException
> ----------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-2787
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2787
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Shekhar Sharma
>            Assignee: Shekhar Sharma
>            Priority: Major
>
> Problem Statement:
>  * YARN can sometimes create orphaned containers. In our production systems, 
> we observed overlapping Samza containers running and committing at the same 
> time.
>  * If the stores are backed up to a blob store, the orphaned, overlapping 
> container may delete a blob (deletions are common during delta state 
> calculation in the commit lifecycle with the blob store backend) that the 
> other, non-orphaned container still expects to be present.
>  * This causes the container, and subsequently the job, to fail. The 
> container fails with DeletedException, which is the blob store's response 
> indicating that the blob was present but is now gone.
> Fix:
>  * During commit, if a container fails with DeletedException, let the 
> container fail and restart.
>  * During the recovery phase of the restart, fetch the deleted blob using a 
> get() call with the getDeleted flag set. The flag tells the blob store to 
> return a blob that is marked for deletion but not yet compacted.
>  * Recreate the blob by uploading it to the blob store afresh. Use the new 
> blob id returned by the upload to create a new checkpoint.
>  * Write this new checkpoint to the checkpoint topic.
>  * After this, and as long as the orphaned container is not cleaned up by 
> YARN, the container should be able to commit regularly.
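
The recovery steps in the Fix list can be sketched roughly as follows. This is a minimal in-memory illustration, not Samza's actual implementation: InMemoryBlobStore, its put/delete/compact/get methods, the local DeletedException class, and BlobRecoverySketch.recover are all hypothetical names made up for this example. It only assumes what the description states, namely that a get with the getDeleted flag returns a blob that is marked for deletion but not yet compacted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical stand-in for the blob store's "blob was present but is gone" error.
class DeletedException extends RuntimeException {
    DeletedException(String blobId) {
        super("Blob was deleted: " + blobId);
    }
}

// Hypothetical in-memory blob store with soft deletes and compaction.
class InMemoryBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Set<String> markedForDeletion = new HashSet<>();
    private int nextId = 0;

    // Uploads a blob and returns its newly assigned id.
    String put(byte[] data) {
        String id = "blob-" + (nextId++);
        blobs.put(id, data.clone());
        return id;
    }

    // Soft delete: the blob is only marked, its bytes stay until compaction.
    void delete(String id) {
        if (blobs.containsKey(id)) {
            markedForDeletion.add(id);
        }
    }

    // Compaction permanently drops every blob marked for deletion.
    void compact() {
        for (String id : markedForDeletion) {
            blobs.remove(id);
        }
        markedForDeletion.clear();
    }

    // With getDeleted=false a soft-deleted blob raises DeletedException;
    // with getDeleted=true it is still returned until it has been compacted.
    byte[] get(String id, boolean getDeleted) {
        byte[] data = blobs.get(id);
        if (data == null) {
            throw new NoSuchElementException("Unknown or compacted blob: " + id);
        }
        if (markedForDeletion.contains(id) && !getDeleted) {
            throw new DeletedException(id);
        }
        return data.clone();
    }
}

class BlobRecoverySketch {
    // Builds a repaired checkpoint (store name -> blob id). Any blob that was
    // soft-deleted is fetched with getDeleted=true and uploaded again, and the
    // checkpoint entry is pointed at the new blob id.
    static Map<String, String> recover(InMemoryBlobStore store,
                                       Map<String, String> checkpoint) {
        Map<String, String> repaired = new HashMap<>();
        for (Map.Entry<String, String> entry : checkpoint.entrySet()) {
            String blobId = entry.getValue();
            try {
                store.get(blobId, false);            // normal read succeeds
                repaired.put(entry.getKey(), blobId);
            } catch (DeletedException e) {
                byte[] data = store.get(blobId, true); // read the soft-deleted blob
                repaired.put(entry.getKey(), store.put(data));
            }
        }
        return repaired;
    }
}
```

In this sketch the caller would then write the map returned by recover() to the checkpoint topic, which corresponds to the fourth step in the list. If the deleted blob has already been compacted, the getDeleted read fails as well and this approach cannot recover it.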



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
