shekhars-li opened a new pull request #1501:
URL: https://github.com/apache/samza/pull/1501


   Problem: 
   State restore for Samza jobs with large state can take a long time, 
sometimes multiple hours, because the Kafka-backed changelog is restored one 
message at a time. This PR introduces blob store based backup and restore for 
Samza, which can back up and restore in parallel and so addresses this 
problem.
   
   Solution:
   1. Add blob store under samza-core/storage: supports blob store based 
backup and restore. 
       1. index - Index classes define a way of accessing a remote file or 
subdirectory and the metadata associated with it.
            1. FileBlob -> Representation of a blob in the blob store, with a 
blob id and offset.
            2. FileIndex -> Representation of a file in the blob store as a set 
of FileBlobs and its metadata, such as file name, size and permissions. 
            3. DirIndex -> Representation of a directory (and its 
sub-directories) in the blob store bucket, and metadata such as directory name.
            4. SnapshotIndex -> Representation of a snapshot directory and 
metadata such as job name, job id and store name.
       2. DirDiff - The DirDiff class represents the diff/delta between a local 
snapshot and a remote snapshot. A corresponding util class, DirDiffUtil, is 
used to calculate the delta between local and remote snapshots. 
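
   To illustrate the idea behind the index and DirDiff classes, here is a 
minimal sketch of computing a delta between a local and a remote snapshot. It 
is not the code in this PR: FileEntrySketch and DirDiffSketch are hypothetical 
names, a snapshot is reduced to a flat map of file name to entry, and 
comparing files by size and checksum is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a file entry in a snapshot index.
final class FileEntrySketch {
    final String name;
    final long size;
    final long checksum;

    FileEntrySketch(String name, long size, long checksum) {
        this.name = name;
        this.size = size;
        this.checksum = checksum;
    }

    // Assumption: two entries hold the same content if size and checksum match.
    boolean sameContentAs(FileEntrySketch other) {
        return size == other.size && checksum == other.checksum;
    }
}

// Hypothetical, flat version of a directory diff (no sub-directories).
final class DirDiffSketch {
    final List<String> toUpload = new ArrayList<>(); // new or changed locally
    final List<String> toDelete = new ArrayList<>(); // only in the remote snapshot
    final List<String> toRetain = new ArrayList<>(); // unchanged, no upload needed

    static DirDiffSketch between(Map<String, FileEntrySketch> local,
                                 Map<String, FileEntrySketch> remote) {
        DirDiffSketch diff = new DirDiffSketch();
        // TreeMap copies give a deterministic, name-sorted iteration order.
        for (Map.Entry<String, FileEntrySketch> e : new TreeMap<>(local).entrySet()) {
            FileEntrySketch remoteEntry = remote.get(e.getKey());
            if (remoteEntry != null && remoteEntry.sameContentAs(e.getValue())) {
                diff.toRetain.add(e.getKey());
            } else {
                diff.toUpload.add(e.getKey());
            }
        }
        for (String name : new TreeMap<>(remote).keySet()) {
            if (!local.containsKey(name)) {
                diff.toDelete.add(name);
            }
        }
        return diff;
    }
}
```

   In this sketch only the files in toUpload would need to be sent to the 
blob store on the next backup. Cleanup of the old remote copy of a changed 
file is not modeled.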
   
   2. Add BlobStoreManager APIs to samza-api/blobstore: 
       1. BlobStoreManager -> Interface that exposes GET/PUT/DELETE API calls 
to a blob store. A special removeTTL API call removes the TTL of a blob. It is 
used in garbage collection, as explained in SAMZA-2657. 
       2. Metadata -> Metadata associated with a request to the blob store. 
Contains job details, store name and payload details. 
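
   As a rough sketch of the API shape described above, the following shows 
an asynchronous blob store interface with an in-memory implementation. All 
names and signatures here (BlobStoreSketch, BlobMetadataSketch, 
InMemoryBlobStoreSketch, isPermanent) are hypothetical and are not the 
interfaces added by this PR. The TTL handling is also an assumption: a blob is 
treated as expirable until removeTTL is called on it.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical request metadata: job details, store name, payload details.
final class BlobMetadataSketch {
    final String jobName;
    final String jobId;
    final String storeName;
    final String payloadPath;

    BlobMetadataSketch(String jobName, String jobId, String storeName, String payloadPath) {
        this.jobName = jobName;
        this.jobId = jobId;
        this.storeName = storeName;
        this.payloadPath = payloadPath;
    }
}

// Hypothetical async API; futures let callers issue many requests in parallel.
interface BlobStoreSketch {
    CompletableFuture<String> put(byte[] payload, BlobMetadataSketch metadata);
    CompletableFuture<byte[]> get(String blobId, BlobMetadataSketch metadata);
    CompletableFuture<Void> delete(String blobId, BlobMetadataSketch metadata);
    CompletableFuture<Void> removeTTL(String blobId, BlobMetadataSketch metadata);
}

// In-memory stand-in for a real blob store, for illustration only.
final class InMemoryBlobStoreSketch implements BlobStoreSketch {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();
    // Assumption: blobs are expirable until their TTL is removed.
    private final Set<String> permanent = ConcurrentHashMap.newKeySet();

    @Override
    public CompletableFuture<String> put(byte[] payload, BlobMetadataSketch metadata) {
        String blobId = UUID.randomUUID().toString();
        blobs.put(blobId, payload.clone());
        return CompletableFuture.completedFuture(blobId);
    }

    @Override
    public CompletableFuture<byte[]> get(String blobId, BlobMetadataSketch metadata) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        byte[] payload = blobs.get(blobId);
        if (payload == null) {
            result.completeExceptionally(new IllegalArgumentException("No blob " + blobId));
        } else {
            result.complete(payload.clone());
        }
        return result;
    }

    @Override
    public CompletableFuture<Void> delete(String blobId, BlobMetadataSketch metadata) {
        blobs.remove(blobId);
        permanent.remove(blobId);
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public CompletableFuture<Void> removeTTL(String blobId, BlobMetadataSketch metadata) {
        if (blobs.containsKey(blobId)) {
            permanent.add(blobId);
        }
        return CompletableFuture.completedFuture(null);
    }

    // Helper for the sketch only: true once removeTTL has been applied.
    boolean isPermanent(String blobId) {
        return permanent.contains(blobId);
    }
}
```

   A caller could start several put or get calls and combine the returned 
futures with CompletableFuture.allOf, which is one way the parallel backup 
and restore mentioned in the problem statement might be driven.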
   
   Other design-related details can be found in the design doc attached to 
the SAMZA-2657 ticket. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

