shekhars-li opened a new pull request #1501:
URL: https://github.com/apache/samza/pull/1501
Problem:
State restore for Samza jobs with large state takes a long time, sometimes
up to multiple hours, because the Kafka-backed changelog is restored one
message at a time. This PR introduces blob store based backup and restore for
Samza, which can back up and restore in parallel and addresses this problem.
Solution:
1. Add blob store under samza-core/storage: Supports blob store based
   backup and restore.
   1. index - Index classes define a way of accessing a remote file or
      subdirectory and the metadata associated with it.
      1. FileBlob -> Representation of a blob in the blob store, with a
         blob id and offset.
      2. FileIndex -> Representation of a file in the blob store as a set of
         FileBlobs and its metadata, like file name, size, permissions etc.
      3. DirIndex -> Representation of a directory (and its
         sub-directories) in the blob store bucket and metadata like
         directory name.
      4. SnapshotIndex -> Representation of a snapshot directory and
         metadata like job name, job id, store name etc.
   2. DirDiff - The DirDiff class represents the diff/delta between a local
      snapshot and a remote snapshot. A corresponding util class, DirDiffUtil,
      is used to calculate the delta between local and remote snapshots.
2. Add BlobStoreManager APIs to samza-api/blobstore:
   1. BlobStoreManager -> Interface exposing GET/PUT/DELETE API calls to a
      blob store. A special removeTTL API call is introduced to remove the
      TTL of a blob; it is used in garbage collection, as explained in
      SAMZA-2657.
   2. Metadata -> Metadata associated with a request to the blob store.
      Contains job details, store name, and payload details.
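
To make the shape of the BlobStoreManager contract described above easier to picture, here is a minimal sketch. It is not the interface added in this PR: `SketchBlobStoreManager`, `RequestMetadata`, `InMemoryBlobStore`, all method signatures, and the idea that every new blob starts with a TTL are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class BlobStoreSketch {

  /** Hypothetical request metadata: job details, store name, payload details. */
  static final class RequestMetadata {
    final String jobName;
    final String jobId;
    final String storeName;
    final String payloadPath;
    final long payloadSize;

    RequestMetadata(String jobName, String jobId, String storeName,
        String payloadPath, long payloadSize) {
      this.jobName = jobName;
      this.jobId = jobId;
      this.storeName = storeName;
      this.payloadPath = payloadPath;
      this.payloadSize = payloadSize;
    }
  }

  /** Hypothetical async GET/PUT/DELETE/removeTTL contract (not the real Samza API). */
  interface SketchBlobStoreManager {
    CompletableFuture<String> put(InputStream in, RequestMetadata metadata);
    CompletableFuture<Void> get(String blobId, OutputStream out, RequestMetadata metadata);
    CompletableFuture<Void> delete(String blobId, RequestMetadata metadata);
    CompletableFuture<Void> removeTTL(String blobId, RequestMetadata metadata);
  }

  /** Toy in-memory implementation, useful only to exercise the contract. */
  static final class InMemoryBlobStore implements SketchBlobStoreManager {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();
    private final Map<String, Boolean> ttlSet = new ConcurrentHashMap<>();

    @Override
    public CompletableFuture<String> put(InputStream in, RequestMetadata metadata) {
      return CompletableFuture.supplyAsync(() -> {
        try {
          String blobId = UUID.randomUUID().toString();
          blobs.put(blobId, in.readAllBytes()); // InputStream.readAllBytes needs Java 9+
          ttlSet.put(blobId, Boolean.TRUE);     // assumption: new blobs start with a TTL
          return blobId;
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }

    @Override
    public CompletableFuture<Void> get(String blobId, OutputStream out, RequestMetadata metadata) {
      return CompletableFuture.runAsync(() -> {
        byte[] data = blobs.get(blobId);
        if (data == null) {
          throw new IllegalArgumentException("No such blob: " + blobId);
        }
        try {
          out.write(data);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }

    @Override
    public CompletableFuture<Void> delete(String blobId, RequestMetadata metadata) {
      return CompletableFuture.runAsync(() -> {
        blobs.remove(blobId);
        ttlSet.remove(blobId);
      });
    }

    @Override
    public CompletableFuture<Void> removeTTL(String blobId, RequestMetadata metadata) {
      // Clears the TTL flag only if the blob exists; a no-op otherwise.
      return CompletableFuture.runAsync(() -> ttlSet.replace(blobId, Boolean.FALSE));
    }

    boolean hasTtl(String blobId) {
      return Boolean.TRUE.equals(ttlSet.get(blobId));
    }

    boolean contains(String blobId) {
      return blobs.containsKey(blobId);
    }
  }

  public static void main(String[] args) {
    InMemoryBlobStore store = new InMemoryBlobStore();
    RequestMetadata md = new RequestMetadata("my-job", "1", "my-store", "000001.sst", 5L);
    byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
    String blobId = store.put(new ByteArrayInputStream(payload), md).join();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    store.get(blobId, out, md).join();
    System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    store.removeTTL(blobId, md).join();
    System.out.println("TTL still set: " + store.hasTtl(blobId));
  }
}
```

Returning futures from every call is one way to let a caller issue many uploads or downloads at once, which is the property the parallel backup and restore relies on; the actual signatures are in the PR diff.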
Other design-related details can be found in the design doc attached to
the SAMZA-2657 ticket.
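
As a rough illustration of the index hierarchy and the local-vs-remote diff described in item 1, here is a minimal sketch. The field names, the `SimpleDirDiff` type, and the `diff` helper are hypothetical; the helper compares a single flat directory by file name and size only, which is an assumption made here and not a description of how DirDiffUtil works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SnapshotIndexSketch {

  /** A blob in the blob store, identified by blob id and offset. */
  static final class FileBlob {
    final String blobId;
    final int offset;
    FileBlob(String blobId, int offset) { this.blobId = blobId; this.offset = offset; }
  }

  /** A file: a set of blobs plus metadata (only name and size are modeled here). */
  static final class FileIndex {
    final String fileName;
    final long size;
    final List<FileBlob> blobs;
    FileIndex(String fileName, long size, List<FileBlob> blobs) {
      this.fileName = fileName; this.size = size; this.blobs = blobs;
    }
  }

  /** A directory: its files and, recursively, its sub-directories. */
  static final class DirIndex {
    final String dirName;
    final List<FileIndex> files;
    final List<DirIndex> subDirs;
    DirIndex(String dirName, List<FileIndex> files, List<DirIndex> subDirs) {
      this.dirName = dirName; this.files = files; this.subDirs = subDirs;
    }
  }

  /** A snapshot: job and store metadata plus the root directory index. */
  static final class SnapshotIndex {
    final String jobName;
    final String jobId;
    final String storeName;
    final DirIndex root;
    SnapshotIndex(String jobName, String jobId, String storeName, DirIndex root) {
      this.jobName = jobName; this.jobId = jobId; this.storeName = storeName; this.root = root;
    }
  }

  /** Hypothetical flat diff result: file names to upload, keep, and delete. */
  static final class SimpleDirDiff {
    final List<String> added;
    final List<String> retained;
    final List<String> removed;
    SimpleDirDiff(List<String> added, List<String> retained, List<String> removed) {
      this.added = added; this.retained = retained; this.removed = removed;
    }
  }

  /** Compares local files (name -> size) with one remote directory, ignoring sub-directories. */
  static SimpleDirDiff diff(Map<String, Long> localFiles, DirIndex remote) {
    Map<String, FileIndex> remoteByName = new TreeMap<>();
    for (FileIndex f : remote.files) {
      remoteByName.put(f.fileName, f);
    }
    List<String> added = new ArrayList<>();
    List<String> retained = new ArrayList<>();
    List<String> removed = new ArrayList<>();
    // Sorted iteration keeps the result order deterministic.
    for (Map.Entry<String, Long> local : new TreeMap<>(localFiles).entrySet()) {
      FileIndex match = remoteByName.get(local.getKey());
      if (match != null && match.size == local.getValue()) {
        retained.add(local.getKey()); // same name and size: reuse the remote copy
      } else {
        added.add(local.getKey());    // new or changed locally: needs an upload
      }
    }
    for (FileIndex f : remoteByName.values()) {
      if (!retained.contains(f.fileName)) {
        removed.add(f.fileName);      // remote copy is stale or gone locally
      }
    }
    return new SimpleDirDiff(added, retained, removed);
  }

  public static void main(String[] args) {
    DirIndex remote = new DirIndex("", Arrays.asList(
        new FileIndex("a.sst", 10L, Arrays.asList(new FileBlob("blob-a", 0))),
        new FileIndex("b.sst", 20L, Arrays.asList(new FileBlob("blob-b", 0)))),
        Collections.emptyList());
    Map<String, Long> local = new HashMap<>();
    local.put("b.sst", 20L);
    local.put("c.sst", 30L);
    SimpleDirDiff d = diff(local, remote);
    System.out.println("added=" + d.added + " retained=" + d.retained + " removed=" + d.removed);
  }
}
```

Only the files in `added` would need to be uploaded for the next snapshot, which is the motivation for computing a delta instead of re-uploading the whole store.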