prateekm opened a new pull request, #1657:
URL: https://github.com/apache/samza/pull/1657
Feature: Allow bulk restore from blob store for side input stores.
If a blob store state backend is available, side input store restores can be
sped up significantly by bulk-restoring initial state from the blob store, and
only using the Kafka side input topic to catch up on the delta since the last
checkpoint.
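The restore ordering described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `SideInputRestoreSketch`, `BulkRestore`, and `planRestore` are hypothetical names, offsets are modeled as plain `long` values, and the offset saved with the backup is assumed to be the offset to resume from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class SideInputRestoreSketch {
  /**
   * Hypothetical stand-in for a blob store bulk restore. Returns the resume
   * offset that was backed up with the store contents, if one was found.
   */
  interface BulkRestore {
    Optional<Long> restoreAndGetCheckpointedOffset();
  }

  /**
   * Returns the restore steps in the order they would run.
   * Without a blob store backend, the whole topic is replayed from oldestOffset.
   */
  static List<String> planRestore(Optional<BulkRestore> blobStore, long oldestOffset) {
    List<String> steps = new ArrayList<>();
    long startOffset = oldestOffset;
    if (blobStore.isPresent()) {
      // Step 1: bulk-restore the initial state from the blob store.
      steps.add("bulk-restore from blob store");
      Optional<Long> checkpointed = blobStore.get().restoreAndGetCheckpointedOffset();
      if (checkpointed.isPresent()) {
        // Assumption: the saved value is the offset to resume from.
        startOffset = checkpointed.get();
      }
    }
    // Step 2: catch up on the delta from the side input topic.
    steps.add("consume side input topic from offset " + startOffset);
    return steps;
  }
}
```

The point of the ordering is that the topic is only read from the checkpointed offset onward once the bulk restore has supplied the earlier state.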
Additional Context for Reviewers:
The side inputs RunLoop and the regular Task RunLoop are separate and commit
independently. SideInputTask commit flushes the side input store and writes
the SIDE-INPUT-OFFSETS file in the store directory, but does not create a store
checkpoint or upload it to the state backend. This SIDE-INPUT-OFFSETS file is
used during restore to determine the starting offset in the side input topic.
Regular Task commit creates store checkpoints, uploads them to the state
backends, and saves the resulting store StateCheckpointMarker in the task
checkpoint, but currently does not copy the SIDE-INPUT-OFFSETS file to the
checkpoint directory.
Changes:
This is a follow-up to #1654 and #1655.
1. Copy the SIDE-INPUT-OFFSETS file (if it exists) to the side input store
checkpoint directory created during regular Task commit, so that it can be
backed up along with the side input store contents and used during restore for
incremental catchup. This copying needs to be done under process-wide
synchronization to avoid data corruption, since the two RunLoops are currently
completely independent of each other and do not coordinate/synchronize their
commits, and there are no atomic file copy APIs.
2. If a blob store state backend factory is configured for side input
stores, use it to do an initial bulk restore in ContainerStorageManager before
starting the incremental restore and consumption from side input topics.
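As a minimal sketch of the synchronized copy in change 1, assuming a single static lock object shared by both commit paths: the class `SideInputOffsetsCopy`, the lock, and both method names are hypothetical and are not taken from the PR. Only standard `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SideInputOffsetsCopy {
  static final String OFFSETS_FILE_NAME = "SIDE-INPUT-OFFSETS";

  // Hypothetical process-wide lock: static, so every RunLoop in the JVM shares it.
  private static final Object OFFSETS_FILE_LOCK = new Object();

  /** Writer side (side input commit): replaces the offsets file under the lock. */
  static void writeOffsets(Path storeDir, String contents) throws IOException {
    synchronized (OFFSETS_FILE_LOCK) {
      Files.write(storeDir.resolve(OFFSETS_FILE_NAME),
          contents.getBytes(StandardCharsets.UTF_8));
    }
  }

  /**
   * Reader side (regular task commit): copies the offsets file into the
   * checkpoint directory if it exists. Returns true if a copy was made.
   */
  static boolean copyOffsetsIfPresent(Path storeDir, Path checkpointDir) throws IOException {
    synchronized (OFFSETS_FILE_LOCK) {
      Path source = storeDir.resolve(OFFSETS_FILE_NAME);
      if (!Files.exists(source)) {
        return false;
      }
      // The copy is not atomic, so the lock keeps the writer out until it finishes.
      Files.copy(source, checkpointDir.resolve(OFFSETS_FILE_NAME),
          StandardCopyOption.REPLACE_EXISTING);
      return true;
    }
  }
}
```

Holding the same lock on both the write and the copy is what prevents the regular Task commit from copying a half-written file; a lock on only one side would not help.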
Tests: Added new integration tests to verify BlobStoreStateBackend
functionality in general, and the new BlobStoreStateBackend + Side Inputs
functionality in particular.