prateekm opened a new pull request, #1657:
URL: https://github.com/apache/samza/pull/1657

   Feature: Allow bulk restore from blob store for side input stores.
   If a blob store state backend is available, side input store restores can be 
sped up significantly by bulk-restoring initial state from the blob store, and 
only using the Kafka side input topic to catch up on the delta since the last 
checkpoint. 
   
   Additional Context for Reviewers:
   Side inputs RunLoop and regular Task RunLoop are separate and commit 
independently. SideInputTask commit flushes the side input store and writes 
the SIDE-INPUT-OFFSETS file in the store directory, but does not create a store 
checkpoint or upload it to the state backend. This SIDE-INPUT-OFFSETS file is 
used during restore to determine the starting offset in the side input topic. 
Regular Task commit creates store checkpoints, uploads them to the state 
backends, and saves the resulting store StateCheckpointMarker in the task 
checkpoint, but currently does not copy the SIDE-INPUT-OFFSETS file to the 
checkpoint directory.
    
   Changes:
   This is a follow-up to #1654 and #1655.
   1. Copy the SIDE-INPUT-OFFSETS file (if it exists) to the side input store 
checkpoint directory created during regular Task commit, so that it can be 
backed up along with the side input store contents and used during restore for 
incremental catchup. This copying needs to be done under process-wide 
synchronization to avoid data corruption, since the two RunLoops are currently 
completely independent of each other and do not coordinate/synchronize their 
commits, and there are no atomic file copy APIs.
   2. If a blob store state backend factory is configured for side input 
stores, use it to do an initial bulk restore in ContainerStorageManager before 
starting the incremental restore and consumption from side input topics.
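
   Illustration for change 1: a minimal sketch of copying the 
SIDE-INPUT-OFFSETS file under a process-wide lock. This is not the code in 
this PR. The class name, method names, and the choice of a single static lock 
object are assumptions made for illustration; only the SIDE-INPUT-OFFSETS file 
name comes from the description above. The idea is that both the side input 
commit (writer) and the regular Task commit (copier) take the same JVM-wide 
lock, so the copy never observes a half-written file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Samza.
final class SideInputOffsetsCopier {
  // File name taken from the description above.
  static final String OFFSETS_FILE_NAME = "SIDE-INPUT-OFFSETS";

  // One lock per JVM (static), shared by writers and copiers. This is the
  // assumed form of "process-wide synchronization" in this sketch.
  private static final Object LOCK = new Object();

  private SideInputOffsetsCopier() { }

  // Writer side: stands in for the side input commit writing the offsets file.
  static void writeOffsets(Path storeDir, byte[] contents) throws IOException {
    synchronized (LOCK) {
      Files.write(storeDir.resolve(OFFSETS_FILE_NAME), contents);
    }
  }

  // Copier side: stands in for the regular Task commit copying the offsets
  // file into the store checkpoint directory. Returns false if the file does
  // not exist yet (for example, before the first side input commit).
  static boolean copyIfExists(Path storeDir, Path checkpointDir) throws IOException {
    synchronized (LOCK) {
      Path source = storeDir.resolve(OFFSETS_FILE_NAME);
      if (!Files.exists(source)) {
        return false;
      }
      Files.copy(source, checkpointDir.resolve(OFFSETS_FILE_NAME),
          StandardCopyOption.REPLACE_EXISTING);
      return true;
    }
  }
}
```

   A static lock only coordinates threads within one JVM, which matches the 
case described above where both RunLoops run in the same container process.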
    
   
Tests: Added new integration tests to verify BlobStoreStateBackend 
functionality in general, and the new BlobStoreStateBackend + Side Inputs 
functionality in particular.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.