chaoqin-li1123 opened a new pull request, #41099:
URL: https://github.com/apache/spark/pull/41099

   ### What changes were proposed in this pull request?
   To reduce checkpoint duration and end-to-end latency, we propose to:
   1. During state commit, make the state of a microbatch durable by syncing 
the changelog, instead of the state snapshot, to the checkpoint directory.
   2. Upload snapshots in the background to enable changelog purging and faster 
failure recovery.
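
   The two steps above can be illustrated with a toy in-memory model. This is 
only a sketch of the general changelog-plus-snapshot idea, not the code in this 
pull request: the class `ChangelogStateStore`, the function `recover`, and the 
dict-based "checkpoint directory" are hypothetical names introduced for 
illustration, and the purge rule is deliberately simplified.

```python
class ChangelogStateStore:
    """Toy model of changelog-based checkpointing (hypothetical, not Spark code)."""

    def __init__(self):
        self.state = {}       # live state; stands in for the local RocksDB instance
        self.pending = []     # writes made by the current, uncommitted microbatch
        self.version = 0      # last committed version
        self.changelogs = {}  # version -> [(key, value or None)]; "checkpoint dir"
        self.snapshots = {}   # version -> full copy of state; "checkpoint dir"

    def put(self, key, value):
        # None is reserved as the delete marker, so values must not be None.
        self.state[key] = value
        self.pending.append((key, value))

    def remove(self, key):
        self.state.pop(key, None)
        self.pending.append((key, None))

    def commit(self):
        # Step 1: make the batch durable by persisting only its changes.
        self.version += 1
        self.changelogs[self.version] = self.pending
        self.pending = []
        return self.version

    def upload_snapshot(self):
        # Step 2: background maintenance. Persist a full snapshot, then purge
        # the changelogs it covers (simplified: no retention window).
        if self.pending:
            raise RuntimeError("snapshot only between microbatches in this toy model")
        self.snapshots[self.version] = dict(self.state)
        for old in [v for v in self.changelogs if v <= self.version]:
            del self.changelogs[old]


def recover(snapshots, changelogs, version):
    """Rebuild state at `version`: latest snapshot at or before it, plus replay."""
    base = max((v for v in snapshots if v <= version), default=0)
    state = dict(snapshots.get(base, {}))
    for v in range(base + 1, version + 1):
        for key, value in changelogs[v]:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state
```

   In this model a commit writes only the keys touched by the batch, while a 
more recent snapshot shortens the replay needed in `recover`, which is the 
sense in which background snapshots help failure recovery.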
   
   
   ### Why are the changes needed?
   We have identified state checkpointing latency as one of the major 
performance bottlenecks for stateful streaming queries. Currently, the RocksDB 
state store pauses the RocksDB instances to upload a snapshot to the cloud when 
committing a batch, which is heavyweight and has unpredictable performance.
   With changelog-based checkpointing, the RocksDB instance runs without 
interruption, which improves RocksDB operation performance. It also 
dramatically reduces commit time and batch duration because less data is 
uploaded during state commit. With this change, stateful queries using the 
RocksDB state store will have lower and more predictable latency.
   
   ### How was this patch tested?
   Add unit tests for the changelog checkpointing utility.
   Add unit tests and integration tests that check backward compatibility with 
existing checkpoints.
   Enable the RocksDB state store unit tests and stateful streaming query 
integration tests to run with changelog checkpointing enabled.
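
   For readers who want to try a query with changelog checkpointing enabled, 
the settings might look like the sketch below. The configuration keys are the 
ones listed in the Structured Streaming guide for Spark 3.5 and are not quoted 
in this description, so treat them as an assumption and confirm them against 
the merged change. `apply_conf` is a hypothetical helper.

```python
# Settings for a RocksDB-backed state store with changelog checkpointing.
# Key names are taken from the Spark 3.5 Structured Streaming guide and are
# an assumption relative to the text of this pull request.
ROCKSDB_CHANGELOG_CONF = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled": "true",
}


def apply_conf(builder, conf=ROCKSDB_CHANGELOG_CONF):
    """Apply each setting to a SparkSession.builder-like object and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

   With PySpark installed, `apply_conf(SparkSession.builder).getOrCreate()` 
would produce a session carrying these settings.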
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

