chaoqin-li1123 opened a new pull request, #41099: URL: https://github.com/apache/spark/pull/41099
### What changes were proposed in this pull request? In order to reduce the checkpoint duration and end to end latency, we propose to 1. During state commit, make the state of a microbatch durable by syncing the changelog instead of the state snapshot to the checkpoint directory. 2. Upload snapshot in the background to enable changelog purging and faster failure recovery. ### Why are the changes needed? We have identified state checkpointing latency as one of the major performance bottlenecks for stateful streaming queries. Currently, RocksDB state store pauses the RocksDB instances to upload a snapshot to the cloud when committing a batch, which is heavy weight and has unpredictable performance. With changelog based checkpointing, we allow the RocksDB instance to run uninterruptibly, which improves RocksDB operation performance. This also dramatically reduces the commit time and batch duration because we are uploading a smaller amount of data during state commit. With this change, stateful query with RocksDB state store will have lower and more predictable latency. ### How was this patch tested? Add unit test for changelog checkpointing utility. Add unit test and integration test that check backward compatibility with existing checkpoint. Enable RocksDB state store unit test and stateful streaming query integration test to run with changelog checkpointing enabled. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
