Re: Checkpointing in Spark Structured Streaming

Jungtaek Lim Mon, 22 Mar 2021 13:09:16 -0700

I see some points making async checkpoint be tricky to add in micro-batch;
one example is "end to end exactly-once", as the commit phase in sink for
the batch N can be run "after" the batch N + 1 has been started and write
for batch N + 1 can happen before committing batch N. state store
checkpoint is tied to task lifecycle instead of checkpoint phase, which is
also tricky to make it be async.

There may be still some spots to optimize on checkpointing though, one
example is SPARK-34383 [1]. I've figured out it helps to reduce latency on
checkpointing with object stores by 300+ ms per batch.

Btw, even though S3 is now strongly consistent, it doesn't mean it's HDFS
compatible as default implementation of SS checkpoint requires. Atomic
rename is not supported, as well as rename isn't just a change on metadata
(read from S3 and write to S3 again). Performance would be sub-optimal, and
Spark no longer be able to prevent concurrent streaming queries trying to
update to the same checkpoint which might possibly mess up the checkpoint.
You need to make sure there's only one streaming query running against a
specific checkpoint.

1. https://issues.apache.org/jira/browse/SPARK-34383

On Tue, Mar 23, 2021 at 1:55 AM Rohit Agrawal <ro...@tecton.ai> wrote:

> Hi,
>
> I have been experimenting with the Continuous mode and the Micro batch
> mode in Spark Structured Streaming. When enabling checkpoint to S3 instead
> of the local File System we see that Continuous mode has no change in
> latency (expected due to async checkpointing) however the Micro-batch mode
> experiences degradation likely due to sync checkpointing.
>
> Is there any way to get async checkpointing in the micro-batching mode as
> well to improve latency. Could that be done with custom checkpointing logic
> ? Any pointers / experiences in that direction would be helpful.
>

Re: Checkpointing in Spark Structured Streaming

Reply via email to