Re: Checkpointing in Spark Structured Streaming

Rohit Agrawal Mon, 22 Mar 2021 13:56:55 -0700

Thank you for the reply. For our use case, it's okay to not have
exactly-once semantics. Given this use case of not needing exactly-once
a) Is there any negative implications if one were to use a custom state
store provider which asynchronously committed under the hood
b) Is there any other option to achieve this without using a custom state
store provider ?


Rohit

On Mon, Mar 22, 2021 at 4:09 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> I see some points making async checkpoint be tricky to add in micro-batch;
> one example is "end to end exactly-once", as the commit phase in sink for
> the batch N can be run "after" the batch N + 1 has been started and write
> for batch N + 1 can happen before committing batch N. state store
> checkpoint is tied to task lifecycle instead of checkpoint phase, which is
> also tricky to make it be async.
>
> There may be still some spots to optimize on checkpointing though, one
> example is SPARK-34383 [1]. I've figured out it helps to reduce latency on
> checkpointing with object stores by 300+ ms per batch.
>
> Btw, even though S3 is now strongly consistent, it doesn't mean it's HDFS
> compatible as default implementation of SS checkpoint requires. Atomic
> rename is not supported, as well as rename isn't just a change on metadata
> (read from S3 and write to S3 again). Performance would be sub-optimal, and
> Spark no longer be able to prevent concurrent streaming queries trying to
> update to the same checkpoint which might possibly mess up the checkpoint.
> You need to make sure there's only one streaming query running against a
> specific checkpoint.
>
> 1. https://issues.apache.org/jira/browse/SPARK-34383
>
> On Tue, Mar 23, 2021 at 1:55 AM Rohit Agrawal <ro...@tecton.ai> wrote:
>
>> Hi,
>>
>> I have been experimenting with the Continuous mode and the Micro batch
>> mode in Spark Structured Streaming. When enabling checkpoint to S3 instead
>> of the local File System we see that Continuous mode has no change in
>> latency (expected due to async checkpointing) however the Micro-batch mode
>> experiences degradation likely due to sync checkpointing.
>>
>> Is there any way to get async checkpointing in the micro-batching mode as
>> well to improve latency. Could that be done with custom checkpointing logic
>> ? Any pointers / experiences in that direction would be helpful.
>>
>

Re: Checkpointing in Spark Structured Streaming

Reply via email to