Hi all,

A big thanks to everyone who provided feedback on the SPIP!  My co-authors
and I really appreciate it. I am excited to see this amount of interest in
the proposal, and I am glad to see all the support this initiative is
getting from the community.


Let me summarize some of the common questions and concerns folks raised
in the SPIP Google Doc.

Common questions/concerns:

Why do you think Real-time mode will be more successful than Continuous
Mode?

I think there are a couple of reasons that separate real-time mode from
continuous mode:

   1. A clear plan to support both stateless and stateful queries.

   2. Reuse of existing, battle-tested Structured Streaming components
   (e.g., checkpointing, offset tracking).

   3. Broad functionality support, so that users can easily switch between
   real-time mode and other execution modes.  This lowers the barrier to
   entry for using real-time mode.


Is Real-time Mode going to replace Micro-batch Mode?

The existing execution modes, i.e., ProcessingTime and AvailableNow, still
have their own merits.  It is easy to control cost when running queries
in those modes.  It is conceivable that some use cases may migrate to
predominantly use real-time mode, though I would imagine cost-sensitive
workloads will remain on the existing execution modes.

What are the processing guarantees and failure semantics for Real-time Mode?

The processing semantics remain exactly-once.  This means updates to
internally managed state will be reflected exactly once.
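As a rough illustration of how "reflected exactly once" can work, here is a
minimal sketch (hypothetical code, not Spark's actual implementation;
Structured Streaming uses checkpointed offsets and versioned state stores):
state commits are tagged with a batch/epoch id, so a replayed batch after a
failure does not double-apply its updates.

```python
# Sketch only: exactly-once state updates via epoch-versioned commits.
class VersionedStateStore:
    def __init__(self):
        self.state = {}            # key -> running count
        self.committed_epoch = -1  # last durably committed batch/epoch

    def apply_batch(self, epoch, records):
        # A replayed epoch (<= last committed) is skipped, so its
        # updates are reflected in state exactly once.
        if epoch <= self.committed_epoch:
            return False
        for key in records:
            self.state[key] = self.state.get(key, 0) + 1
        self.committed_epoch = epoch
        return True

store = VersionedStateStore()
store.apply_batch(0, ["a", "b", "a"])
store.apply_batch(0, ["a", "b", "a"])  # replay after a failure: ignored
print(store.state)  # {'a': 2, 'b': 1}
```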

In real-time mode, when a task fails, we will fail the query instead of
retrying the task.  This is necessary for queries that contain a
"streaming shuffle", i.e., multi-stage queries.  The streaming shuffle does
not persist shuffle data; it simply sends data point to point, so tasks
that read from a shuffle cannot simply be restarted.  In the future we can
explore shuffle mechanisms that persist the shuffle data, e.g., a persistent
queue, but something like that would incur additional infrastructure costs.
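To make the restart argument concrete, here is a hypothetical model (not
Spark internals) of a point-to-point streaming shuffle: once a record has
been handed to the downstream operator, the channel retains nothing to
replay, so recovery has to restart the query from the source checkpoint
rather than retry one task.

```python
# Sketch only: why an unpersisted point-to-point shuffle forces a
# whole-query restart instead of a task-level retry.
class StreamingShuffleChannel:
    def __init__(self):
        self.buffer = []

    def send(self, record):
        self.buffer.append(record)

    def receive_all(self):
        # Hand everything to the consumer and forget it; there is no
        # persisted copy that a retried task could re-read.
        records, self.buffer = self.buffer, []
        return records

def run_query(source, checkpoint_offset):
    # Recovery path: replay the source from the last checkpoint and
    # re-shuffle everything, i.e., restart the whole query.
    channel = StreamingShuffleChannel()
    for record in source[checkpoint_offset:]:
        channel.send(record)
    return channel.receive_all()

source = ["r0", "r1", "r2", "r3"]
first_run = run_query(source, checkpoint_offset=0)

# If the downstream task fails after consuming its input, the channel is
# empty, so a task-level retry has nothing to read. The only safe
# recovery is to rerun from the source checkpoint:
recovered = run_query(source, checkpoint_offset=0)
assert recovered == first_run == ["r0", "r1", "r2", "r3"]
```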

How will watermarks be updated in Real-time Mode?

We plan to build a new watermark propagation mechanism so that watermarks
and other control messages can flow through the query interleaved with data
messages.  This will allow event-time information to reach operators sooner
than batch boundaries would.  A separate design document will be created to
explain the details.  Please note that this feature is only really needed
for functionality that depends on the advancement of event time to output
results, such as queries running in Append Mode.

Best,


Jerry


On Thu, May 29, 2025 at 9:25 AM Xiao Li <gatorsm...@gmail.com> wrote:

> +1
>
> Yuming Wang <yumw...@apache.org> 于2025年5月29日周四 02:22写道:
>
>> +1.
>>
>> On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> +1
>>> Sent from my iPhone
>>>
>>> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>
>>>
>>> +1 Nice feature
>>>
>>> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Kent Yao <y...@apache.org> 于2025年5月28日周三 19:31写道:
>>>>
>>>>> +1, LGTM.
>>>>>
>>>>> Kent
>>>>>
>>>>> 在 2025年5月29日星期四,Chao Sun <sunc...@apache.org> 写道:
>>>>>
>>>>>> +1. Super excited by this initiative!
>>>>>>
>>>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>> By unifying batch and low-latency streaming in Spark, we can
>>>>>>>> eliminate the need for separate streaming engines, reducing system
>>>>>>>> complexity and operational cost. Excited to see this direction!
>>>>>>>>
>>>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> My point about "in a real-time application or data, there is no
>>>>>>>>> such thing as an answer that is supposed to be late and correct;
>>>>>>>>> the timeliness is part of the application, and if I get the right
>>>>>>>>> answer too slowly it becomes useless or wrong" is actually
>>>>>>>>> fundamental to *why* we need this Spark Structured Streaming
>>>>>>>>> proposal.
>>>>>>>>>
>>>>>>>>> The proposal is precisely about enabling Spark to power
>>>>>>>>> applications where, as I define it, the *timeliness* of the
>>>>>>>>> answer is as critical as its *correctness*. Spark's current
>>>>>>>>> streaming engine, primarily operating on micro-batches, often
>>>>>>>>> delivers results that are technically "correct" but arrive too
>>>>>>>>> late to be truly useful for certain high-stakes, real-time
>>>>>>>>> scenarios. This makes them "useless or wrong" in a practical,
>>>>>>>>> business-critical sense.
>>>>>>>>>
>>>>>>>>> For example, in *real-time fraud detection* and in *high-frequency
>>>>>>>>> trading*, market data or trade execution commands must be
>>>>>>>>> delivered with minimal latency. Even a slight delay can mean missed
>>>>>>>>> opportunities or significant financial losses, making a "correct"
>>>>>>>>> price update useless if it is not instantaneous. This is what the
>>>>>>>>> proposal aims to make Spark able to handle: demanding use cases
>>>>>>>>> where a "late but correct" answer is simply not good enough. As a
>>>>>>>>> corollary, this is a fundamental concept, so it has to be treated
>>>>>>>>> as such in the SPIP, not just as a comment.
>>>>>>>>>
>>>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>>> GDPR
>>>>>>>>>
>>>>>>>>>    view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Mich,
>>>>>>>>>>
>>>>>>>>>> Sorry, I may be missing something here but what does your
>>>>>>>>>> definition here have to do with the SPIP?   Perhaps add comments 
>>>>>>>>>> directly
>>>>>>>>>> to the SPIP to provide context as the code snippet below is a direct 
>>>>>>>>>> copy
>>>>>>>>>> from the SPIP itself.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Denny
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> just to add
>>>>>>>>>>>
>>>>>>>>>>> A stronger definition of real time: the engineering definition
>>>>>>>>>>> of real time is roughly "fast enough to be interactive".
>>>>>>>>>>>
>>>>>>>>>>> However, I put forward a stronger definition. In a real-time
>>>>>>>>>>> application or data, there is no such thing as an answer that is
>>>>>>>>>>> supposed to be late and correct. The timeliness is part of the
>>>>>>>>>>> application. If I get the right answer too slowly, it becomes
>>>>>>>>>>> useless or wrong.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>>>>> GDPR
>>>>>>>>>>>
>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The current limitations in SSS come from micro-batching. If you
>>>>>>>>>>>> are going to reduce micro-batching, this reduction must be
>>>>>>>>>>>> balanced against the available processing capacity of the
>>>>>>>>>>>> cluster to prevent back pressure and instability. In the case of
>>>>>>>>>>>> Continuous Processing mode, one specifies a continuous trigger
>>>>>>>>>>>> with a desired checkpoint interval. Quoting the SPIP:
>>>>>>>>>>>>
>>>>>>>>>>>> "
>>>>>>>>>>>> df.writeStream
>>>>>>>>>>>>    .format("...")
>>>>>>>>>>>>    .option("...")
>>>>>>>>>>>>    .trigger(Trigger.RealTime("300 Seconds"))  // new trigger type to enable real-time mode
>>>>>>>>>>>>    .start()
>>>>>>>>>>>>
>>>>>>>>>>>> This Trigger.RealTime signals that the query should run in the
>>>>>>>>>>>> new ultra-low-latency execution mode.  A time interval can also
>>>>>>>>>>>> be specified, e.g. "300 Seconds", to indicate how long each
>>>>>>>>>>>> micro-batch should run for.
>>>>>>>>>>>> "
>>>>>>>>>>>>
>>>>>>>>>>>> will inevitably depend on many factors. It is not that simple.
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis
>>>>>>>>>>>> | GDPR
>>>>>>>>>>>>
>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>>>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that I've 
>>>>>>>>>>>>> been
>>>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek 
>>>>>>>>>>>>> Lim, and
>>>>>>>>>>>>> Michael Armbrust: [JIRA
>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>>>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>>>>>>>>>>>> ].
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode”
>>>>>>>>>>>>> in Spark Structured Streaming that significantly lowers 
>>>>>>>>>>>>> end-to-end latency
>>>>>>>>>>>>> for processing streams of data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is
>>>>>>>>>>>>> to make Spark capable of handling streaming jobs that need 
>>>>>>>>>>>>> results almost
>>>>>>>>>>>>> immediately (within O(100) milliseconds). We want to achieve this 
>>>>>>>>>>>>> without
>>>>>>>>>>>>> changing the high-level DataFrame/Dataset API that users already 
>>>>>>>>>>>>> use – so
>>>>>>>>>>>>> existing streaming queries can run in this new ultra-low-latency 
>>>>>>>>>>>>> mode by
>>>>>>>>>>>>> simply turning it on, without rewriting their logic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time
>>>>>>>>>>>>> applications (like instant anomaly alerts or live 
>>>>>>>>>>>>> personalization) that
>>>>>>>>>>>>> today cannot meet their latency requirements with Spark’s current 
>>>>>>>>>>>>> streaming
>>>>>>>>>>>>> engine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>>>>>>>>>>>> suggestions on this approach!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best,
>>>>>>> Yanbo
>>>>>>>
>>>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>>
