It should not be assumed. In something called "real-time", it should be made explicit which clock-time constraints are and are not guaranteed.
On Thu, May 29, 2025 at 10:00 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> It was kind of hard to see what Mich's point was in the plethora of emails he sent :)
>
> In embedded systems, there is a concept of soft real-time and hard real-time. For these stream processing systems built for big data analytics, it is assumed that we are talking about soft real-time. Sure, there can be an argument as to why this mode is not named "low latency mode", but I honestly don't like debating naming. That name implies the existing execution modes are not low latency, which is not true. And what defines "low" in low latency? It is relative. That is why the name Real-time Mode was selected.
>
> On Thu, May 29, 2025 at 8:57 PM Mark Hamstra <markhams...@gmail.com> wrote:
>
>> I think you are missing his point. There is a fundamental difference between low-latency computation and real-time computing. Is what is described in the SPIP intended to provide results with real-time guarantees, or is it a misnamed effort to achieve low latency?
>>
>> On Thu, May 29, 2025 at 5:54 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> Thank you for chiming in and providing insights into the importance of getting not only correct results but also timely results. You are absolutely right that the reason something like Real-time Mode is valuable is its ability to provide timely results for use cases that require users to react very quickly to data. I can emphasize this point in the SPIP.
>>>
>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> From what I have seen, there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only).
>>>> Given the objectives of the thread, we ought to focus on what is meant by real time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>>
>>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond-latency processing. Chiefly:
>>>>
>>>> - Event-at-a-time (or very small batch groups): the system processes individual events, or extremely small groups of events (micro-batches), as they flow through the pipeline.
>>>> - Minimal latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>>> - Most business use cases (say, financial markets) can live with this, as they do not rely on hard real-time edges.
>>>>
>>>> Now, what is meant by "Real-time Mode"?
>>>>
>>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode", it suggests a specific implementation that achieves real-time characteristics but might do so differently, or more robustly, than a "continuous" mode attempt. Going back to my earlier point: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>>
>>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
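[Editor's note: Mich's "Late and Correct is Useless" principle can be sketched in a few lines of plain Python. This is a hypothetical illustration only; the `Result` and `usable` names are invented for this example and are not Spark API.]

```python
from dataclasses import dataclass

@dataclass
class Result:
    value: float
    produced_at_ms: int  # when the answer became available

def usable(result: Result, event_time_ms: int, deadline_ms: int) -> bool:
    """A correct answer delivered after its deadline is treated as wrong:
    timeliness is part of the application, not an afterthought."""
    return (result.produced_at_ms - event_time_ms) <= deadline_ms

# A price update computed correctly but delivered 500 ms after the event
# misses a 100 ms deadline, so it is "useless or wrong" in practice:
print(usable(Result(value=101.25, produced_at_ms=1_500),
             event_time_ms=1_000, deadline_ms=100))  # False
```

Under this view, a system's correctness criterion includes a clock-time bound, which is exactly the distinction the thread draws between "low latency" (best effort) and "real time" (an explicit constraint).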
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>> view my LinkedIn profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>> > +1
>>>>> >
>>>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>>>> >
>>>>> > > +1.
>>>>> > >
>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>> > >
>>>>> > >> +1
>>>>> > >> Sent from my iPhone
>>>>> > >>
>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>>> > >>
>>>>> > >> +1 Nice feature
>>>>> > >>
>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>>> > >>
>>>>> > >>> +1
>>>>> > >>>
>>>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>>>> > >>>
>>>>> > >>>> +1, LGTM.
>>>>> > >>>>
>>>>> > >>>> Kent
>>>>> > >>>>
>>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>> > >>>>
>>>>> > >>>>> +1. Super excited by this initiative!
>>>>> > >>>>>
>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> +1
>>>>> > >>>>>>
>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> +1
>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>> Hi,
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer which is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal would make Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough.
>>>>> > >>>>>>>> As a corollary, it is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>> > >>>>>>>>
>>>>> > >>>>>>>>> Hey Mich,
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Thanks,
>>>>> > >>>>>>>>> Denny
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>>> Just to add:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> However, I put forward a stronger definition. In a real-time application, there is no such thing as an answer which is supposed to be late and correct.
>>>>> > >>>>>>>>>> The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval, quote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> "
>>>>> > >>>>>>>>>>> df.writeStream
>>>>> > >>>>>>>>>>>   .format("...")
>>>>> > >>>>>>>>>>>   .option("...")
>>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>>> > >>>>>>>>>>>   .start()
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>> > >>>>>>>>>>> "
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> will inevitably depend on many factors.
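[Editor's note: the capacity caveat in the message above, that a shorter trigger interval must be balanced against the cluster's processing capacity or back pressure builds, can be sketched with a toy queueing model. The numbers and the `backlog_after` helper are invented for illustration and are not Spark code.]

```python
# Toy model: if each batch takes `process_ms` of work but new batches are
# triggered every `interval_ms`, a backlog accumulates whenever
# process_ms > interval_ms -- the essence of back pressure and instability.
def backlog_after(n_batches: int, interval_ms: float, process_ms: float) -> float:
    """Accumulated lag (ms) behind schedule after n batches."""
    per_batch_lag = max(0.0, process_ms - interval_ms)
    return n_batches * per_batch_lag

# A 100 ms trigger with 120 ms of work per batch falls further behind forever:
print(backlog_after(100, interval_ms=100, process_ms=120))  # 2000.0
# The same cluster keeps up with a 200 ms trigger (zero accumulated lag):
print(backlog_after(100, interval_ms=200, process_ms=120))  # 0.0
```

The point matches the thread: the usable trigger interval is bounded below by per-batch processing time, so "just make the interval smaller" is not that simple.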
>>>>> > >>>>>>>>>>> Not that simple.
>>>>> > >>>>>>>>>>> HTH
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Hi all,
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds).
>>>>> > >>>>>>>>>>>> We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
>>>>> > >>>>>>
>>>>> > >>>>>> --
>>>>> > >>>>>> Best,
>>>>> > >>>>>> Yanbo
>>>>> > >>
>>>>> > >> --
>>>>> > >> John Zhuge
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
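[Editor's note: the compatibility goal in the announcement, that the same high-level query runs unchanged while only the execution mode is switched, can be illustrated with a small strategy-pattern sketch. This is plain Python standing in for the proposed behaviour, not actual Spark code; `query` and `run` are invented names.]

```python
from typing import Callable, Iterable, List

# The user's "query" logic stays fixed; only the execution mode is swapped,
# mirroring the SPIP's goal of enabling Real-time Mode without query rewrites.
def query(records: Iterable[int]) -> List[int]:
    return [r * 2 for r in records]

def run(query_fn: Callable[[Iterable[int]], List[int]],
        records: List[int], mode: str) -> List[int]:
    if mode == "micro-batch":        # existing mode: process in small batches
        out: List[int] = []
        for i in range(0, len(records), 2):
            out.extend(query_fn(records[i:i + 2]))
        return out
    elif mode == "real-time":        # proposed mode: record-at-a-time
        out = []
        for r in records:
            out.extend(query_fn([r]))
        return out
    raise ValueError(f"unknown mode: {mode}")

# Identical results either way; only the latency characteristics differ.
print(run(query, [1, 2, 3], "micro-batch") == run(query, [1, 2, 3], "real-time"))  # True
```

The design point this sketches is that the mode is an execution-strategy choice, not part of the query's semantics, which is why a single trigger change can enable it for existing jobs.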