Referencing others' misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.
On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
> Mark,
>
> As an example of my point, if you go to the Apache Storm (another stream processing engine) website:
>
> https://storm.apache.org/
>
> It describes Storm as:
>
> "Apache Storm is a free and open source distributed *realtime* computation system"
>
> If you go to Apache Flink:
>
> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>
> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>
> Thus, what the term "real-time" implies should not be confusing for folks in this area.
>
> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>
>> Mich,
>>
>> If I understood your last email correctly, I think you also wanted to have a discussion about naming? Why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described in this space as "real-time" implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in Real-time Mode means and what it guarantees. Just my thoughts. I would love to hear other perspectives.
>>
>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>
>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond latency processing. Chiefly:
>>>
>>> - Event-at-a-time (or very small batches): The system processes individual events or extremely small groups of events (micro-batches) as they flow through the pipeline.
>>> - Minimal latency: The primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>> - Most business use cases (say, financial markets) can live with this, as they do not operate at these latency edges.
>>>
>>> Now, what is meant by "Real-time Mode"?
>>>
>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode," it suggests a specific implementation that achieves real-time characteristics but might do so differently or more robustly than a "continuous" mode attempt.
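>>>
>>> For reference, today's Continuous Processing mode is selected through the existing Trigger.Continuous API. A minimal sketch, runnable in spark-shell (where spark is predefined):
>>>
>>> import org.apache.spark.sql.streaming.Trigger
>>>
>>> // The built-in "rate" source emits rows continuously; handy for latency tests.
>>> val events = spark.readStream
>>>   .format("rate")
>>>   .option("rowsPerSecond", "100")
>>>   .load()
>>>
>>> // Trigger.Continuous runs the query in Continuous Processing mode;
>>> // "1 second" is the checkpoint interval, not a batch boundary.
>>> events.writeStream
>>>   .format("console")
>>>   .option("checkpointLocation", "/tmp/chk-continuous") // hypothetical path
>>>   .trigger(Trigger.Continuous("1 second"))
>>>   .start()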
>>> Going back to my earlier point: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>
>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>
>>>> +1
>>>>
>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>> > +1
>>>> >
>>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>>> >
>>>> > > +1.
>>>> > >
>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>> > >
>>>> > >> +1
>>>> > >> Sent from my iPhone
>>>> > >>
>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>> > >>
>>>> > >> +1 Nice feature
>>>> > >>
>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>> > >>
>>>> > >>> +1
>>>> > >>>
>>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>>> > >>>
>>>> > >>>> +1, LGTM.
>>>> > >>>>
>>>> > >>>> Kent
>>>> > >>>>
>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>> > >>>>
>>>> > >>>>> +1. Super excited by this initiative!
>>>> > >>>>>
>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>> > >>>>>
>>>> > >>>>>> +1
>>>> > >>>>>>
>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> +1
>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>> > >>>>>>>
>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>> Hi,
>>>> > >>>>>>>>
>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct; timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>> > >>>>>>>>
>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
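>>>> > >>>>>>>>
>>>> > >>>>>>>> As an aside, Spark's existing event-time watermarking already encodes a version of this principle for input data: events that arrive behind the watermark are dropped rather than produce a late answer. A minimal sketch, runnable in spark-shell:
>>>> > >>>>>>>>
>>>> > >>>>>>>> import org.apache.spark.sql.functions.{col, window}
>>>> > >>>>>>>>
>>>> > >>>>>>>> // The built-in "rate" source emits (timestamp, value) rows.
>>>> > >>>>>>>> val events = spark.readStream
>>>> > >>>>>>>>   .format("rate")
>>>> > >>>>>>>>   .option("rowsPerSecond", "100")
>>>> > >>>>>>>>   .load()
>>>> > >>>>>>>>
>>>> > >>>>>>>> // Rows older than the watermark (max event time seen minus 10s)
>>>> > >>>>>>>> // are discarded: "late and correct" data never reaches the aggregation.
>>>> > >>>>>>>> val counts = events
>>>> > >>>>>>>>   .withWatermark("timestamp", "10 seconds")
>>>> > >>>>>>>>   .groupBy(window(col("timestamp"), "1 minute"))
>>>> > >>>>>>>>   .count()
>>>> > >>>>>>>>
>>>> > >>>>>>>> counts.writeStream
>>>> > >>>>>>>>   .outputMode("update")
>>>> > >>>>>>>>   .format("console")
>>>> > >>>>>>>>   .option("checkpointLocation", "/tmp/chk-watermark") // hypothetical path
>>>> > >>>>>>>>   .start()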
>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade-execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>
>>>> > >>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>>> Hey Mich,
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Thanks,
>>>> > >>>>>>>>> Denny
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>> Just to add:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> However, I put forward a stronger definition: in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability.
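>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> In today's micro-batch mode that balance is typically struck by pairing the trigger interval with a source-side rate cap. A minimal sketch, runnable in spark-shell with the spark-sql-kafka package on the classpath; the broker, topic, and checkpoint path are hypothetical:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> val stream = spark.readStream
>>>> > >>>>>>>>>>>   .format("kafka")
>>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
>>>> > >>>>>>>>>>>   .option("subscribe", "events")                    // hypothetical topic
>>>> > >>>>>>>>>>>   .option("maxOffsetsPerTrigger", "10000")          // cap records per micro-batch
>>>> > >>>>>>>>>>>   .load()
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> // A shorter interval lowers latency, but each batch must still
>>>> > >>>>>>>>>>> // finish within it, or the query falls behind.
>>>> > >>>>>>>>>>> stream.writeStream
>>>> > >>>>>>>>>>>   .format("console")
>>>> > >>>>>>>>>>>   .option("checkpointLocation", "/tmp/chk-microbatch")
>>>> > >>>>>>>>>>>   .trigger(Trigger.ProcessingTime("1 second"))
>>>> > >>>>>>>>>>>   .start()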
>>>> > >>>>>>>>>>> In the case of Continuous Processing mode, the choice of a specific continuous trigger with a desired checkpoint interval, quote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> "
>>>> > >>>>>>>>>>> df.writeStream
>>>> > >>>>>>>>>>>   .format("...")
>>>> > >>>>>>>>>>>   .option("...")
>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable Real-Time Mode
>>>> > >>>>>>>>>>>   .start()
>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>> > >>>>>>>>>>> "
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> will inevitably depend on many factors. It is not that simple.
>>>> > >>>>>>>>>>> HTH
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>> Hi all,
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use – so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
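>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> To make the compatibility point concrete, here is a sketch of how an existing query would opt in, assuming the Trigger.RealTime API proposed in this SPIP (not yet part of Spark); the broker, topic, and checkpoint path are hypothetical, and the Kafka source requires the spark-sql-kafka package:
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> // The query logic is ordinary DataFrame code and stays unchanged.
>>>> > >>>>>>>>>>>> val alerts = spark.readStream
>>>> > >>>>>>>>>>>>   .format("kafka")
>>>> > >>>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>>> > >>>>>>>>>>>>   .option("subscribe", "transactions")
>>>> > >>>>>>>>>>>>   .load()
>>>> > >>>>>>>>>>>>   .selectExpr("CAST(value AS STRING) AS payload")
>>>> > >>>>>>>>>>>>   .filter("payload LIKE '%ALERT%'")
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> // Only the trigger changes to opt in to the new mode.
>>>> > >>>>>>>>>>>> alerts.writeStream
>>>> > >>>>>>>>>>>>   .format("console")
>>>> > >>>>>>>>>>>>   .option("checkpointLocation", "/tmp/chk-realtime")
>>>> > >>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // proposed trigger from the SPIP
>>>> > >>>>>>>>>>>>   .start()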
>>>> > >>>>>> --
>>>> > >>>>>> Best,
>>>> > >>>>>> Yanbo
>>>> > >> --
>>>> > >> John Zhuge