Mark, As an example of my point, if you go to the Apache Storm (another stream processing engine) website:
https://storm.apache.org/

It describes Storm as: "Apache Storm is a free and open source distributed *realtime* computation system". If you go to Apache Flink: https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/ "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing". Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.

On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> Mich,
>
> If I understood your last email correctly, I think you also wanted to have
> a discussion about naming? Why are we calling this new execution mode
> described in the SPIP "Real-time Mode"? Here are my two cents. Firstly,
> "continuous mode" is taken, and we want another name to describe an
> execution mode that provides ultra-low-latency processing. We could have
> called it "low latency mode", though I don't really like that naming,
> since it implies the other execution modes are not low latency, which I
> don't believe is true. This new proposed mode can simply deliver even
> lower latency. Thus, we came up with the name "Real-time Mode". Of course,
> we are talking about "soft" real-time here. I think when we are talking
> about distributed stream processing systems in the space of big data
> analytics, it is reasonable to assume that anything described in this
> space as "real-time" implies "soft" real-time. Though if this is confusing
> or misleading, we can provide clear documentation on what "real-time" in
> Real-time Mode means and what it guarantees. Just my thoughts. I would
> love to hear other perspectives.
>
> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I think from what I have seen there are a good number of +1 responses, as
>> opposed to quantitative discussions (based on my observations only).
>> Given the objectives of the thread, we ought to focus on what is meant by
>> real time compared to continuous modes. To be fair, it is a common point
>> of confusion, and the terms are often used interchangeably in general
>> conversation, but in technical contexts, especially with streaming data
>> platforms, they have specific and important differences.
>>
>> "Continuous Mode" refers to a processing strategy that aims for true,
>> uninterrupted, sub-millisecond latency processing (see the trigger sketch
>> further below for how this is enabled in Spark today). Chiefly:
>>
>> - Event-at-a-time (or very small batch groups): the system processes
>>   individual events, or extremely small groups of events (micro-batches),
>>   as they flow through the pipeline.
>> - Minimal latency: the primary goal is to achieve the absolute lowest
>>   possible end-to-end latency, often on the order of milliseconds or even
>>   below.
>> - Most business use cases (say, financial markets) can live with this, as
>>   they do not rely on edge cases.
>>
>> Now, what is meant by "Real-time Mode"?
>>
>> This is where the nuance comes in. "Real-time" is a broader and sometimes
>> more subjective term. When the text introduces "Real-time Mode" as
>> distinct from "Continuous Mode", it suggests a specific implementation
>> that achieves real-time characteristics but might do so differently, or
>> more robustly, than a "continuous" mode attempt. Going back to my earlier
>> mention: in a real-time application, there is no such thing as an answer
>> that is late and correct. The timeliness is part of the application. If I
>> get the right answer too slowly, it becomes useless or wrong. This is
>> what I call the "Late and Correct is Useless" Principle.
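>>
>> Coming back to "Continuous Mode", here is the trigger sketch promised
>> above: a minimal, self-contained illustration of how Spark's existing
>> continuous processing is switched on. The rate source and console sink
>> are just stand-ins I picked for illustration:
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.streaming.Trigger
>>
>> val spark = SparkSession.builder()
>>   .appName("continuous-mode-sketch")
>>   .master("local[*]")
>>   .getOrCreate()
>>
>> // Built-in test source emitting (timestamp, value) rows; it is one of
>> // the sources supported in continuous processing.
>> val events = spark.readStream
>>   .format("rate")
>>   .option("rowsPerSecond", "10")
>>   .load()
>>
>> val query = events.writeStream
>>   .format("console")
>>   // Continuous processing: event-at-a-time execution; "1 second" is the
>>   // checkpoint interval, not a batch boundary.
>>   .trigger(Trigger.Continuous("1 second"))
>>   // Classic micro-batching, for comparison, would be:
>>   // .trigger(Trigger.ProcessingTime("10 seconds"))
>>   .start()
>>
>> query.awaitTermination()
>>
>> Note that continuous processing only supports map-like operations
>> (projections and selections, no aggregations), which is worth bearing in
>> mind in this comparison.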
>>
>> In summary, "Real-time Mode" seems to describe an approach that delivers
>> low-latency processing with high reliability and ease of use, leveraging
>> established, battle-tested components. I invite the audience to have a
>> discussion on this.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>
>>> +1
>>>
>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>> > +1
>>> >
>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>> >
>>> > > +1.
>>> > >
>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>> > >
>>> > >> +1
>>> > >> Sent from my iPhone
>>> > >>
>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>> > >>
>>> > >> +1 Nice feature
>>> > >>
>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>> > >>>
>>> > >>>> +1, LGTM.
>>> > >>>>
>>> > >>>> Kent
>>> > >>>>
>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>> > >>>>
>>> > >>>>> +1. Super excited by this initiative!
>>> > >>>>>
>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> +1
>>> > >>>>>>
>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
>>> > >>>>>> huaxin.ga...@gmail.com> wrote:
>>> > >>>>>>
>>> > >>>>>>> +1
>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can
>>> > >>>>>>> eliminate the need for separate streaming engines, reducing
>>> > >>>>>>> system complexity and operational cost. Excited to see this
>>> > >>>>>>> direction!
>>> > >>>>>>>
>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> Hi,
>>> > >>>>>>>>
>>> > >>>>>>>> My point that "in a real-time application, there is no such
>>> > >>>>>>>> thing as an answer that is late and correct; the timeliness is
>>> > >>>>>>>> part of the application, and if I get the right answer too
>>> > >>>>>>>> slowly it becomes useless or wrong" is actually fundamental to
>>> > >>>>>>>> *why* we need this Spark Structured Streaming proposal.
>>> > >>>>>>>>
>>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>>> > >>>>>>>> applications where, as I define it, the *timeliness* of the
>>> > >>>>>>>> answer is as critical as its *correctness*. Spark's current
>>> > >>>>>>>> streaming engine, primarily operating on micro-batches, often
>>> > >>>>>>>> delivers results that are technically "correct" but arrive too
>>> > >>>>>>>> late to be truly useful for certain high-stakes, real-time
>>> > >>>>>>>> scenarios. This makes them "useless or wrong" in a practical,
>>> > >>>>>>>> business-critical sense.
>>> > >>>>>>>>
>>> > >>>>>>>> For example, in *real-time fraud detection* and in
>>> > >>>>>>>> *high-frequency trading*, market data or trade execution
>>> > >>>>>>>> commands must be delivered with minimal latency. Even a slight
>>> > >>>>>>>> delay can mean missed opportunities or significant financial
>>> > >>>>>>>> losses, making a "correct" price update useless if it is not
>>> > >>>>>>>> instantaneous.
>>> > >>>>>>>> This proposal is about making Spark suitable for these
>>> > >>>>>>>> demanding use cases, where a "late but correct" answer is
>>> > >>>>>>>> simply not good enough. As a corollary, it is a fundamental
>>> > >>>>>>>> concept, so it has to be treated as such, not as a comment in
>>> > >>>>>>>> the SPIP.
>>> > >>>>>>>>
>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>> > >>>>>>>>
>>> > >>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>> Analysis | GDPR
>>> > >>>>>>>>
>>> > >>>>>>>> view my Linkedin profile
>>> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>
>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Hey Mich,
>>> > >>>>>>>>>
>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your
>>> > >>>>>>>>> definition here have to do with the SPIP? Perhaps add
>>> > >>>>>>>>> comments directly to the SPIP to provide context, as the code
>>> > >>>>>>>>> snippet below is a direct copy from the SPIP itself.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks,
>>> > >>>>>>>>> Denny
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> just to add
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> A stronger definition of real time. The engineering
>>> > >>>>>>>>>> definition of real time is roughly "fast enough to be
>>> > >>>>>>>>>> interactive".
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> However, I put forward a stronger definition. In a real-time
>>> > >>>>>>>>>> application, there is no such thing as an answer that is
>>> > >>>>>>>>>> late and correct. The timeliness is part of the application.
>>> > >>>>>>>>>> If I get the right answer too slowly, it becomes useless or
>>> > >>>>>>>>>> wrong.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>>>> Analysis | GDPR
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> view my Linkedin profile
>>> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If
>>> > >>>>>>>>>>> you are going to reduce micro-batching, this reduction must
>>> > >>>>>>>>>>> be balanced against the available processing capacity of
>>> > >>>>>>>>>>> the cluster to prevent back pressure and instability. In
>>> > >>>>>>>>>>> the case of Continuous Processing mode, one sets a specific
>>> > >>>>>>>>>>> continuous trigger with a desired checkpoint interval; the
>>> > >>>>>>>>>>> SPIP proposes the analogous control for the new mode.
>>> > >>>>>>>>>>> Quoting the SPIP:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> "
>>> > >>>>>>>>>>> df.writeStream
>>> > >>>>>>>>>>>   .format("...")
>>> > >>>>>>>>>>>   .option("...")
>>> > >>>>>>>>>>>   // new trigger type to enable real-time Mode
>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))
>>> > >>>>>>>>>>>   .start()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in
>>> > >>>>>>>>>>> the new ultra-low-latency execution mode. A time interval
>>> > >>>>>>>>>>> can also be specified, e.g. "300 Seconds", to indicate how
>>> > >>>>>>>>>>> long each micro-batch should run for.
>>> > >>>>>>>>>>> "
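>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> For concreteness, here is a fuller sketch of what such a
>>> > >>>>>>>>>>> query might look like end to end. This is hypothetical on
>>> > >>>>>>>>>>> my part: Trigger.RealTime is only the API proposed in the
>>> > >>>>>>>>>>> SPIP, not something in released Spark, and the broker
>>> > >>>>>>>>>>> address and topic names are made up:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>> > >>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> val spark = SparkSession.builder()
>>> > >>>>>>>>>>>   .appName("realtime-mode-sketch")
>>> > >>>>>>>>>>>   .getOrCreate()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> // Read a stream of trades from Kafka (made-up broker and
>>> > >>>>>>>>>>> // topic).
>>> > >>>>>>>>>>> val trades = spark.readStream
>>> > >>>>>>>>>>>   .format("kafka")
>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>> > >>>>>>>>>>>   .option("subscribe", "trades")
>>> > >>>>>>>>>>>   .load()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> // A map-like transformation; the point the SPIP makes is
>>> > >>>>>>>>>>> // that the query itself stays ordinary Structured
>>> > >>>>>>>>>>> // Streaming code.
>>> > >>>>>>>>>>> val alerts = trades
>>> > >>>>>>>>>>>   .selectExpr("CAST(value AS STRING) AS value")
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> alerts.writeStream
>>> > >>>>>>>>>>>   .format("kafka")
>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>> > >>>>>>>>>>>   .option("topic", "alerts")
>>> > >>>>>>>>>>>   .option("checkpointLocation", "/tmp/alerts-ckpt")
>>> > >>>>>>>>>>>   // The proposed trigger from the SPIP; only this line
>>> > >>>>>>>>>>>   // would change relative to a micro-batch query.
>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))
>>> > >>>>>>>>>>>   .start()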
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> The choice of such an interval will inevitably depend on
>>> > >>>>>>>>>>> many factors. Not that simple.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> HTH
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>>>>> Analysis | GDPR
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> view my Linkedin profile
>>> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> Hi all,
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>>> > >>>>>>>>>>>> "Real-Time Mode in Apache Spark Structured Streaming" that
>>> > >>>>>>>>>>>> I've been working on with Siying Dong, Indrajit Roy, Chao
>>> > >>>>>>>>>>>> Sun, Jungtaek Lim, and Michael Armbrust:
>>> > >>>>>>>>>>>> [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>> > >>>>>>>>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>]
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time
>>> > >>>>>>>>>>>> Mode" in Spark Structured Streaming that significantly
>>> > >>>>>>>>>>>> lowers end-to-end latency for processing streams of data.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our
>>> > >>>>>>>>>>>> goal is to make Spark capable of handling streaming jobs
>>> > >>>>>>>>>>>> that need results almost immediately (within O(100)
>>> > >>>>>>>>>>>> milliseconds). We want to achieve this without changing
>>> > >>>>>>>>>>>> the high-level DataFrame/Dataset API that users already
>>> > >>>>>>>>>>>> use, so existing streaming queries can run in this new
>>> > >>>>>>>>>>>> ultra-low-latency mode by simply turning it on, without
>>> > >>>>>>>>>>>> rewriting their logic.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time
>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>>> > >>>>>>>>>>>> personalization) that today cannot meet their latency
>>> > >>>>>>>>>>>> requirements with Spark's current streaming engine.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>> > >>>>>>>>>>>> suggestions on this approach!
>>> > >>>>>>
>>> > >>>>>> --
>>> > >>>>>> Best,
>>> > >>>>>> Yanbo
>>> > >>
>>> > >> --
>>> > >> John Zhuge
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org