Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

Mich Talebzadeh Thu, 29 May 2025 15:47:52 -0700

From what I have seen, most responses so far are +1s rather than
quantitative discussion (based on my observations only). Given the
objectives of the thread, we ought to focus on what is meant by real-time
mode compared to continuous mode. To be fair, this is a common point of
confusion: the terms are often used interchangeably in general
conversation, but in technical contexts, especially with streaming data
platforms, they have specific and important differences.


"Continuous Mode" refers to a processing strategy that aims for true,
uninterrupted, sub-millisecond latency processing. Chiefly:

   - Event-at-a-Time (or very small batch groups): The system processes
   individual events, or extremely small groups of events (micro-batches),
   as they flow through the pipeline.
   - Minimal Latency: The primary goal is to achieve the absolute lowest
   possible end-to-end latency, often in the order of milliseconds or even
   below.
   - Most business use cases (say financial markets) can live with this as
   they do not rely on rdges
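
As a rough illustration of the points above, the following Python sketch
compares the latency an event would see when it has to wait for a fixed
micro-batch boundary with the latency under event-at-a-time processing. It
is a toy model, not Spark code: the function names, the arrival times, the
100 ms batch interval and the 1 ms processing cost are all assumptions
chosen for illustration.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms, processing_ms):
    """Latency per event when events wait for the end of their batch window.

    Toy model (assumption): an event arriving at time t is picked up at the
    next batch boundary, i.e. the next multiple of batch_interval_ms, and its
    result is available processing_ms after that boundary.
    """
    latencies = []
    for t in arrivals_ms:
        # Next boundary strictly after arrival; an event landing exactly on
        # a boundary waits for the following one.
        boundary = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(boundary + processing_ms - t)
    return latencies


def event_at_a_time_latencies(arrivals_ms, processing_ms):
    """Latency per event when each event is processed as soon as it arrives.

    Toy model (assumption): no queueing, so latency is just the processing
    cost.
    """
    return [processing_ms for _ in arrivals_ms]


arrivals = [5, 40, 99, 150]  # hypothetical arrival times in milliseconds
batched = micro_batch_latencies(arrivals, batch_interval_ms=100, processing_ms=1)
streamed = event_at_a_time_latencies(arrivals, processing_ms=1)
```

With these assumed numbers the batched latencies are 96, 61, 2 and 51 ms,
dominated by the wait for the batch boundary, while the event-at-a-time
latency stays at 1 ms per event. A real engine adds scheduling, shuffle and
checkpoint costs that this model ignores.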

Now, what is meant by "Real-time Mode"?

This is where the nuance comes in. "Real-time" is a broader and sometimes
more subjective term. When the SPIP text introduces "Real-time Mode" as
distinct from "Continuous Mode," it suggests a specific implementation that
achieves real-time characteristics but might do so differently or more
robustly than a "continuous" mode attempt. Going back to my earlier point:
in a real-time application, there is no such thing as an answer that is
late and correct. Timeliness is part of the application. If I get the right
answer too slowly, it becomes useless or wrong. This is what I call the
"Late and Correct is Useless" Principle.
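
To make the principle concrete, here is a minimal Python sketch in which a
result only counts if it was produced within a deadline of the event that
triggered it. The Result class, the is_useful helper and the 100 ms
deadline are hypothetical names and values used for illustration only; they
are not part of Spark or of the SPIP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    value: float
    event_time_ms: int     # when the triggering event occurred
    produced_time_ms: int  # when the answer became available


def is_useful(result, deadline_ms):
    """Return True only if the answer arrived within deadline_ms of the event.

    A correct value that misses the deadline is treated as unusable, which is
    the "late and correct is useless" rule.
    """
    return result.produced_time_ms - result.event_time_ms <= deadline_ms


# Same (correct) value, different arrival times; the deadline is an assumption.
on_time = Result(value=101.5, event_time_ms=1_000, produced_time_ms=1_080)
late = Result(value=101.5, event_time_ms=1_000, produced_time_ms=1_400)

usable = [r for r in (on_time, late) if is_useful(r, deadline_ms=100)]
```

Both results carry the same value, but only the first survives the filter:
correctness alone does not make the second one usable.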

In summary, "Real-time Mode" seems to describe an approach that delivers
low-latency processing with high reliability and ease of use, leveraging
established, battle-tested components. I invite the audience to have a
discussion on this.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Thu, 29 May 2025 at 19:15, Yang Jie <[email protected]> wrote:

> +1
>
> On 2025/05/29 16:25:19 Xiao Li wrote:
> > +1
> >
> > Yuming Wang <[email protected]> wrote on Thu, May 29, 2025 at 02:22:
> >
> > > +1.
> > >
> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <[email protected]> wrote:
> > >
> > >> +1
> > >> Sent from my iPhone
> > >>
> > >> On May 29, 2025, at 12:15 AM, John Zhuge <[email protected]> wrote:
> > >>
> > >> 
> > >> +1 Nice feature
> > >>
> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <[email protected]>
> > >> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Kent Yao <[email protected]> wrote on Wed, May 28, 2025 at 19:31:
> > >>>
> > >>>> +1, LGTM.
> > >>>>
> > >>>> Kent
> > >>>>
> > >>>> On Thursday, May 29, 2025, Chao Sun <[email protected]> wrote:
> > >>>>
> > >>>>> +1. Super excited by this initiative!
> > >>>>>
> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
> [email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> +1
> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can
> > >>>>>>> eliminate the need for separate streaming engines, reducing
> > >>>>>>> system complexity and operational cost. Excited to see this
> > >>>>>>> direction!
> > >>>>>>>
> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> My point about "in real time application or data, there is nothing
> > >>>>>>>> as an answer which is supposed to be late and correct. The
> > >>>>>>>> timeliness is part of the application. if I get the right answer
> > >>>>>>>> too slowly it becomes useless or wrong" is actually fundamental to
> > >>>>>>>> *why* we need this Spark Structured Streaming proposal.
> > >>>>>>>>
> > >>>>>>>> The proposal is precisely about enabling Spark to power
> > >>>>>>>> applications where, as I define it, the *timeliness* of the answer
> > >>>>>>>> is as critical as its *correctness*. Spark's current streaming
> > >>>>>>>> engine, primarily operating on micro-batches, often delivers
> > >>>>>>>> results that are technically "correct" but arrive too late to be
> > >>>>>>>> truly useful for certain high-stakes, real-time scenarios. This
> > >>>>>>>> makes them "useless or wrong" in a practical, business-critical
> > >>>>>>>> sense.
> > >>>>>>>>
> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency
> > >>>>>>>> trading*, market data or trade execution commands must be
> > >>>>>>>> delivered with minimal latency. Even a slight delay can mean missed
> > >>>>>>>> opportunities or significant financial losses, making a "correct"
> > >>>>>>>> price update useless if it's not instantaneous. able for these
> > >>>>>>>> demanding use cases, where a "late but correct" answer is simply
> > >>>>>>>> not good enough. As a corollary, it is a fundamental concept, so it
> > >>>>>>>> has to be treated as such in the SPIP, not as a comment.
> > >>>>>>>>
> > >>>>>>>> Hope this clarifies the connection in practical terms
> > >>>>>>>> Dr Mich Talebzadeh,
> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
> > >>>>>>>> GDPR
> > >>>>>>>>
> > >>>>>>>>    view my Linkedin profile
> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <[email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hey Mich,
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I may be missing something here but what does your
> > >>>>>>>>> definition here have to do with the SPIP? Perhaps add comments
> > >>>>>>>>> directly to the SPIP to provide context as the code snippet
> > >>>>>>>>> below is a direct copy from the SPIP itself.
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Denny
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
> > >>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> just to add
> > >>>>>>>>>>
> > >>>>>>>>>> A stronger definition of real time. The engineering definition
> > >>>>>>>>>> of real time is roughly fast enough to be interactive.
> > >>>>>>>>>>
> > >>>>>>>>>> However, I put a stronger definition. In real time application
> > >>>>>>>>>> or data, there is nothing as an answer which is supposed to be
> > >>>>>>>>>> late and correct. The timeliness is part of the application. If
> > >>>>>>>>>> I get the right answer too slowly it becomes useless or wrong.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Dr Mich Talebzadeh,
> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
> Analysis |
> > >>>>>>>>>> GDPR
> > >>>>>>>>>>
> > >>>>>>>>>>    view my Linkedin profile
> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
> > >>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you
> > >>>>>>>>>>> are going to reduce micro-batching, this reduction must be
> > >>>>>>>>>>> balanced against the available processing capacity of the
> > >>>>>>>>>>> cluster to prevent back pressure and instability. In the case of
> > >>>>>>>>>>> Continuous Processing mode, a specific continuous trigger with a
> > >>>>>>>>>>> desired checkpoint interval, quote
> > >>>>>>>>>>>
> > >>>>>>>>>>> "
> > >>>>>>>>>>> df.writeStream
> > >>>>>>>>>>>    .format("...")
> > >>>>>>>>>>>    .option("...")
> > >>>>>>>>>>>    // new trigger type to enable real-time Mode
> > >>>>>>>>>>>    .trigger(Trigger.RealTime("300 Seconds"))
> > >>>>>>>>>>>    .start()
> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the
> > >>>>>>>>>>> new ultra low-latency execution mode. A time interval can also
> > >>>>>>>>>>> be specified, e.g. "300 Seconds", to indicate how long each
> > >>>>>>>>>>> micro-batch should run for.
> > >>>>>>>>>>> "
> > >>>>>>>>>>>
> > >>>>>>>>>>> will inevitably depend on many factors. Not that simple
> > >>>>>>>>>>> HTH
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dr Mich Talebzadeh,
> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
> Analysis |
> > >>>>>>>>>>> GDPR
> > >>>>>>>>>>>
> > >>>>>>>>>>>    view my Linkedin profile
> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
> > >>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled
> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that I've
> > >>>>>>>>>>>> been working on with Siying Dong, Indrajit Roy, Chao Sun,
> > >>>>>>>>>>>> Jungtaek Lim, and Michael Armbrust: [JIRA
> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
> > >>>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode”
> > >>>>>>>>>>>> in Spark Structured Streaming that significantly lowers
> > >>>>>>>>>>>> end-to-end latency for processing streams of data.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is
> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs that need
> > >>>>>>>>>>>> results almost immediately (within O(100) milliseconds). We
> > >>>>>>>>>>>> want to achieve this without changing the high-level
> > >>>>>>>>>>>> DataFrame/Dataset API that users already use – so existing
> > >>>>>>>>>>>> streaming queries can run in this new ultra-low-latency mode by
> > >>>>>>>>>>>> simply turning it on, without rewriting their logic.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time
> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
> > >>>>>>>>>>>> personalization) that today cannot meet their latency
> > >>>>>>>>>>>> requirements with Spark’s current streaming engine.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
> > >>>>>>>>>>>> suggestions on this approach!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Best,
> > >>>>>> Yanbo
> > >>>>>>
> > >>>>>
> > >>
> > >> --
> > >> John Zhuge
> > >>
> > >>
> >
>
