Mark, I thought we were simply discussing the naming of the mode? As I mentioned, if you think simply calling this mode "real-time" mode may cause confusion because "real-time" can mean other things in other fields, I can clarify explicitly what we mean by "real-time" in the SPIP document and any future documentation. That is not a problem, and thank you for your feedback.
On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com> wrote:

Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.

On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Mark,

As an example of my point, if you go to the Apache Storm (another stream processing engine) website:

https://storm.apache.org/

It describes Storm as:

"Apache Storm is a free and open source distributed *realtime* computation system"

If you go to the Apache Flink website:

https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/

"Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"

Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.

On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Mich,

If I understood your last email correctly, I think you also wanted to have a discussion about naming? Why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described as "real-time" in this space implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in real-time mode means and what it guarantees. Just my thoughts. I would love to hear other perspectives.

On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real-time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.

"Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond-latency processing. Chiefly (a short sketch follows the list):

- Event-at-a-time (or very small batch groups): the system processes individual events or extremely small groups of events (micro-batches) as they flow through the pipeline.
- Minimal latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
- Most business use cases (say, financial markets) can live with this, as they do not rely on hard real-time deadlines.
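For concreteness, here is a minimal sketch of how these two existing execution styles are selected through Spark's public trigger API; the rate source and console sink are stand-ins for a real pipeline, not part of the SPIP:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("trigger-sketch").getOrCreate()

    // Built-in test source that emits one row per second.
    val events = spark.readStream.format("rate").load()

    // Micro-batch execution: accumulate input for one second, then process
    // that slice of data as a batch.
    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()

    // Continuous processing (experimental since Spark 2.3): records are
    // processed as they arrive; the interval only controls how often
    // checkpoints are written, not how input is grouped.
    events.writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

Note that continuous processing supports only map-like operations and a subset of sources and sinks, which is arguably part of the context for proposing a new mode rather than extending it.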
"Real-time" is a broader and >>>> sometimes more subjective term. When the text introduces "Real-time Mode" >>>> as distinct from "Continuous Mode," it suggests a specific implementation >>>> that achieves real-time characteristics but might do so differently or more >>>> robustly than a "continuous" mode attempt. Going back to my earlier >>>> mention, in real time application , there is nothing as an answer which is >>>> supposed to be late and correct. The timeliness is part of the application. >>>> if I get the right answer too slowly it becomes useless or wrong. What I >>>> call the "Late and Correct is Useless" Principle >>>> >>>> In summary, "Real-time Mode" seems to describe an approach that >>>> delivers low-latency processing with high reliability and ease of use, >>>> leveraging established, battle-tested components.I invite the audience to >>>> have a discussion on this. >>>> >>>> HTH >>>> >>>> Dr Mich Talebzadeh, >>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>> >>>> view my Linkedin profile >>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>> >>>> >>>> >>>> >>>> >>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote: >>>> >>>>> +1 >>>>> >>>>> On 2025/05/29 16:25:19 Xiao Li wrote: >>>>> > +1 >>>>> > >>>>> > Yuming Wang <yumw...@apache.org> 于2025年5月29日周四 02:22写道: >>>>> > >>>>> > > +1. >>>>> > > >>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote: >>>>> > > >>>>> > >> +1 >>>>> > >> Sent from my iPhone >>>>> > >> >>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> >>>>> wrote: >>>>> > >> >>>>> > >> >>>>> > >> +1 Nice feature >>>>> > >> >>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li < >>>>> xyliyuanj...@gmail.com> >>>>> > >> wrote: >>>>> > >> >>>>> > >>> +1 >>>>> > >>> >>>>> > >>> Kent Yao <y...@apache.org> 于2025年5月28日周三 19:31写道: >>>>> > >>> >>>>> > >>>> +1, LGTM. >>>>> > >>>> >>>>> > >>>> Kent >>>>> > >>>> >>>>> > >>>> 在 2025年5月29日星期四,Chao Sun <sunc...@apache.org> 写道: >>>>> > >>>> >>>>> > >>>>> +1. Super excited by this initiative! >>>>> > >>>>> >>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang < >>>>> yblia...@gmail.com> >>>>> > >>>>> wrote: >>>>> > >>>>> >>>>> > >>>>>> +1 >>>>> > >>>>>> >>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao < >>>>> huaxin.ga...@gmail.com> >>>>> > >>>>>> wrote: >>>>> > >>>>>> >>>>> > >>>>>>> +1 >>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can >>>>> > >>>>>>> eliminate the need for separate streaming engines, reducing >>>>> system >>>>> > >>>>>>> complexity and operational cost. Excited to see this >>>>> direction! >>>>> > >>>>>>> >>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh < >>>>> > >>>>>>> mich.talebza...@gmail.com> wrote: >>>>> > >>>>>>> >>>>> > >>>>>>>> Hi, >>>>> > >>>>>>>> >>>>> > >>>>>>>> My point about "in real time application or data, there is >>>>> nothing >>>>> > >>>>>>>> as an answer which is supposed to be late and correct. The >>>>> timeliness is >>>>> > >>>>>>>> part of the application. if I get the right answer too >>>>> slowly it becomes >>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we need >>>>> this >>>>> > >>>>>>>> Spark Structured Streaming proposal. >>>>> > >>>>>>>> >>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power >>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of the >>>>> answer >>>>> > >>>>>>>> is as critical as its *correctness*. 
Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.

For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not near-instantaneous. The proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.

Hope this clarifies the connection in practical terms.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

Hey Mich,

Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.

Thanks,
Denny

On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Just to add:

A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".

However, I would put a stronger definition. In a real-time application or data, there is no such thing as an answer that is supposed to be late and correct.
Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The current limitations in Spark Structured Streaming come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, the choice of a specific continuous trigger with a desired checkpoint interval, quote:

    df.writeStream
      .format("...")
      .option("...")
      .trigger(Trigger.RealTime("300 Seconds"))  // new trigger type to enable real-time mode
      .start()

    This Trigger.RealTime signals that the query should run in the new
    ultra-low-latency execution mode. A time interval can also be specified,
    e.g. "300 Seconds", to indicate how long each micro-batch should run for.

will inevitably depend on many factors. Not that simple.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
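Mich's capacity point can be made concrete with an existing knob: however small the trigger interval, ingest per micro-batch can be capped so the cluster keeps up. This is only a sketch, not part of the SPIP; `maxOffsetsPerTrigger` is an existing option of Spark's Kafka source, while the broker address, topic name, and active SparkSession `spark` are placeholders:

    // Cap how many Kafka records each micro-batch may read, trading a little
    // latency for stability under load. Broker and topic are placeholders.
    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "trades")
      .option("maxOffsetsPerTrigger", "10000")
      .load()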
On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Hi all,

I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].

The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.

In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.

We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org