Clarifying what is meant by "real-time" and explicitly differentiating it from actual real-time computing should be a bare minimum. I still don't like the use of marketing-speak "real-time" that isn't really real-time in engineering documents or API namespaces.
On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> Mark,
>
> I thought we were simply discussing the naming of the mode? Like I mentioned, if you think simply calling this mode "real-time" mode may cause confusion because "real-time" can mean other things in other fields, I can clarify what we mean by "real-time" explicitly in the SPIP document and any future documentation. That is not a problem, and thank you for your feedback.
>
> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com> wrote:
>
>> Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.
>>
>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> As an example of my point, if you go to the website of Apache Storm (another stream processing engine):
>>>
>>> https://storm.apache.org/
>>>
>>> It describes Storm as:
>>>
>>> "Apache Storm is a free and open source distributed *realtime* computation system"
>>>
>>> If you go to Apache Flink:
>>>
>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>
>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>
>>> Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.
>>>
>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> If I understood your last email correctly, I think you also wanted to have a discussion about naming: why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described in this space as "real-time" implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in real-time mode means and what it guarantees. Just my thoughts; I would love to hear other perspectives.
>>>>
>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real-time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>>>
>>>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond latency processing. Chiefly (a sketch of the two existing trigger styles follows this list):
>>>>>
>>>>> - Event-at-a-Time (or very small batch groups): the system processes individual events, or extremely small groups of events (micro-batches), as they flow through the pipeline.
>>>>> - Minimal Latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>>>> - Most business use cases (say, financial markets) can live with this, as they do not rely on edge cases.
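For concreteness, here is a minimal sketch of the two existing trigger styles contrasted above, using Spark's current public trigger API (Trigger.ProcessingTime for micro-batch execution, Trigger.Continuous for the experimental continuous mode). The rate source, console sink, and one-second intervals are illustrative placeholders only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object TriggerModesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("trigger-modes-sketch")
          .master("local[2]")
          .getOrCreate()

        // Toy source emitting (timestamp, value) rows at a fixed rate.
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()

        // Micro-batch execution: a new batch is planned roughly every second,
        // so end-to-end latency is bounded below by the batch interval.
        val microBatch = events.writeStream
          .format("console")
          .trigger(Trigger.ProcessingTime("1 second"))
          .start()

        // Continuous processing (experimental in Spark since 2.3): long-running
        // tasks handle records as they arrive, and the interval is only the
        // checkpoint frequency, not a batch boundary. Only map-like operations
        // (no aggregations) are supported, hence it is left commented out here.
        // val continuous = events.writeStream
        //   .format("console")
        //   .trigger(Trigger.Continuous("1 second"))
        //   .start()

        microBatch.awaitTermination()
      }
    }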
>>>>> Now, what is meant by "Real-time Mode"?
>>>>>
>>>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode", it suggests a specific implementation that achieves real-time characteristics but might do so differently, or more robustly, than a "continuous" mode attempt. Going back to my earlier mention: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>>>
>>>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>>> > +1
>>>>>> >
>>>>>> > On Thu, May 29, 2025 at 02:22, Yuming Wang <yumw...@apache.org> wrote:
>>>>>> >
>>>>>> > > +1.
>>>>>> > >
>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>> > >
>>>>>> > >> +1
>>>>>> > >> Sent from my iPhone
>>>>>> > >>
>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>>>> > >>
>>>>>> > >> +1 Nice feature
>>>>>> > >>
>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>>>> > >>
>>>>>> > >>> +1
>>>>>> > >>>
>>>>>> > >>> On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
>>>>>> > >>>
>>>>>> > >>>> +1, LGTM.
>>>>>> > >>>>
>>>>>> > >>>> Kent
>>>>>> > >>>>
>>>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>>> > >>>>
>>>>>> > >>>>> +1. Super excited by this initiative!
>>>>>> > >>>>>
>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>>> > >>>>>
>>>>>> > >>>>>> +1
>>>>>> > >>>>>>
>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>> > >>>>>>
>>>>>> > >>>>>>> +1
>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>> Hi,
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal aims to make Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, it is a fundamental concept, so it has to be treated as such, not as a comment in the SPIP.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>> Hey Mich,
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Thanks,
>>>>>> > >>>>>>>>> Denny
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>>> Just to add: a stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> However, I put a stronger definition. In a real-time application (or with real-time data), there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
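The "Late and Correct is Useless" principle above can be made concrete with a small, hedged sketch: a result that exceeds its latency budget is dropped rather than emitted late. The 0.5-second budget, the rate source, and the use of the source timestamp as event time are illustrative assumptions, not anything specified in the SPIP:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, current_timestamp}

    object DeadlineFilterSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("deadline-filter-sketch")
          .master("local[2]")
          .getOrCreate()

        // Toy stream; the rate source's schema is (timestamp: Timestamp, value: Long).
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "100")
          .load()

        // Casting a timestamp to double yields epoch seconds with a fractional
        // part, so the difference below is each event's age in seconds at
        // processing time. Rows older than the assumed 0.5 s budget are
        // discarded as "useless", however correct their payload may be.
        val fresh = events.where(
          current_timestamp().cast("double") - col("timestamp").cast("double") <= 0.5
        )

        fresh.writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }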
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval is specified. To quote the SPIP:
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> "
>>>>>> > >>>>>>>>>>> df.writeStream
>>>>>> > >>>>>>>>>>>   .format("...")
>>>>>> > >>>>>>>>>>>   .option("...")
>>>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>> > >>>>>>>>>>>   .start()
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>> > >>>>>>>>>>> "
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> The right interval will inevitably depend on many factors. It is not that simple.
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> HTH
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
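Read together with the SPIP snippet quoted above, the compatibility claim can be sketched as follows. Trigger.RealTime is the SPIP's proposed API and does not exist in any released Spark version, so this sketch will not compile against current Spark; the Kafka broker, topic, and checkpoint path are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object RealTimeModeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("real-time-mode-sketch")
          .getOrCreate()

        // Ordinary streaming source; nothing here is specific to the new mode.
        val orders = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "orders")                    // placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS order_json")

        // Per the SPIP, only the trigger changes: swapping, say,
        // Trigger.ProcessingTime("1 second") for the proposed Trigger.RealTime
        // opts the otherwise unchanged query into the ultra-low-latency mode.
        orders.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/real-time-sketch") // placeholder
          .trigger(Trigger.RealTime("300 seconds")) // proposed API, per the SPIP
          .start()
          .awaitTermination()
      }
    }

The point of the sketch is that the source, transformations, and sink remain ordinary Structured Streaming code; only the trigger line changes.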
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> Hi all,
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!