Hi Jerry,

In essence, these definitions (hard or soft) help clarify that "real-time"
is not a single, monolithic concept, but rather a spectrum defined by the
criticality of timeliness and the systems under consideration. Common data
processing solutions branded as "real-time" typically operate at the softer
end of this spectrum, providing performance that meets the needs of the
applications under consideration (for example, within SLAs), where delays
are undesirable but not show-stoppers.

I therefore suggest the SPIP should mention this explicitly, so we can
move on.
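To make the distinction concrete, here is a minimal sketch (illustrative only; the names are mine, not from the SPIP or Spark) of how the two ends of the spectrum treat a missed deadline: a soft real-time system still delivers a late result, merely degraded, while a hard real-time system treats lateness as failure.

```scala
// Illustrative sketch only: models the soft vs hard real-time distinction.
object RealTimeSpectrum {
  sealed trait RealTimeKind
  case object SoftRealTime extends RealTimeKind // late result is degraded but usable
  case object HardRealTime extends RealTimeKind // late result is a failure

  final case class Timed[A](value: A, latencyMs: Long)

  // Returns Some(value) if the result is usable under the given deadline policy.
  def usable[A](r: Timed[A], deadlineMs: Long, kind: RealTimeKind): Option[A] =
    if (r.latencyMs <= deadlineMs) Some(r.value) // on time: always usable
    else kind match {
      case SoftRealTime => Some(r.value) // missed deadline: lower quality, still delivered
      case HardRealTime => None          // missed deadline: total failure
    }
}
```

A streaming SLA, on the softer end of the spectrum, corresponds to SoftRealTime here: a result arriving 50 ms past its deadline is still emitted, just late. Under HardRealTime (a pacemaker, a flight controller), the same result would be discarded as wrong.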

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Fri, 30 May 2025 at 07:57, Jerry Peng <jerry.boyang.p...@gmail.com>
wrote:

> Mark,
>
> For real-time systems there is a concept of "soft" real-time and "hard"
> real-time systems.  These concepts exist in textbooks.  Here is a document
> by intel that explains it:
>
>
> https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html
>
> "In a soft real-time system, computers or equipment will continue to
> function after a missed deadline but may produce a lower-quality output.
> For example, latency in online video games can impact player interactions,
> but otherwise present no serious consequences."
>
> "Hard real-time systems have zero delay tolerance, and delayed signals can
> result in total failure or present immediate danger to users. Flight
> control systems and pacemakers are both examples where timeliness is not
> only essential but the lack of it can result in a life-or-death situation."
>
> I don't think it is inaccurate or misleading to call this mode real-time.
> It is soft real-time.
>
> On Thu, May 29, 2025 at 11:44 PM Mark Hamstra <markhams...@gmail.com>
> wrote:
>
>> Clarifying what is meant by "real-time" and explicitly differentiating it
>> from actual real-time computing should be a bare minimum. I still don't
>> like the use of marketing-speak "real-time" that isn't really real-time in
>> engineering documents or API namespaces.
>>
>> On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com>
>> wrote:
>>
>>> Mark,
>>>
>>> I thought we were simply discussing the naming of the mode?  Like I
>>> mentioned, if you think simply calling this mode "real-time" mode may cause
>>> confusion because "real-time" can mean other things in other fields, I can
>>> clarify what we mean by "real-time" explicitly in the SPIP document and any
>>> future documentation. That is not a problem and thank you for your feedback.
>>>
>>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com>
>>> wrote:
>>>
>>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is an
>>>> engineering document, not a marketing document. Technical clarity and
>>>> accuracy should be non-negotiable.
>>>>
>>>>
>>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <
>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> As an example of my point, if you go to the Apache Storm (another
>>>>> stream processing engine) website:
>>>>>
>>>>> https://storm.apache.org/
>>>>>
>>>>> It describes Storm as:
>>>>>
>>>>> "Apache Storm is a free and open source distributed *realtime*
>>>>> computation system"
>>>>>
>>>>> If you go to Apache Flink:
>>>>>
>>>>>
>>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>>>
>>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>>>
>>>>> Thus, what the term "real-time" implies in this space should not be
>>>>> confusing for folks in this area.
>>>>>
>>>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <
>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> If I understood your last email correctly, I think you also wanted to
>>>>>> have a discussion about naming?  Why are we calling this new execution 
>>>>>> mode
>>>>>> described in the SPIP "Real-time Mode"?  Here are my two cents.  Firstly,
>>>>>> "continuous mode" is taken and we want another name to describe an
>>>>>> execution mode that provides ultra low latency processing.  We could have
>>>>>> called it "low latency mode", though I don't really like that naming 
>>>>>> since
>>>>>> it implies the other execution modes are not low latency which I don't
>>>>>> believe is true.  This new proposed mode can simply deliver even lower
>>>>>> latency.  Thus, we came up with the name "Real-time Mode".  Of course, we
>>>>>> are talking about "soft" real-time here.  I think when we are talking 
>>>>>> about
>>>>>> distributed stream processing systems in the space of big data analytics,
>>>>>> it is reasonable to assume anything described in this space as 
>>>>>> "real-time"
>>>>>> implies "soft" real-time.  Though if this is confusing or misleading, we
>>>>>> can provide clear documentation on what "real-time" in real-time mode 
>>>>>> means
>>>>>> and what it guarantees.  Just my thoughts.  I would love to hear other
>>>>>> perspectives.
>>>>>>
>>>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> I think from what I have seen there are a good number of +1
>>>>>>> responses as opposed to quantitative discussions (based on my 
>>>>>>> observations
>>>>>>> only). Given the objectives of the thread, we ought to focus on what is
>>>>>>> meant by real time compared to continuous modes. To be fair, it is a
>>>>>>> common point of confusion, and the terms are often used interchangeably 
>>>>>>> in
>>>>>>> general conversation, but in technical contexts, especially with 
>>>>>>> streaming
>>>>>>> data platforms, they have specific and important differences.
>>>>>>>
>>>>>>> "Continuous Mode" refers to a processing strategy that aims for
>>>>>>> true, uninterrupted, sub-millisecond latency processing. Chiefly:
>>>>>>>
>>>>>>>    - Event-at-a-time (or very small batch groups): the system
>>>>>>>    processes individual events or extremely small groups of events
>>>>>>>    (micro-batches) as they flow through the pipeline.
>>>>>>>    - Minimal latency: the primary goal is to achieve the absolute
>>>>>>>    lowest possible end-to-end latency, often on the order of
>>>>>>>    milliseconds or even below.
>>>>>>>    - Most business use cases (say, financial markets) can live with
>>>>>>>    this, as they do not rely on edges.
>>>>>>>
>>>>>>> Now, what is meant by "Real-time Mode"?
>>>>>>>
>>>>>>> This is where the nuance comes in. "Real-time" is a broader and
>>>>>>> sometimes more subjective term. When the text introduces "Real-time 
>>>>>>> Mode"
>>>>>>> as distinct from "Continuous Mode," it suggests a specific 
>>>>>>> implementation
>>>>>>> that achieves real-time characteristics but might do so differently or 
>>>>>>> more
>>>>>>> robustly than a "continuous" mode attempt. Going back to my earlier
>>>>>>> point: in a real-time application, there is no such thing as an answer
>>>>>>> that is late and correct. Timeliness is part of the application; if I
>>>>>>> get the right answer too slowly, it becomes useless or wrong. This is
>>>>>>> what I call the "Late and Correct is Useless" principle.
>>>>>>>
>>>>>>> In summary, "Real-time Mode" seems to describe an approach that
>>>>>>> delivers low-latency processing with high reliability and ease of use,
>>>>>>> leveraging established, battle-tested components. I invite the audience
>>>>>>> to have a discussion on this.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>>>>> > +1
>>>>>>>> >
>>>>>>>> > Yuming Wang <yumw...@apache.org> wrote on Thursday, May 29, 2025 at 02:22:
>>>>>>>> >
>>>>>>>> > > +1.
>>>>>>>> > >
>>>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com>
>>>>>>>> wrote:
>>>>>>>> > >
>>>>>>>> > >> +1
>>>>>>>> > >> Sent from my iPhone
>>>>>>>> > >>
>>>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org>
>>>>>>>> wrote:
>>>>>>>> > >>
>>>>>>>> > >> 
>>>>>>>> > >> +1 Nice feature
>>>>>>>> > >>
>>>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <
>>>>>>>> xyliyuanj...@gmail.com>
>>>>>>>> > >> wrote:
>>>>>>>> > >>
>>>>>>>> > >>> +1
>>>>>>>> > >>>
>>>>>>>> > >>> Kent Yao <y...@apache.org> wrote on Wednesday, May 28, 2025 at 19:31:
>>>>>>>> > >>>
>>>>>>>> > >>>> +1, LGTM.
>>>>>>>> > >>>>
>>>>>>>> > >>>> Kent
>>>>>>>> > >>>>
>>>>>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>>>>> > >>>>
>>>>>>>> > >>>>> +1. Super excited by this initiative!
>>>>>>>> > >>>>>
>>>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <
>>>>>>>> yblia...@gmail.com>
>>>>>>>> > >>>>> wrote:
>>>>>>>> > >>>>>
>>>>>>>> > >>>>>> +1
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
>>>>>>>> huaxin.ga...@gmail.com>
>>>>>>>> > >>>>>> wrote:
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>>> +1
>>>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we
>>>>>>>> can
>>>>>>>> > >>>>>>> eliminate the need for separate streaming engines,
>>>>>>>> reducing system
>>>>>>>> > >>>>>>> complexity and operational cost. Excited to see this
>>>>>>>> direction!
>>>>>>>> > >>>>>>>
>>>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>>>>>>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>> > >>>>>>>
>>>>>>>> > >>>>>>>> Hi,
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>> My point about "in real time application or data, there
>>>>>>>> is nothing
>>>>>>>> > >>>>>>>> as an answer which is supposed to be late and correct.
>>>>>>>> The timeliness is
>>>>>>>> > >>>>>>>> part of the application. if I get the right answer too
>>>>>>>> slowly it becomes
>>>>>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we
>>>>>>>> need this
>>>>>>>> > >>>>>>>> Spark Structured Streaming proposal.
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>>>>>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of
>>>>>>>> the answer
>>>>>>>> > >>>>>>>> is as critical as its *correctness*. Spark's current
>>>>>>>> streaming
>>>>>>>> > >>>>>>>> engine, primarily operating on micro-batches, often
>>>>>>>> delivers results that
>>>>>>>> > >>>>>>>> are technically "correct" but arrive too late to be
>>>>>>>> truly useful for
>>>>>>>> > >>>>>>>> certain high-stakes, real-time scenarios. This makes
>>>>>>>> them "useless or
>>>>>>>> > >>>>>>>> wrong" in a practical, business-critical sense.
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in
>>>>>>>> > >>>>>>>> *high-frequency trading*, market data or trade execution
>>>>>>>> > >>>>>>>> commands must be delivered with minimal latency. Even a
>>>>>>>> > >>>>>>>> slight delay can mean missed opportunities or significant
>>>>>>>> > >>>>>>>> financial losses, making a "correct" price update useless
>>>>>>>> > >>>>>>>> if it is not instantaneous. This proposal would make Spark
>>>>>>>> > >>>>>>>> suitable for these demanding use cases, where a "late but
>>>>>>>> > >>>>>>>> correct" answer is simply not good enough. As a corollary,
>>>>>>>> > >>>>>>>> this is a fundamental concept, so it has to be treated as
>>>>>>>> > >>>>>>>> such in the SPIP, not as a comment.
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms
>>>>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>>>>>>> Analysis |
>>>>>>>> > >>>>>>>> GDPR
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>    view my Linkedin profile
>>>>>>>> > >>>>>>>> <
>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <
>>>>>>>> denny.g....@gmail.com>
>>>>>>>> > >>>>>>>> wrote:
>>>>>>>> > >>>>>>>>
>>>>>>>> > >>>>>>>>> Hey Mich,
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>> Sorry, I may be missing something here but what does
>>>>>>>> your
>>>>>>>> > >>>>>>>>> definition here have to do with the SPIP?   Perhaps add
>>>>>>>> comments directly
>>>>>>>> > >>>>>>>>> to the SPIP to provide context as the code snippet
>>>>>>>> below is a direct copy
>>>>>>>> > >>>>>>>>> from the SPIP itself.
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>> Thanks,
>>>>>>>> > >>>>>>>>> Denny
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>> > >>>>>>>>>
>>>>>>>> > >>>>>>>>>> just to add
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>> A stronger definition of real time: the engineering
>>>>>>>> > >>>>>>>>>> definition of real time is roughly "fast enough to be
>>>>>>>> > >>>>>>>>>> interactive".
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>> However, I put forward a stronger definition: in a
>>>>>>>> > >>>>>>>>>> real-time application, there is no such thing as an
>>>>>>>> > >>>>>>>>>> answer that is late and correct. Timeliness is part of
>>>>>>>> > >>>>>>>>>> the application; if I get the right answer too slowly,
>>>>>>>> > >>>>>>>>>> it becomes useless or wrong.
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>>>>>>> Analysis |
>>>>>>>> > >>>>>>>>>> GDPR
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>    view my Linkedin profile
>>>>>>>> > >>>>>>>>>> <
>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>> > >>>>>>>>>>
>>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from
>>>>>>>> micro-batching. If you
>>>>>>>> > >>>>>>>>>>> are going to reduce micro-batching, this reduction
>>>>>>>> must be balanced against
>>>>>>>> > >>>>>>>>>>> the available processing capacity of the cluster to
>>>>>>>> prevent back pressure
>>>>>>>> > >>>>>>>>>>> and instability. In the case of Continuous Processing
>>>>>>>> mode, a
>>>>>>>> > >>>>>>>>>>> specific continuous trigger with a desired checkpoint
>>>>>>>> interval is used; quoting the SPIP:
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>> "
>>>>>>>> > >>>>>>>>>>> df.writeStream
>>>>>>>> > >>>>>>>>>>>    .format("...")
>>>>>>>> > >>>>>>>>>>>    .option("...")
>>>>>>>> > >>>>>>>>>>>    .trigger(Trigger.RealTime("300 Seconds"))    //
>>>>>>>> new trigger
>>>>>>>> > >>>>>>>>>>> type to enable real-time Mode
>>>>>>>> > >>>>>>>>>>>    .start()
>>>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should
>>>>>>>> run in the
>>>>>>>> > >>>>>>>>>>> new ultra low-latency execution mode.  A time
>>>>>>>> interval can also be
>>>>>>>> > >>>>>>>>>>> specified, e.g. “300 Seconds”, to indicate how long
>>>>>>>> each micro-batch should
>>>>>>>> > >>>>>>>>>>> run for.
>>>>>>>> > >>>>>>>>>>> "
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors; it is not that
>>>>>>>> simple.
>>>>>>>> > >>>>>>>>>>> HTH
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>>>>>>> Analysis |
>>>>>>>> > >>>>>>>>>>> GDPR
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>    view my Linkedin profile
>>>>>>>> > >>>>>>>>>>> <
>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>>> > >>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> Hi all,
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP
>>>>>>>> titled
>>>>>>>> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured
>>>>>>>> Streaming” that I've been
>>>>>>>> > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun,
>>>>>>>> Jungtaek Lim, and
>>>>>>>> > >>>>>>>>>>>> Michael Armbrust: [JIRA
>>>>>>>> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>>>>> [Doc
>>>>>>>> > >>>>>>>>>>>> <
>>>>>>>> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing
>>>>>>>> >
>>>>>>>> > >>>>>>>>>>>> ].
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called
>>>>>>>> “Real-time Mode”
>>>>>>>> > >>>>>>>>>>>> in Spark Structured Streaming that significantly
>>>>>>>> lowers end-to-end latency
>>>>>>>> > >>>>>>>>>>>> for processing streams of data.
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility.
>>>>>>>> Our goal is
>>>>>>>> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs
>>>>>>>> that need results almost
>>>>>>>> > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want to
>>>>>>>> achieve this without
>>>>>>>> > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that
>>>>>>>> users already use – so
>>>>>>>> > >>>>>>>>>>>> existing streaming queries can run in this new
>>>>>>>> ultra-low-latency mode by
>>>>>>>> > >>>>>>>>>>>> simply turning it on, without rewriting their logic.
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power
>>>>>>>> real-time
>>>>>>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>>>>>>>> personalization) that
>>>>>>>> > >>>>>>>>>>>> today cannot meet their latency requirements with
>>>>>>>> Spark’s current streaming
>>>>>>>> > >>>>>>>>>>>> engine.
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>>>>>>> > >>>>>>>>>>>> suggestions on this approach!
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>>>>>>>
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>> --
>>>>>>>> > >>>>>> Best,
>>>>>>>> > >>>>>> Yanbo
>>>>>>>> > >>>>>>
>>>>>>>> > >>>>>
>>>>>>>> > >>
>>>>>>>> > >> --
>>>>>>>> > >> John Zhuge
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
