Mark, As an example of my point, if you go to the Apache Storm (another stream processing engine) website:
https://storm.apache.org/

It describes Storm as: "Apache Storm is a free and open source distributed *realtime* computation system". If you go to Apache Flink: https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/ "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing". Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.

On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> Mich,
>
> If I understood your last email correctly, I think you also wanted to have
> a discussion about naming? Why are we calling this new execution mode
> described in the SPIP "Real-time Mode"? Here are my two cents. Firstly,
> "continuous mode" is taken, and we want another name to describe an
> execution mode that provides ultra-low-latency processing. We could have
> called it "low latency mode", though I don't really like that naming,
> since it implies the other execution modes are not low latency, which I
> don't believe is true. This new proposed mode can simply deliver even
> lower latency. Thus, we came up with the name "Real-time Mode". Of course,
> we are talking about "soft" real-time here. I think when we are talking
> about distributed stream processing systems in the space of big data
> analytics, it is reasonable to assume that anything described in this
> space as "real-time" implies "soft" real-time. Though if this is confusing
> or misleading, we can provide clear documentation on what "real-time" in
> Real-time Mode means and what it guarantees. Just my thoughts. I would
> love to hear other perspectives.
>
> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I think from what I have seen there are a good number of +1 responses, as
>> opposed to quantitative discussions (based on my observations only).
>> Given the objectives of the thread, we ought to focus on what is meant by
>> real time compared to continuous modes. To be fair, it is a common point
>> of confusion, and the terms are often used interchangeably in general
>> conversation, but in technical contexts, especially with streaming data
>> platforms, they have specific and important differences.
>>
>> "Continuous Mode" refers to a processing strategy that aims for true,
>> uninterrupted, sub-millisecond latency processing (see the trigger sketch
>> further below for how this is enabled in Spark today). Chiefly:
>>
>> - Event-at-a-time (or very small batch groups): the system processes
>>   individual events, or extremely small groups of events (micro-batches),
>>   as they flow through the pipeline.
>> - Minimal latency: the primary goal is to achieve the absolute lowest
>>   possible end-to-end latency, often on the order of milliseconds or even
>>   below.
>> - Most business use cases (say, financial markets) can live with this, as
>>   they do not rely on edge cases.
>>
>> Now, what is meant by "Real-time Mode"?
>>
>> This is where the nuance comes in. "Real-time" is a broader and sometimes
>> more subjective term. When the text introduces "Real-time Mode" as
>> distinct from "Continuous Mode", it suggests a specific implementation
>> that achieves real-time characteristics but might do so differently, or
>> more robustly, than a "continuous" mode attempt. Going back to my earlier
>> mention: in a real-time application, there is no such thing as an answer
>> that is late and correct. The timeliness is part of the application. If I
>> get the right answer too slowly, it becomes useless or wrong. This is
>> what I call the "Late and Correct is Useless" Principle.
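>>
>> Coming back to "Continuous Mode", here is the trigger sketch promised
>> above: a minimal, self-contained illustration of how Spark's existing
>> continuous processing is switched on. The rate source and console sink
>> are just stand-ins I picked for illustration:
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.streaming.Trigger
>>
>> val spark = SparkSession.builder()
>>   .appName("continuous-mode-sketch")
>>   .master("local[*]")
>>   .getOrCreate()
>>
>> // Built-in test source emitting (timestamp, value) rows; it is one of
>> // the sources supported in continuous processing.
>> val events = spark.readStream
>>   .format("rate")
>>   .option("rowsPerSecond", "10")
>>   .load()
>>
>> val query = events.writeStream
>>   .format("console")
>>   // Continuous processing: event-at-a-time execution; "1 second" is the
>>   // checkpoint interval, not a batch boundary.
>>   .trigger(Trigger.Continuous("1 second"))
>>   // Classic micro-batching, for comparison, would be:
>>   // .trigger(Trigger.ProcessingTime("10 seconds"))
>>   .start()
>>
>> query.awaitTermination()
>>
>> Note that continuous processing only supports map-like operations
>> (projections and selections, no aggregations), which is worth bearing in
>> mind in this comparison.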
>>
>> In summary, "Real-time Mode" seems to describe an approach that delivers
>> low-latency processing with high reliability and ease of use, leveraging
>> established, battle-tested components. I invite the audience to have a
>> discussion on this.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>
>>> +1
>>>
>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>> > +1
>>> >
>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>> >
>>> > > +1.
>>> > >
>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>> > >
>>> > >> +1
>>> > >> Sent from my iPhone
>>> > >>
>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>> > >>
>>> > >> +1 Nice feature
>>> > >>
>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>> > >>>
>>> > >>>> +1, LGTM.
>>> > >>>>
>>> > >>>> Kent
>>> > >>>>
>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>> > >>>>
>>> > >>>>> +1. Super excited by this initiative!
>>> > >>>>>
>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>>> +1
>>> > >>>>>>
>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
>>> > >>>>>> huaxin.ga...@gmail.com> wrote:
>>> > >>>>>>
>>> > >>>>>>> +1
>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can
>>> > >>>>>>> eliminate the need for separate streaming engines, reducing
>>> > >>>>>>> system complexity and operational cost. Excited to see this
>>> > >>>>>>> direction!
>>> > >>>>>>>
>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> Hi,
>>> > >>>>>>>>
>>> > >>>>>>>> My point that "in a real-time application, there is no such
>>> > >>>>>>>> thing as an answer that is late and correct; the timeliness is
>>> > >>>>>>>> part of the application, and if I get the right answer too
>>> > >>>>>>>> slowly it becomes useless or wrong" is actually fundamental to
>>> > >>>>>>>> *why* we need this Spark Structured Streaming proposal.
>>> > >>>>>>>>
>>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>>> > >>>>>>>> applications where, as I define it, the *timeliness* of the
>>> > >>>>>>>> answer is as critical as its *correctness*. Spark's current
>>> > >>>>>>>> streaming engine, primarily operating on micro-batches, often
>>> > >>>>>>>> delivers results that are technically "correct" but arrive too
>>> > >>>>>>>> late to be truly useful for certain high-stakes, real-time
>>> > >>>>>>>> scenarios. This makes them "useless or wrong" in a practical,
>>> > >>>>>>>> business-critical sense.
>>> > >>>>>>>>
>>> > >>>>>>>> For example, in *real-time fraud detection* and in
>>> > >>>>>>>> *high-frequency trading*, market data or trade execution
>>> > >>>>>>>> commands must be delivered with minimal latency. Even a slight
>>> > >>>>>>>> delay can mean missed opportunities or significant financial
>>> > >>>>>>>> losses, making a "correct" price update useless if it is not
>>> > >>>>>>>> instantaneous.
>>> > >>>>>>>> This proposal is about making Spark suitable for these
>>> > >>>>>>>> demanding use cases, where a "late but correct" answer is
>>> > >>>>>>>> simply not good enough. As a corollary, it is a fundamental
>>> > >>>>>>>> concept, so it has to be treated as such, not as a comment in
>>> > >>>>>>>> the SPIP.
>>> > >>>>>>>>
>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>> > >>>>>>>>
>>> > >>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>> Analysis | GDPR
>>> > >>>>>>>>
>>> > >>>>>>>> view my Linkedin profile
>>> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>
>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> Hey Mich,
>>> > >>>>>>>>>
>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your
>>> > >>>>>>>>> definition here have to do with the SPIP? Perhaps add
>>> > >>>>>>>>> comments directly to the SPIP to provide context, as the code
>>> > >>>>>>>>> snippet below is a direct copy from the SPIP itself.
>>> > >>>>>>>>>
>>> > >>>>>>>>> Thanks,
>>> > >>>>>>>>> Denny
>>> > >>>>>>>>>
>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> just to add
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> A stronger definition of real time. The engineering
>>> > >>>>>>>>>> definition of real time is roughly "fast enough to be
>>> > >>>>>>>>>> interactive".
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> However, I put forward a stronger definition. In a real-time
>>> > >>>>>>>>>> application, there is no such thing as an answer that is
>>> > >>>>>>>>>> late and correct. The timeliness is part of the application.
>>> > >>>>>>>>>> If I get the right answer too slowly, it becomes useless or
>>> > >>>>>>>>>> wrong.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>>>> Analysis | GDPR
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> view my Linkedin profile
>>> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If
>>> > >>>>>>>>>>> you are going to reduce micro-batching, this reduction must
>>> > >>>>>>>>>>> be balanced against the available processing capacity of
>>> > >>>>>>>>>>> the cluster to prevent back pressure and instability. In
>>> > >>>>>>>>>>> the case of Continuous Processing mode, one sets a specific
>>> > >>>>>>>>>>> continuous trigger with a desired checkpoint interval; the
>>> > >>>>>>>>>>> SPIP proposes the analogous control for the new mode.
>>> > >>>>>>>>>>> Quoting the SPIP:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> "
>>> > >>>>>>>>>>> df.writeStream
>>> > >>>>>>>>>>>   .format("...")
>>> > >>>>>>>>>>>   .option("...")
>>> > >>>>>>>>>>>   // new trigger type to enable real-time Mode
>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))
>>> > >>>>>>>>>>>   .start()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in
>>> > >>>>>>>>>>> the new ultra-low-latency execution mode. A time interval
>>> > >>>>>>>>>>> can also be specified, e.g. "300 Seconds", to indicate how
>>> > >>>>>>>>>>> long each micro-batch should run for.
>>> > >>>>>>>>>>> "
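>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> For concreteness, here is a fuller sketch of what such a
>>> > >>>>>>>>>>> query might look like end to end. This is hypothetical on
>>> > >>>>>>>>>>> my part: Trigger.RealTime is only the API proposed in the
>>> > >>>>>>>>>>> SPIP, not something in released Spark, and the broker
>>> > >>>>>>>>>>> address and topic names are made up:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>> > >>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> val spark = SparkSession.builder()
>>> > >>>>>>>>>>>   .appName("realtime-mode-sketch")
>>> > >>>>>>>>>>>   .getOrCreate()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> // Read a stream of trades from Kafka (made-up broker and
>>> > >>>>>>>>>>> // topic).
>>> > >>>>>>>>>>> val trades = spark.readStream
>>> > >>>>>>>>>>>   .format("kafka")
>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>> > >>>>>>>>>>>   .option("subscribe", "trades")
>>> > >>>>>>>>>>>   .load()
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> // A map-like transformation; the point the SPIP makes is
>>> > >>>>>>>>>>> // that the query itself stays ordinary Structured
>>> > >>>>>>>>>>> // Streaming code.
>>> > >>>>>>>>>>> val alerts = trades
>>> > >>>>>>>>>>>   .selectExpr("CAST(value AS STRING) AS value")
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> alerts.writeStream
>>> > >>>>>>>>>>>   .format("kafka")
>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>> > >>>>>>>>>>>   .option("topic", "alerts")
>>> > >>>>>>>>>>>   .option("checkpointLocation", "/tmp/alerts-ckpt")
>>> > >>>>>>>>>>>   // The proposed trigger from the SPIP; only this line
>>> > >>>>>>>>>>>   // would change relative to a micro-batch query.
>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))
>>> > >>>>>>>>>>>   .start()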
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> The choice of such an interval will inevitably depend on
>>> > >>>>>>>>>>> many factors. Not that simple.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> HTH
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>> > >>>>>>>>>>> Analysis | GDPR
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> view my Linkedin profile
>>> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>>> Hi all,
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>>> > >>>>>>>>>>>> "Real-Time Mode in Apache Spark Structured Streaming" that
>>> > >>>>>>>>>>>> I've been working on with Siying Dong, Indrajit Roy, Chao
>>> > >>>>>>>>>>>> Sun, Jungtaek Lim, and Michael Armbrust:
>>> > >>>>>>>>>>>> [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>> > >>>>>>>>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>]
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time
>>> > >>>>>>>>>>>> Mode" in Spark Structured Streaming that significantly
>>> > >>>>>>>>>>>> lowers end-to-end latency for processing streams of data.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our
>>> > >>>>>>>>>>>> goal is to make Spark capable of handling streaming jobs
>>> > >>>>>>>>>>>> that need results almost immediately (within O(100)
>>> > >>>>>>>>>>>> milliseconds). We want to achieve this without changing
>>> > >>>>>>>>>>>> the high-level DataFrame/Dataset API that users already
>>> > >>>>>>>>>>>> use, so existing streaming queries can run in this new
>>> > >>>>>>>>>>>> ultra-low-latency mode by simply turning it on, without
>>> > >>>>>>>>>>>> rewriting their logic.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time
>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>>> > >>>>>>>>>>>> personalization) that today cannot meet their latency
>>> > >>>>>>>>>>>> requirements with Spark's current streaming engine.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>> > >>>>>>>>>>>> suggestions on this approach!
>>> > >>>>>>
>>> > >>>>>> --
>>> > >>>>>> Best,
>>> > >>>>>> Yanbo
>>> > >>
>>> > >> --
>>> > >> John Zhuge
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org