Hi Jerry, In essence, these definitions (hard or soft) help clarify that "real-time" is *not a single, monolithic concept* here, but rather a spectrum defined by the criticality of timeliness and the systems under consideration. Common data processing solutions branded as "real-time" typically operate on the softer end of this spectrum, providing performance that is crucial for the applications under consideration (for example, within SLAs), where delays are undesirable but not showstoppers.
I therefore suggest the SPIP should mention this explicitly, so we can move on. Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Fri, 30 May 2025 at 07:57, Jerry Peng <jerry.boyang.p...@gmail.com> wrote: > Mark, > > For real-time systems there is a concept of "soft" real-time and "hard" > real-time systems. These concepts exist in textbooks. Here is a document > by Intel that explains it: > > > https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html > > "In a soft real-time system, computers or equipment will continue to > function after a missed deadline but may produce a lower-quality output. > For example, latency in online video games can impact player interactions, > but otherwise present no serious consequences." > > "Hard real-time systems have zero delay tolerance, and delayed signals can > result in total failure or present immediate danger to users. Flight > control systems and pacemakers are both examples where timeliness is not > only essential but the lack of it can result in a life-or-death situation." > > I don't think it is inaccurate or misleading to call this mode real-time. > It is soft real-time. > > On Thu, May 29, 2025 at 11:44 PM Mark Hamstra <markhams...@gmail.com> > wrote: > >> Clarifying what is meant by "real-time" and explicitly differentiating it >> from actual real-time computing should be a bare minimum. I still don't >> like the use of marketing-speak "real-time" that isn't really real-time in >> engineering documents or API namespaces. >> >> On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com> >> wrote: >> >>> Mark, >>> >>> I thought we were simply discussing the naming of the mode? 
Like I >>> mentioned, if you think simply calling this mode "real-time" mode may cause >>> confusion because "real-time" can mean other things in other fields, I can >>> clarify what we mean by "real-time" explicitly in the SPIP document and any >>> future documentation. That is not a problem, and thank you for your feedback. >>> >>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com> >>> wrote: >>> >>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is an >>>> engineering document, not a marketing document. Technical clarity and >>>> accuracy should be non-negotiable. >>>> >>>> >>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng < >>>> jerry.boyang.p...@gmail.com> wrote: >>>> >>>>> Mark, >>>>> >>>>> As an example of my point, if you go to the Apache Storm (another >>>>> stream processing engine) website: >>>>> >>>>> https://storm.apache.org/ >>>>> >>>>> It describes Storm as: >>>>> >>>>> "Apache Storm is a free and open source distributed *realtime* >>>>> computation system" >>>>> >>>>> If you go to Apache Flink: >>>>> >>>>> >>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/ >>>>> >>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing" >>>>> >>>>> Thus, what the term "real-time" implies in this context should not be confusing >>>>> for folks in this area. >>>>> >>>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng < >>>>> jerry.boyang.p...@gmail.com> wrote: >>>>> >>>>>> Mich, >>>>>> >>>>>> If I understood your last email correctly, I think you also wanted to >>>>>> have a discussion about naming? Why are we calling this new execution >>>>>> mode >>>>>> described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, >>>>>> "continuous mode" is taken and we want another name to describe an >>>>>> execution mode that provides ultra-low-latency processing. 
We could have >>>>>> called it "low latency mode", though I don't really like that naming >>>>>> since >>>>>> it implies the other execution modes are not low latency, which I don't >>>>>> believe is true. This new proposed mode can simply deliver even lower >>>>>> latency. Thus, we came up with the name "Real-time Mode". Of course, we >>>>>> are talking about "soft" real-time here. I think when we are talking >>>>>> about >>>>>> distributed stream processing systems in the space of big data analytics, >>>>>> it is reasonable to assume anything described in this space as >>>>>> "real-time" >>>>>> implies "soft" real-time. Though if this is confusing or misleading, we >>>>>> can provide clear documentation on what "real-time" in real-time mode >>>>>> means >>>>>> and what it guarantees. Just my thoughts. I would love to hear other >>>>>> perspectives. >>>>>> >>>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh < >>>>>> mich.talebza...@gmail.com> wrote: >>>>>> >>>>>>> I think from what I have seen there are a good number of +1 >>>>>>> responses as opposed to quantitative discussions (based on my >>>>>>> observations >>>>>>> only). Given the objectives of the thread, we ought to focus on what is >>>>>>> meant by real time compared to continuous modes. To be fair, it is a >>>>>>> common point of confusion, and the terms are often used interchangeably >>>>>>> in >>>>>>> general conversation, but in technical contexts, especially with >>>>>>> streaming >>>>>>> data platforms, they have specific and important differences. >>>>>>> >>>>>>> "Continuous Mode" refers to a processing strategy that aims for >>>>>>> true, uninterrupted, sub-millisecond latency processing. Chiefly: >>>>>>> >>>>>>> - Event-at-a-Time (or very small batch groups): The system >>>>>>> processes individual events or extremely small groups of events >>>>>>> (micro-batches) as they flow through the pipeline. 
>>>>>>> - Minimal Latency: The primary goal is to achieve the absolute >>>>>>> lowest possible end-to-end latency, often on the order of >>>>>>> milliseconds or >>>>>>> even lower. >>>>>>> - Most business use cases (say financial markets) can live with >>>>>>> this, as they do not rely on edges. >>>>>>> >>>>>>> Now, what is meant by "Real-time Mode"? >>>>>>> >>>>>>> This is where the nuance comes in. "Real-time" is a broader and >>>>>>> sometimes more subjective term. When the text introduces "Real-time >>>>>>> Mode" >>>>>>> as distinct from "Continuous Mode," it suggests a specific >>>>>>> implementation >>>>>>> that achieves real-time characteristics but might do so differently or >>>>>>> more >>>>>>> robustly than a "continuous" mode attempt. Going back to my earlier >>>>>>> point: in a real-time application, there is no such thing as an answer that >>>>>>> is >>>>>>> late and correct. The timeliness is part of the >>>>>>> application. >>>>>>> If I get the right answer too slowly, it becomes useless or wrong. This is what I >>>>>>> call the "Late and Correct is Useless" Principle. >>>>>>> >>>>>>> In summary, "Real-time Mode" seems to describe an approach that >>>>>>> delivers low-latency processing with high reliability and ease of use, >>>>>>> leveraging established, battle-tested components. I invite the audience >>>>>>> to >>>>>>> have a discussion on this. >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> Dr Mich Talebzadeh, >>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>>>>> >>>>>>> view my Linkedin profile >>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote: >>>>>>> >>>>>>>> +1 >>>>>>>> >>>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote: >>>>>>>> > +1 >>>>>>>> > >>>>>>>> > On Thu, May 29, 2025 at 02:22, Yuming Wang <yumw...@apache.org> wrote: >>>>>>>> > >>>>>>>> > > +1. 
>>>>>>>> > > >>>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> >>>>>>>> wrote: >>>>>>>> > > >>>>>>>> > >> +1 >>>>>>>> > >> Sent from my iPhone >>>>>>>> > >> >>>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> >>>>>>>> wrote: >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> +1 Nice feature >>>>>>>> > >> >>>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li < >>>>>>>> xyliyuanj...@gmail.com> >>>>>>>> > >> wrote: >>>>>>>> > >> >>>>>>>> > >>> +1 >>>>>>>> > >>> >>>>>>>> > >>> On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote: >>>>>>>> > >>> >>>>>>>> > >>>> +1, LGTM. >>>>>>>> > >>>> >>>>>>>> > >>>> Kent >>>>>>>> > >>>> >>>>>>>> > >>>> On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote: >>>>>>>> > >>>> >>>>>>>> > >>>>> +1. Super excited by this initiative! >>>>>>>> > >>>>> >>>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang < >>>>>>>> yblia...@gmail.com> >>>>>>>> > >>>>> wrote: >>>>>>>> > >>>>> >>>>>>>> > >>>>>> +1 >>>>>>>> > >>>>>> >>>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao < >>>>>>>> huaxin.ga...@gmail.com> >>>>>>>> > >>>>>> wrote: >>>>>>>> > >>>>>> >>>>>>>> > >>>>>>> +1 >>>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we >>>>>>>> can >>>>>>>> > >>>>>>> eliminate the need for separate streaming engines, >>>>>>>> reducing system >>>>>>>> > >>>>>>> complexity and operational cost. Excited to see this >>>>>>>> direction! >>>>>>>> > >>>>>>> >>>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh < >>>>>>>> > >>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>>> > >>>>>>> >>>>>>>> > >>>>>>>> Hi, >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> My point about "in real-time applications or data, there >>>>>>>> is no such thing >>>>>>>> > >>>>>>>> as an answer that is late and correct. >>>>>>>> The timeliness is >>>>>>>> > >>>>>>>> part of the application. 
If I get the right answer too >>>>>>>> slowly, it becomes >>>>>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we >>>>>>>> need this >>>>>>>> > >>>>>>>> Spark Structured Streaming proposal. >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power >>>>>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of >>>>>>>> the answer >>>>>>>> > >>>>>>>> is as critical as its *correctness*. Spark's current >>>>>>>> streaming >>>>>>>> > >>>>>>>> engine, primarily operating on micro-batches, often >>>>>>>> delivers results that >>>>>>>> > >>>>>>>> are technically "correct" but arrive too late to be >>>>>>>> truly useful for >>>>>>>> > >>>>>>>> certain high-stakes, real-time scenarios. This makes >>>>>>>> them "useless or >>>>>>>> > >>>>>>>> wrong" in a practical, business-critical sense. >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> For example, *in real-time fraud detection* and in >>>>>>>> *high-frequency >>>>>>>> > >>>>>>>> trading*, market data or trade execution commands must be >>>>>>>> > >>>>>>>> delivered with minimal latency. Even a slight delay can >>>>>>>> mean missed >>>>>>>> > >>>>>>>> opportunities or significant financial losses, making a >>>>>>>> "correct" price >>>>>>>> > >>>>>>>> update useless if it's not instantaneous. This proposal aims to make Spark suitable for these >>>>>>>> demanding >>>>>>>> > >>>>>>>> use cases, where a "late but correct" answer is simply >>>>>>>> not good enough. As >>>>>>>> > >>>>>>>> a corollary, it is a fundamental concept, so it has to be >>>>>>>> treated as such, not >>>>>>>> > >>>>>>>> as a comment in the SPIP. >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms. >>>>>>>> > >>>>>>>> Dr Mich Talebzadeh, >>>>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic >>>>>>>> Analysis | >>>>>>>> > >>>>>>>> GDPR >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> view my Linkedin profile >>>>>>>> > >>>>>>>> < >>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee < >>>>>>>> denny.g....@gmail.com> >>>>>>>> > >>>>>>>> wrote: >>>>>>>> > >>>>>>>> >>>>>>>> > >>>>>>>>> Hey Mich, >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does >>>>>>>> your >>>>>>>> > >>>>>>>>> definition here have to do with the SPIP? Perhaps add >>>>>>>> comments directly >>>>>>>> > >>>>>>>>> to the SPIP to provide context, as the code snippet >>>>>>>> below is a direct copy >>>>>>>> > >>>>>>>>> from the SPIP itself. >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> Thanks, >>>>>>>> > >>>>>>>>> Denny >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh < >>>>>>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>>> > >>>>>>>>> >>>>>>>> > >>>>>>>>>> just to add >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering >>>>>>>> definition of >>>>>>>> > >>>>>>>>>> real time is roughly "fast enough to be interactive". >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> However, I put a stronger definition. In real-time >>>>>>>> applications or >>>>>>>> > >>>>>>>>>> data, there is no such thing as an answer that >>>>>>>> is late and >>>>>>>> > >>>>>>>>>> correct. 
The timeliness is part of the application. If >>>>>>>> I get the right >>>>>>>> > >>>>>>>>>> answer too slowly, it becomes useless or wrong. >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh, >>>>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic >>>>>>>> Analysis | >>>>>>>> > >>>>>>>>>> GDPR >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> view my Linkedin profile >>>>>>>> > >>>>>>>>>> < >>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh < >>>>>>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote: >>>>>>>> > >>>>>>>>>> >>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from >>>>>>>> micro-batching. If you >>>>>>>> > >>>>>>>>>>> are going to reduce micro-batching, this reduction >>>>>>>> must be balanced against >>>>>>>> > >>>>>>>>>>> the available processing capacity of the cluster to >>>>>>>> prevent back pressure >>>>>>>> > >>>>>>>>>>> and instability. In the case of Continuous Processing >>>>>>>> mode, a >>>>>>>> > >>>>>>>>>>> specific continuous trigger with a desired checkpoint >>>>>>>> interval can be set; quote >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> " >>>>>>>> > >>>>>>>>>>> df.writeStream >>>>>>>> > >>>>>>>>>>> .format("...") >>>>>>>> > >>>>>>>>>>> .option("...") >>>>>>>> > >>>>>>>>>>> .trigger(Trigger.RealTime(“300 Seconds”)) // >>>>>>>> new trigger >>>>>>>> > >>>>>>>>>>> type to enable real-time Mode >>>>>>>> > >>>>>>>>>>> .start() >>>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should >>>>>>>> run in the >>>>>>>> > >>>>>>>>>>> new ultra low-latency execution mode. A time >>>>>>>> interval can also be >>>>>>>> > >>>>>>>>>>> specified, e.g. “300 Seconds”, to indicate how long >>>>>>>> each micro-batch should >>>>>>>> > >>>>>>>>>>> run for. 
>>>>>>>> > >>>>>>>>>>> " >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors. Not that >>>>>>>> simple >>>>>>>> > >>>>>>>>>>> HTH >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh, >>>>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic >>>>>>>> Analysis | >>>>>>>> > >>>>>>>>>>> GDPR >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> view my Linkedin profile >>>>>>>> > >>>>>>>>>>> < >>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng < >>>>>>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote: >>>>>>>> > >>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> Hi all, >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP >>>>>>>> titled >>>>>>>> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured >>>>>>>> Streaming” that I've been >>>>>>>> > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun, >>>>>>>> Jungtaek Lim, and >>>>>>>> > >>>>>>>>>>>> Michael Armbrust: [JIRA >>>>>>>> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] >>>>>>>> [Doc >>>>>>>> > >>>>>>>>>>>> < >>>>>>>> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing >>>>>>>> > >>>>>>>> > >>>>>>>>>>>> ]. >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called >>>>>>>> “Real-time Mode” >>>>>>>> > >>>>>>>>>>>> in Spark Structured Streaming that significantly >>>>>>>> lowers end-to-end latency >>>>>>>> > >>>>>>>>>>>> for processing streams of data. >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. 
>>>>>>>> Our goal is >>>>>>>> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs >>>>>>>> that need results almost >>>>>>>> > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want to >>>>>>>> achieve this without >>>>>>>> > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that >>>>>>>> users already use – so >>>>>>>> > >>>>>>>>>>>> existing streaming queries can run in this new >>>>>>>> ultra-low-latency mode by >>>>>>>> > >>>>>>>>>>>> simply turning it on, without rewriting their logic. >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power >>>>>>>> real-time >>>>>>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live >>>>>>>> personalization) that >>>>>>>> > >>>>>>>>>>>> today cannot meet their latency requirements with >>>>>>>> Spark’s current streaming >>>>>>>> > >>>>>>>>>>>> engine. >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and >>>>>>>> > >>>>>>>>>>>> suggestions on this approach! >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>>>>>>>> >>>>>>>> > >>>>>> >>>>>>>> > >>>>>> -- >>>>>>>> > >>>>>> Best, >>>>>>>> > >>>>>> Yanbo >>>>>>>> > >>>>>> >>>>>>>> > >>>>> >>>>>>>> > >> >>>>>>>> > >> -- >>>>>>>> > >> John Zhuge >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >>>>>>>> >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>>> >>>>>>>>
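[Editorial note, not part of the original thread] To make the SPIP snippet quoted earlier in the thread easier to compare with what exists today, here is a sketch of the same query under Spark's current micro-batch trigger and under the proposed trigger. This is illustrative only: Trigger.RealTime is the API proposed in the SPIP and does not exist in released Spark versions (so it is shown commented out), and the "rate" source, console sink, and checkpoint paths are placeholder choices, not taken from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object RealTimeModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("real-time-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder source: Spark's built-in "rate" source emits rows on a timer.
    val df = spark.readStream.format("rate").load()

    // Today: micro-batch execution. Results are emitted once per micro-batch,
    // so end-to-end latency is bounded below by the trigger interval.
    val microBatch = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ckpt-micro")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    // Proposed: real-time mode. Per the SPIP excerpt, the interval indicates
    // how long each long-running batch should run, not how often results are
    // emitted; results are intended to surface within O(100) milliseconds.
    // Trigger.RealTime is proposed API and will not compile against current Spark:
    // val realTime = df.writeStream
    //   .format("console")
    //   .option("checkpointLocation", "/tmp/ckpt-rt")
    //   .trigger(Trigger.RealTime("300 seconds"))
    //   .start()

    microBatch.awaitTermination()
  }
}
```

The point of the comparison is the compatibility claim in the SPIP: the DataFrame logic and sink configuration are unchanged between the two queries; only the trigger differs.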