Mark, I thought we were simply discussing the naming of the mode? As I mentioned, if you think simply calling this mode "real-time" mode may cause confusion because "real-time" can mean other things in other fields, I can clarify explicitly what we mean by "real-time" in the SPIP document and any future documentation. That is not a problem, and thank you for your feedback.
On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com> wrote:

Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.

On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Mark,

As an example of my point, if you go to the Apache Storm (another stream processing engine) website:

https://storm.apache.org/

It describes Storm as:

"Apache Storm is a free and open source distributed *realtime* computation system"

If you go to the Apache Flink website:

https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/

"Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"

Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.

On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Mich,

If I understood your last email correctly, I think you also wanted to have a discussion about naming? Why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described as "real-time" in this space implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in real-time mode means and what it guarantees. Just my thoughts. I would love to hear other perspectives.

On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real-time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.

"Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond-latency processing. Chiefly (a short sketch follows the list):

- Event-at-a-time (or very small batch groups): the system processes individual events or extremely small groups of events (micro-batches) as they flow through the pipeline.
- Minimal latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
- Most business use cases (say, financial markets) can live with this, as they do not rely on hard real-time deadlines.
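For concreteness, here is a minimal sketch of how these two existing execution styles are selected through Spark's public trigger API; the rate source and console sink are stand-ins for a real pipeline, not part of the SPIP:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("trigger-sketch").getOrCreate()

    // Built-in test source that emits one row per second.
    val events = spark.readStream.format("rate").load()

    // Micro-batch execution: accumulate input for one second, then process
    // that slice of data as a batch.
    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()

    // Continuous processing (experimental since Spark 2.3): records are
    // processed as they arrive; the interval only controls how often
    // checkpoints are written, not how input is grouped.
    events.writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

Note that continuous processing supports only map-like operations and a subset of sources and sinks, which is arguably part of the context for proposing a new mode rather than extending it.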
"Real-time" is a broader and >>>> sometimes more subjective term. When the text introduces "Real-time Mode" >>>> as distinct from "Continuous Mode," it suggests a specific implementation >>>> that achieves real-time characteristics but might do so differently or more >>>> robustly than a "continuous" mode attempt. Going back to my earlier >>>> mention, in real time application , there is nothing as an answer which is >>>> supposed to be late and correct. The timeliness is part of the application. >>>> if I get the right answer too slowly it becomes useless or wrong. What I >>>> call the "Late and Correct is Useless" Principle >>>> >>>> In summary, "Real-time Mode" seems to describe an approach that >>>> delivers low-latency processing with high reliability and ease of use, >>>> leveraging established, battle-tested components.I invite the audience to >>>> have a discussion on this. >>>> >>>> HTH >>>> >>>> Dr Mich Talebzadeh, >>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR >>>> >>>> view my Linkedin profile >>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>> >>>> >>>> >>>> >>>> >>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote: >>>> >>>>> +1 >>>>> >>>>> On 2025/05/29 16:25:19 Xiao Li wrote: >>>>> > +1 >>>>> > >>>>> > Yuming Wang <yumw...@apache.org> 于2025年5月29日周四 02:22写道: >>>>> > >>>>> > > +1. >>>>> > > >>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote: >>>>> > > >>>>> > >> +1 >>>>> > >> Sent from my iPhone >>>>> > >> >>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> >>>>> wrote: >>>>> > >> >>>>> > >> >>>>> > >> +1 Nice feature >>>>> > >> >>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li < >>>>> xyliyuanj...@gmail.com> >>>>> > >> wrote: >>>>> > >> >>>>> > >>> +1 >>>>> > >>> >>>>> > >>> Kent Yao <y...@apache.org> 于2025年5月28日周三 19:31写道: >>>>> > >>> >>>>> > >>>> +1, LGTM. >>>>> > >>>> >>>>> > >>>> Kent >>>>> > >>>> >>>>> > >>>> 在 2025年5月29日星期四,Chao Sun <sunc...@apache.org> 写道: >>>>> > >>>> >>>>> > >>>>> +1. Super excited by this initiative! >>>>> > >>>>> >>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang < >>>>> yblia...@gmail.com> >>>>> > >>>>> wrote: >>>>> > >>>>> >>>>> > >>>>>> +1 >>>>> > >>>>>> >>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao < >>>>> huaxin.ga...@gmail.com> >>>>> > >>>>>> wrote: >>>>> > >>>>>> >>>>> > >>>>>>> +1 >>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can >>>>> > >>>>>>> eliminate the need for separate streaming engines, reducing >>>>> system >>>>> > >>>>>>> complexity and operational cost. Excited to see this >>>>> direction! >>>>> > >>>>>>> >>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh < >>>>> > >>>>>>> mich.talebza...@gmail.com> wrote: >>>>> > >>>>>>> >>>>> > >>>>>>>> Hi, >>>>> > >>>>>>>> >>>>> > >>>>>>>> My point about "in real time application or data, there is >>>>> nothing >>>>> > >>>>>>>> as an answer which is supposed to be late and correct. The >>>>> timeliness is >>>>> > >>>>>>>> part of the application. if I get the right answer too >>>>> slowly it becomes >>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we need >>>>> this >>>>> > >>>>>>>> Spark Structured Streaming proposal. >>>>> > >>>>>>>> >>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power >>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of the >>>>> answer >>>>> > >>>>>>>> is as critical as its *correctness*. 
Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.

For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not near-instantaneous. The proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.

Hope this clarifies the connection in practical terms.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

Hey Mich,

Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.

Thanks,
Denny

On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Just to add:

A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".

However, I would put a stronger definition. In a real-time application or data, there is no such thing as an answer that is supposed to be late and correct.
Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The current limitations in Spark Structured Streaming come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, the choice of a specific continuous trigger with a desired checkpoint interval, quote:

    df.writeStream
      .format("...")
      .option("...")
      .trigger(Trigger.RealTime("300 Seconds"))  // new trigger type to enable real-time mode
      .start()

    This Trigger.RealTime signals that the query should run in the new
    ultra-low-latency execution mode. A time interval can also be specified,
    e.g. "300 Seconds", to indicate how long each micro-batch should run for.

will inevitably depend on many factors. Not that simple.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
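Mich's capacity point can be made concrete with an existing knob: however small the trigger interval, ingest per micro-batch can be capped so the cluster keeps up. This is only a sketch, not part of the SPIP; `maxOffsetsPerTrigger` is an existing option of Spark's Kafka source, while the broker address, topic name, and active SparkSession `spark` are placeholders:

    // Cap how many Kafka records each micro-batch may read, trading a little
    // latency for stability under load. Broker and topic are placeholders.
    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "trades")
      .option("maxOffsetsPerTrigger", "10000")
      .load()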
On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Hi all,

I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].

The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.

In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.

We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org