Referencing others' misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.
On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
> Mark,
>
> As an example of my point, if you go to the Apache Storm (another stream processing engine) website:
>
> https://storm.apache.org/
>
> It describes Storm as:
>
> "Apache Storm is a free and open source distributed *realtime* computation system"
>
> If you go to Apache Flink:
>
> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>
> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>
> Thus, what the term "real-time" implies should not be confusing for folks in this area.
>
> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>
>> Mich,
>>
>> If I understood your last email correctly, I think you also wanted to have a discussion about naming? Why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described in this space as "real-time" implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in Real-time Mode means and what it guarantees. Just my thoughts. I would love to hear other perspectives.
>>
>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>
>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond latency processing. Chiefly:
>>>
>>> - Event-at-a-time (or very small batches): The system processes individual events or extremely small groups of events (micro-batches) as they flow through the pipeline.
>>> - Minimal latency: The primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>> - Most business use cases (say, financial markets) can live with this, as they do not operate at these latency edges.
>>>
>>> Now, what is meant by "Real-time Mode"?
>>>
>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode," it suggests a specific implementation that achieves real-time characteristics but might do so differently or more robustly than a "continuous" mode attempt.
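>>>
>>> For reference, today's Continuous Processing mode is selected through the existing Trigger.Continuous API. A minimal sketch, runnable in spark-shell (where spark is predefined):
>>>
>>> import org.apache.spark.sql.streaming.Trigger
>>>
>>> // The built-in "rate" source emits rows continuously; handy for latency tests.
>>> val events = spark.readStream
>>>   .format("rate")
>>>   .option("rowsPerSecond", "100")
>>>   .load()
>>>
>>> // Trigger.Continuous runs the query in Continuous Processing mode;
>>> // "1 second" is the checkpoint interval, not a batch boundary.
>>> events.writeStream
>>>   .format("console")
>>>   .option("checkpointLocation", "/tmp/chk-continuous") // hypothetical path
>>>   .trigger(Trigger.Continuous("1 second"))
>>>   .start()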
>>> Going back to my earlier point: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>
>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>
>>>> +1
>>>>
>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>> > +1
>>>> >
>>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>>> >
>>>> > > +1.
>>>> > >
>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>> > >
>>>> > >> +1
>>>> > >> Sent from my iPhone
>>>> > >>
>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>> > >>
>>>> > >> +1 Nice feature
>>>> > >>
>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>> > >>
>>>> > >>> +1
>>>> > >>>
>>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>>> > >>>
>>>> > >>>> +1, LGTM.
>>>> > >>>>
>>>> > >>>> Kent
>>>> > >>>>
>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>> > >>>>
>>>> > >>>>> +1. Super excited by this initiative!
>>>> > >>>>>
>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>> > >>>>>
>>>> > >>>>>> +1
>>>> > >>>>>>
>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> +1
>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>> > >>>>>>>
>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>
>>>> > >>>>>>>> Hi,
>>>> > >>>>>>>>
>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct; timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>> > >>>>>>>>
>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
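>>>> > >>>>>>>>
>>>> > >>>>>>>> As an aside, Spark's existing event-time watermarking already encodes a version of this principle for input data: events that arrive behind the watermark are dropped rather than produce a late answer. A minimal sketch, runnable in spark-shell:
>>>> > >>>>>>>>
>>>> > >>>>>>>> import org.apache.spark.sql.functions.{col, window}
>>>> > >>>>>>>>
>>>> > >>>>>>>> // The built-in "rate" source emits (timestamp, value) rows.
>>>> > >>>>>>>> val events = spark.readStream
>>>> > >>>>>>>>   .format("rate")
>>>> > >>>>>>>>   .option("rowsPerSecond", "100")
>>>> > >>>>>>>>   .load()
>>>> > >>>>>>>>
>>>> > >>>>>>>> // Rows older than the watermark (max event time seen minus 10s)
>>>> > >>>>>>>> // are discarded: "late and correct" data never reaches the aggregation.
>>>> > >>>>>>>> val counts = events
>>>> > >>>>>>>>   .withWatermark("timestamp", "10 seconds")
>>>> > >>>>>>>>   .groupBy(window(col("timestamp"), "1 minute"))
>>>> > >>>>>>>>   .count()
>>>> > >>>>>>>>
>>>> > >>>>>>>> counts.writeStream
>>>> > >>>>>>>>   .outputMode("update")
>>>> > >>>>>>>>   .format("console")
>>>> > >>>>>>>>   .option("checkpointLocation", "/tmp/chk-watermark") // hypothetical path
>>>> > >>>>>>>>   .start()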
>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade-execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>
>>>> > >>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>> > >>>>>>>>
>>>> > >>>>>>>>> Hey Mich,
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> Thanks,
>>>> > >>>>>>>>> Denny
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>>> Just to add:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> However, I put forward a stronger definition: in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> > >>>>>>>>>>
>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability.
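>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> In today's micro-batch mode that balance is typically struck by pairing the trigger interval with a source-side rate cap. A minimal sketch, runnable in spark-shell with the spark-sql-kafka package on the classpath; the broker, topic, and checkpoint path are hypothetical:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> val stream = spark.readStream
>>>> > >>>>>>>>>>>   .format("kafka")
>>>> > >>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
>>>> > >>>>>>>>>>>   .option("subscribe", "events")                    // hypothetical topic
>>>> > >>>>>>>>>>>   .option("maxOffsetsPerTrigger", "10000")          // cap records per micro-batch
>>>> > >>>>>>>>>>>   .load()
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> // A shorter interval lowers latency, but each batch must still
>>>> > >>>>>>>>>>> // finish within it, or the query falls behind.
>>>> > >>>>>>>>>>> stream.writeStream
>>>> > >>>>>>>>>>>   .format("console")
>>>> > >>>>>>>>>>>   .option("checkpointLocation", "/tmp/chk-microbatch")
>>>> > >>>>>>>>>>>   .trigger(Trigger.ProcessingTime("1 second"))
>>>> > >>>>>>>>>>>   .start()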
>>>> > >>>>>>>>>>> In the case of Continuous Processing mode, the choice of a specific continuous trigger with a desired checkpoint interval, quote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> "
>>>> > >>>>>>>>>>> df.writeStream
>>>> > >>>>>>>>>>>   .format("...")
>>>> > >>>>>>>>>>>   .option("...")
>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable Real-Time Mode
>>>> > >>>>>>>>>>>   .start()
>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>> > >>>>>>>>>>> "
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> will inevitably depend on many factors. It is not that simple.
>>>> > >>>>>>>>>>> HTH
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>> > >>>>>>>>>>>
>>>> > >>>>>>>>>>>> Hi all,
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use – so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
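>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> To make the compatibility point concrete, here is a sketch of how an existing query would opt in, assuming the Trigger.RealTime API proposed in this SPIP (not yet part of Spark); the broker, topic, and checkpoint path are hypothetical, and the Kafka source requires the spark-sql-kafka package:
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> // The query logic is ordinary DataFrame code and stays unchanged.
>>>> > >>>>>>>>>>>> val alerts = spark.readStream
>>>> > >>>>>>>>>>>>   .format("kafka")
>>>> > >>>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>>> > >>>>>>>>>>>>   .option("subscribe", "transactions")
>>>> > >>>>>>>>>>>>   .load()
>>>> > >>>>>>>>>>>>   .selectExpr("CAST(value AS STRING) AS payload")
>>>> > >>>>>>>>>>>>   .filter("payload LIKE '%ALERT%'")
>>>> > >>>>>>>>>>>>
>>>> > >>>>>>>>>>>> // Only the trigger changes to opt in to the new mode.
>>>> > >>>>>>>>>>>> alerts.writeStream
>>>> > >>>>>>>>>>>>   .format("console")
>>>> > >>>>>>>>>>>>   .option("checkpointLocation", "/tmp/chk-realtime")
>>>> > >>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // proposed trigger from the SPIP
>>>> > >>>>>>>>>>>>   .start()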
>>>> > >>>>>> --
>>>> > >>>>>> Best,
>>>> > >>>>>> Yanbo
>>>> > >> --
>>>> > >> John Zhuge