It should not be assumed. In something called "real-time", it should be made explicit which clock-time constraints are and are not guaranteed.
On Thu, May 29, 2025 at 10:00 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> It was kind of hard to see what Mich's point was in the plethora of emails he sent :)
>
> In embedded systems, there is a concept of soft real-time and hard real-time. For these stream processing systems built for big data analytics, it is assumed that we are talking about soft real-time. Sure, there can be an argument as to why this mode is not named "low latency mode", but I honestly don't like debating naming. That name implies the existing execution modes are not low latency, which is not true. And what defines "low" in low latency? It is relative. That is why the name Real-time Mode was selected.
>
> On Thu, May 29, 2025 at 8:57 PM Mark Hamstra <markhams...@gmail.com> wrote:
>
>> I think you are missing his point. There is a fundamental difference between low-latency computation and real-time computing. Is what is described in the SPIP intended to provide results with real-time guarantees, or is it a misnamed effort to achieve low latency?
>>
>> On Thu, May 29, 2025 at 5:54 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> Thank you for chiming in and providing insights into the importance of getting not only correct results but also timely results. You are absolutely right that the reason something like Real-time Mode is valuable is its ability to provide timely results for use cases that require users to react very quickly to data. I can emphasize this point in the SPIP.
>>>
>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> From what I have seen, there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only).
>>>> Given the objectives of the thread, we ought to focus on what is meant by real time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>>
>>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond-latency processing. Chiefly:
>>>>
>>>> - Event-at-a-time (or very small batch groups): the system processes individual events, or extremely small groups of events (micro-batches), as they flow through the pipeline.
>>>> - Minimal latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>>> - Most business use cases (say, financial markets) can live with this, as they do not rely on hard real-time edges.
>>>>
>>>> Now, what is meant by "Real-time Mode"?
>>>>
>>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode", it suggests a specific implementation that achieves real-time characteristics but might do so differently, or more robustly, than a "continuous" mode attempt. Going back to my earlier point: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>>
>>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
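[Editor's note: Mich's "Late and Correct is Useless" principle can be sketched in a few lines of plain Python. This is a hypothetical illustration only; the `Result` and `usable` names are invented for this example and are not Spark API.]

```python
from dataclasses import dataclass

@dataclass
class Result:
    value: float
    produced_at_ms: int  # when the answer became available

def usable(result: Result, event_time_ms: int, deadline_ms: int) -> bool:
    """A correct answer delivered after its deadline is treated as wrong:
    timeliness is part of the application, not an afterthought."""
    return (result.produced_at_ms - event_time_ms) <= deadline_ms

# A price update computed correctly but delivered 500 ms after the event
# misses a 100 ms deadline, so it is "useless or wrong" in practice:
print(usable(Result(value=101.25, produced_at_ms=1_500),
             event_time_ms=1_000, deadline_ms=100))  # False
```

Under this view, a system's correctness criterion includes a clock-time bound, which is exactly the distinction the thread draws between "low latency" (best effort) and "real time" (an explicit constraint).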
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>> view my LinkedIn profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>> > +1
>>>>> >
>>>>> > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22:
>>>>> >
>>>>> > > +1.
>>>>> > >
>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>> > >
>>>>> > >> +1
>>>>> > >> Sent from my iPhone
>>>>> > >>
>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>>> > >>
>>>>> > >> +1 Nice feature
>>>>> > >>
>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>>> > >>
>>>>> > >>> +1
>>>>> > >>>
>>>>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>>>> > >>>
>>>>> > >>>> +1, LGTM.
>>>>> > >>>>
>>>>> > >>>> Kent
>>>>> > >>>>
>>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>> > >>>>
>>>>> > >>>>> +1. Super excited by this initiative!
>>>>> > >>>>>
>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>> > >>>>>
>>>>> > >>>>>> +1
>>>>> > >>>>>>
>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>> > >>>>>>
>>>>> > >>>>>>> +1
>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>
>>>>> > >>>>>>>> Hi,
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer which is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal would make Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough.
>>>>> > >>>>>>>> As a corollary, it is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>> > >>>>>>>>
>>>>> > >>>>>>>>> Hey Mich,
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> Thanks,
>>>>> > >>>>>>>>> Denny
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>>>
>>>>> > >>>>>>>>>> Just to add:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> However, I put forward a stronger definition. In a real-time application, there is no such thing as an answer which is supposed to be late and correct.
>>>>> > >>>>>>>>>> The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> > >>>>>>>>>>
>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval, quote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> "
>>>>> > >>>>>>>>>>> df.writeStream
>>>>> > >>>>>>>>>>>   .format("...")
>>>>> > >>>>>>>>>>>   .option("...")
>>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>>> > >>>>>>>>>>>   .start()
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>> > >>>>>>>>>>> "
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> will inevitably depend on many factors.
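[Editor's note: the capacity caveat in the message above, that a shorter trigger interval must be balanced against the cluster's processing capacity or back pressure builds, can be sketched with a toy queueing model. The numbers and the `backlog_after` helper are invented for illustration and are not Spark code.]

```python
# Toy model: if each batch takes `process_ms` of work but new batches are
# triggered every `interval_ms`, a backlog accumulates whenever
# process_ms > interval_ms -- the essence of back pressure and instability.
def backlog_after(n_batches: int, interval_ms: float, process_ms: float) -> float:
    """Accumulated lag (ms) behind schedule after n batches."""
    per_batch_lag = max(0.0, process_ms - interval_ms)
    return n_batches * per_batch_lag

# A 100 ms trigger with 120 ms of work per batch falls further behind forever:
print(backlog_after(100, interval_ms=100, process_ms=120))  # 2000.0
# The same cluster keeps up with a 200 ms trigger (zero accumulated lag):
print(backlog_after(100, interval_ms=200, process_ms=120))  # 0.0
```

The point matches the thread: the usable trigger interval is bounded below by per-batch processing time, so "just make the interval smaller" is not that simple.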
>>>>> > >>>>>>>>>>> Not that simple.
>>>>> > >>>>>>>>>>> HTH
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> view my LinkedIn profile
>>>>> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>> > >>>>>>>>>>>
>>>>> > >>>>>>>>>>>> Hi all,
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds).
>>>>> > >>>>>>>>>>>> We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>> > >>>>>>>>>>>>
>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
>>>>> > >>>>>>
>>>>> > >>>>>> --
>>>>> > >>>>>> Best,
>>>>> > >>>>>> Yanbo
>>>>> > >>
>>>>> > >> --
>>>>> > >> John Zhuge
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
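[Editor's note: the compatibility goal in the announcement, that the same high-level query runs unchanged while only the execution mode is switched, can be illustrated with a small strategy-pattern sketch. This is plain Python standing in for the proposed behaviour, not actual Spark code; `query` and `run` are invented names.]

```python
from typing import Callable, Iterable, List

# The user's "query" logic stays fixed; only the execution mode is swapped,
# mirroring the SPIP's goal of enabling Real-time Mode without query rewrites.
def query(records: Iterable[int]) -> List[int]:
    return [r * 2 for r in records]

def run(query_fn: Callable[[Iterable[int]], List[int]],
        records: List[int], mode: str) -> List[int]:
    if mode == "micro-batch":        # existing mode: process in small batches
        out: List[int] = []
        for i in range(0, len(records), 2):
            out.extend(query_fn(records[i:i + 2]))
        return out
    elif mode == "real-time":        # proposed mode: record-at-a-time
        out = []
        for r in records:
            out.extend(query_fn([r]))
        return out
    raise ValueError(f"unknown mode: {mode}")

# Identical results either way; only the latency characteristics differ.
print(run(query, [1, 2, 3], "micro-batch") == run(query, [1, 2, 3], "real-time"))  # True
```

The design point this sketches is that the mode is an execution-strategy choice, not part of the query's semantics, which is why a single trigger change can enable it for existing jobs.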