Clarifying what is meant by "real-time" and explicitly differentiating it from actual real-time computing should be a bare minimum. I still don't like the use of marketing-speak "real-time" that isn't really real-time in engineering documents or API namespaces.
On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

> Mark,
>
> I thought we were simply discussing the naming of the mode? Like I mentioned, if you think simply calling this mode "real-time" mode may cause confusion because "real-time" can mean other things in other fields, I can clarify what we mean by "real-time" explicitly in the SPIP document and any future documentation. That is not a problem, and thank you for your feedback.
>
> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com> wrote:
>
>> Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable.
>>
>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> As an example of my point, if you go to the website of Apache Storm (another stream processing engine):
>>>
>>> https://storm.apache.org/
>>>
>>> It describes Storm as:
>>>
>>> "Apache Storm is a free and open source distributed *realtime* computation system"
>>>
>>> If you go to Apache Flink:
>>>
>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>
>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>
>>> Thus, what the term "real-time" implies in this context should not be confusing for folks in this area.
>>>
>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> If I understood your last email correctly, I think you also wanted to have a discussion about naming: why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken, and we want another name to describe an execution mode that provides ultra-low-latency processing. We could have called it "low latency mode", though I don't really like that naming, since it implies the other execution modes are not low latency, which I don't believe is true. This new proposed mode can simply deliver even lower latency. Thus, we came up with the name "Real-time Mode". Of course, we are talking about "soft" real-time here. I think when we are talking about distributed stream processing systems in the space of big data analytics, it is reasonable to assume anything described in this space as "real-time" implies "soft" real-time. Though if this is confusing or misleading, we can provide clear documentation on what "real-time" in real-time mode means and what it guarantees. Just my thoughts; I would love to hear other perspectives.
>>>>
>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real-time compared to continuous modes. To be fair, it is a common point of confusion, and the terms are often used interchangeably in general conversation, but in technical contexts, especially with streaming data platforms, they have specific and important differences.
>>>>>
>>>>> "Continuous Mode" refers to a processing strategy that aims for true, uninterrupted, sub-millisecond latency processing. Chiefly (a sketch of the two existing trigger styles follows this list):
>>>>>
>>>>> - Event-at-a-Time (or very small batch groups): the system processes individual events, or extremely small groups of events (micro-batches), as they flow through the pipeline.
>>>>> - Minimal Latency: the primary goal is to achieve the absolute lowest possible end-to-end latency, often on the order of milliseconds or even below.
>>>>> - Most business use cases (say, financial markets) can live with this, as they do not rely on edge cases.
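For concreteness, here is a minimal sketch of the two existing trigger styles contrasted above, using Spark's current public trigger API (Trigger.ProcessingTime for micro-batch execution, Trigger.Continuous for the experimental continuous mode). The rate source, console sink, and one-second intervals are illustrative placeholders only:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object TriggerModesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("trigger-modes-sketch")
          .master("local[2]")
          .getOrCreate()

        // Toy source emitting (timestamp, value) rows at a fixed rate.
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()

        // Micro-batch execution: a new batch is planned roughly every second,
        // so end-to-end latency is bounded below by the batch interval.
        val microBatch = events.writeStream
          .format("console")
          .trigger(Trigger.ProcessingTime("1 second"))
          .start()

        // Continuous processing (experimental in Spark since 2.3): long-running
        // tasks handle records as they arrive, and the interval is only the
        // checkpoint frequency, not a batch boundary. Only map-like operations
        // (no aggregations) are supported, hence it is left commented out here.
        // val continuous = events.writeStream
        //   .format("console")
        //   .trigger(Trigger.Continuous("1 second"))
        //   .start()

        microBatch.awaitTermination()
      }
    }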
>>>>> Now, what is meant by "Real-time Mode"?
>>>>>
>>>>> This is where the nuance comes in. "Real-time" is a broader and sometimes more subjective term. When the text introduces "Real-time Mode" as distinct from "Continuous Mode", it suggests a specific implementation that achieves real-time characteristics but might do so differently, or more robustly, than a "continuous" mode attempt. Going back to my earlier mention: in a real-time application, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong. This is what I call the "Late and Correct is Useless" principle.
>>>>>
>>>>> In summary, "Real-time Mode" seems to describe an approach that delivers low-latency processing with high reliability and ease of use, leveraging established, battle-tested components. I invite the audience to have a discussion on this.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>>> > +1
>>>>>> >
>>>>>> > On Thu, May 29, 2025 at 02:22, Yuming Wang <yumw...@apache.org> wrote:
>>>>>> >
>>>>>> > > +1.
>>>>>> > >
>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>> > >
>>>>>> > >> +1
>>>>>> > >> Sent from my iPhone
>>>>>> > >>
>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>>>>> > >>
>>>>>> > >> +1 Nice feature
>>>>>> > >>
>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>>>>> > >>
>>>>>> > >>> +1
>>>>>> > >>>
>>>>>> > >>> On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
>>>>>> > >>>
>>>>>> > >>>> +1, LGTM.
>>>>>> > >>>>
>>>>>> > >>>> Kent
>>>>>> > >>>>
>>>>>> > >>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>>> > >>>>
>>>>>> > >>>>> +1. Super excited by this initiative!
>>>>>> > >>>>>
>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>>> > >>>>>
>>>>>> > >>>>>> +1
>>>>>> > >>>>>>
>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>> > >>>>>>
>>>>>> > >>>>>>> +1
>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>> Hi,
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> My point about "in a real-time application or data, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal aims to make Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, it is a fundamental concept, so it has to be treated as such, not as a comment in the SPIP.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>> Hey Mich,
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Thanks,
>>>>>> > >>>>>>>>> Denny
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>>> Just to add: a stronger definition of real time. The engineering definition of real time is roughly "fast enough to be interactive".
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> However, I put a stronger definition. In a real-time application (or with real-time data), there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
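The "Late and Correct is Useless" principle above can be made concrete with a small, hedged sketch: a result that exceeds its latency budget is dropped rather than emitted late. The 0.5-second budget, the rate source, and the use of the source timestamp as event time are illustrative assumptions, not anything specified in the SPIP:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, current_timestamp}

    object DeadlineFilterSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("deadline-filter-sketch")
          .master("local[2]")
          .getOrCreate()

        // Toy stream; the rate source's schema is (timestamp: Timestamp, value: Long).
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "100")
          .load()

        // Casting a timestamp to double yields epoch seconds with a fractional
        // part, so the difference below is each event's age in seconds at
        // processing time. Rows older than the assumed 0.5 s budget are
        // discarded as "useless", however correct their payload may be.
        val fresh = events.where(
          current_timestamp().cast("double") - col("timestamp").cast("double") <= 0.5
        )

        fresh.writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }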
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> > >>>>>>>>>>
>>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval is specified. To quote the SPIP:
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> "
>>>>>> > >>>>>>>>>>> df.writeStream
>>>>>> > >>>>>>>>>>>   .format("...")
>>>>>> > >>>>>>>>>>>   .option("...")
>>>>>> > >>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>> > >>>>>>>>>>>   .start()
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>> > >>>>>>>>>>> "
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> The right interval will inevitably depend on many factors. It is not that simple.
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> HTH
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>> > >>>>>>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
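Read together with the SPIP snippet quoted above, the compatibility claim can be sketched as follows. Trigger.RealTime is the SPIP's proposed API and does not exist in any released Spark version, so this sketch will not compile against current Spark; the Kafka broker, topic, and checkpoint path are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object RealTimeModeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("real-time-mode-sketch")
          .getOrCreate()

        // Ordinary streaming source; nothing here is specific to the new mode.
        val orders = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "orders")                    // placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS order_json")

        // Per the SPIP, only the trigger changes: swapping, say,
        // Trigger.ProcessingTime("1 second") for the proposed Trigger.RealTime
        // opts the otherwise unchanged query into the ultra-low-latency mode.
        orders.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/real-time-sketch") // placeholder
          .trigger(Trigger.RealTime("300 seconds")) // proposed API, per the SPIP
          .start()
          .awaitTermination()
      }
    }

The point of the sketch is that the source, transformations, and sink remain ordinary Structured Streaming code; only the trigger line changes.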
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>> > >>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> Hi all,
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>> > >>>>>>>>>>>>
>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!