Hi,

My point that "in a real-time application or data, there is no such thing as an answer that is late and correct. Timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, which operates primarily on micro-batches, often delivers results that are technically "correct" but arrive too late to be useful in certain high-stakes, real-time scenarios. That makes them "useless or wrong" in a practical, business-critical sense.

Take *real-time fraud detection*, or *high-frequency trading*, where market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, so a "correct" price update is useless if it is not instantaneous.

The proposal is about making Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough.

As a corollary, this is a fundamental concept, so it should be treated as such in the SPIP, not as a comment.

Hope this clarifies the connection in practical terms.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

> Hey Mich,
>
> Sorry, I may be missing something here but what does your definition here
> have to do with the SPIP? Perhaps add comments directly to the SPIP to
> provide context as the code snippet below is a direct copy from the SPIP
> itself.
>
> Thanks,
> Denny
>
> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Just to add a stronger definition of real time. The engineering
>> definition of real time is roughly "fast enough to be interactive".
>>
>> However, I put forward a stronger definition. In a real-time application
>> or data, there is no such thing as an answer that is late and correct.
>> The timeliness is part of the application. If I get the right answer too
>> slowly, it becomes useless or wrong.
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> The current limitations in SSS come from micro-batching. If you are going
>>> to reduce micro-batching, this reduction must be balanced against the
>>> available processing capacity of the cluster to prevent back pressure and
>>> instability. In the case of Continuous Processing mode, a specific
>>> continuous trigger with a desired checkpoint interval, quote
>>>
>>> "
>>> df.writeStream
>>>   .format("...")
>>>   .option("...")
>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>   .start()
>>> This Trigger.RealTime signals that the query should run in the new ultra
>>> low-latency execution mode. A time interval can also be specified, e.g.
>>> “300 Seconds”, to indicate how long each micro-batch should run for.
>>> "
>>>
>>> will inevitably depend on many factors. It is not that simple.
>>> HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to start a discussion thread for the SPIP titled “Real-Time Mode
>>>> in Apache Spark Structured Streaming” that I've been working on with Siying
>>>> Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA
>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>>> ].
>>>>
>>>> The SPIP proposes a new execution mode called “Real-time Mode” in Spark
>>>> Structured Streaming that significantly lowers end-to-end latency for
>>>> processing streams of data.
>>>>
>>>> A key principle of this proposal is compatibility. Our goal is to make
>>>> Spark capable of handling streaming jobs that need results almost
>>>> immediately (within O(100) milliseconds). We want to achieve this without
>>>> changing the high-level DataFrame/Dataset API that users already use – so
>>>> existing streaming queries can run in this new ultra-low-latency mode by
>>>> simply turning it on, without rewriting their logic.
>>>>
>>>> In short, we’re trying to enable Spark to power real-time applications
>>>> (like instant anomaly alerts or live personalization) that today cannot
>>>> meet their latency requirements with Spark’s current streaming engine.
>>>>
>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>> this approach!
>>>>