Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

DB Tsai Thu, 29 May 2025 00:38:14 -0700

Sent from my iPhone

On May 29, 2025, at 12:15 AM, John Zhuge <[email protected]> wrote:

+1 Nice feature

On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <[email protected]> wrote:
+1

Kent Yao <[email protected]> 于2025年5月28日周三 19:31写道：
+1, LGTM.

Kent

在 2025年5月29日星期四，Chao Sun <[email protected]> 写道：
+1. Super excited by this initiative!

On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <[email protected]> wrote:
+1

On Wed, May 28, 2025 at 12:34 PM huaxin gao <[email protected]> wrote:
+1
By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!

On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <[email protected]> wrote:
Hi,
My point about "in real time application or data, there is nothing as an answer which is supposed to be late and correct. The timeliness is part of the application. if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to why we need this Spark Structured Streaming proposal.
The proposal is precisely about enabling Spark to power applications where, as I define it, the timeliness of the answer is as critical as its correctness. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
For example in real-time fraud detection and In high-frequency trading, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it's not instantaneous. able for these demanding use cases, where a "late but correct" answer is simply not good enough. As a colliery it is a fundamental concept, so it has to be treated as such not as a comment.in SPIP
Hope this clarifies the connection in practical terms

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile

On Wed, 28 May 2025 at 16:32, Denny Lee <[email protected]> wrote:
Hey Mich,

Sorry, I may be missing something here but what does your definition here have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context as the code snippet below is a direct copy from the SPIP itself.

Thanks,
Denny

On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <[email protected]> wrote:
just to add

A stronger definition of real time. The engineering definition of real time is roughly fast enough to be interactive

However, I put a stronger definition. In real time application or data, there is nothing as an answer which is supposed to be late and correct. The timeliness is part of the application.if I get the right answer too slowly it becomes useless or wrong

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile

On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <[email protected]> wrote:
The current limitations in SSS come from micro-batching.If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval quote

"
df.writeStream
.format("...")
.option("...")
.trigger(Trigger.RealTime(“300 Seconds”)) // new trigger type to enable real-time Mode
.start()
This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. “300 Seconds”, to indicate how long each micro-batch should run for.
"
will inevitably depend on many factors. Not that simple
HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile

On Wed, 28 May 2025 at 05:13, Jerry Peng <[email protected]> wrote:
Hi all,

I want to start a discussion thread for the SPIP titled “Real-Time Mode in Apache Spark Structured Streaming” that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA] [Doc].

The SPIP proposes a new execution mode called “Real-time Mode” in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use – so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark’s current streaming engine.

We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!

--
Best,
Yanbo

--
John Zhuge

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

Reply via email to