+1. On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
> +1
>
> Sent from my iPhone
>
> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>
> > +1 Nice feature
> >
> > On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
> >
> >> +1
> >>
> >> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
> >>
> >>> +1, LGTM.
> >>>
> >>> Kent
> >>>
> >>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
> >>>
> >>>> +1. Super excited by this initiative!
> >>>>
> >>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
> >>>>>
> >>>>>> +1
> >>>>>> By unifying batch and low-latency streaming in Spark, we can
> >>>>>> eliminate the need for separate streaming engines, reducing system
> >>>>>> complexity and operational cost. Excited to see this direction!
> >>>>>>
> >>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> My point about "in a real-time application or data, there is no such
> >>>>>>> thing as an answer which is supposed to be late and correct. The
> >>>>>>> timeliness is part of the application. If I get the right answer too
> >>>>>>> slowly it becomes useless or wrong" is actually fundamental to *why*
> >>>>>>> we need this Spark Structured Streaming proposal.
> >>>>>>>
> >>>>>>> The proposal is precisely about enabling Spark to power applications
> >>>>>>> where, as I define it, the *timeliness* of the answer is as critical
> >>>>>>> as its *correctness*. Spark's current streaming engine, primarily
> >>>>>>> operating on micro-batches, often delivers results that are
> >>>>>>> technically "correct" but arrive too late to be truly useful for
> >>>>>>> certain high-stakes, real-time scenarios. This makes them "useless or
> >>>>>>> wrong" in a practical, business-critical sense.
> >>>>>>>
> >>>>>>> For example, in *real-time fraud detection* and in *high-frequency
> >>>>>>> trading*, market data or trade execution commands must be delivered
> >>>>>>> with minimal latency. Even a slight delay can mean missed
> >>>>>>> opportunities or significant financial losses, making a "correct"
> >>>>>>> price update useless if it's not instantaneous. This proposal aims to
> >>>>>>> make Spark viable for these demanding use cases, where a "late but
> >>>>>>> correct" answer is simply not good enough. As a corollary, this is a
> >>>>>>> fundamental concept, so it has to be treated as such in the SPIP, not
> >>>>>>> as a comment.
> >>>>>>>
> >>>>>>> Hope this clarifies the connection in practical terms.
> >>>>>>>
> >>>>>>> Dr Mich Talebzadeh,
> >>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>
> >>>>>>> view my Linkedin profile
> >>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>
> >>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hey Mich,
> >>>>>>>>
> >>>>>>>> Sorry, I may be missing something here, but what does your
> >>>>>>>> definition have to do with the SPIP? Perhaps add comments directly
> >>>>>>>> to the SPIP to provide context, as the code snippet below is a
> >>>>>>>> direct copy from the SPIP itself.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Denny
> >>>>>>>>
> >>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Just to add:
> >>>>>>>>>
> >>>>>>>>> A stronger definition of real time. The engineering definition of
> >>>>>>>>> real time is roughly "fast enough to be interactive".
> >>>>>>>>>
> >>>>>>>>> However, I put forward a stronger definition. In a real-time
> >>>>>>>>> application or data, there is no such thing as an answer which is
> >>>>>>>>> supposed to be late and correct.
> >>>>>>>>> The timeliness is part of the application. If I get the right
> >>>>>>>>> answer too slowly, it becomes useless or wrong.
> >>>>>>>>>
> >>>>>>>>> Dr Mich Talebzadeh,
> >>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>>>
> >>>>>>>>> view my Linkedin profile
> >>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>>>
> >>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> The current limitations in SSS come from micro-batching. If you
> >>>>>>>>>> are going to reduce micro-batching, this reduction must be
> >>>>>>>>>> balanced against the available processing capacity of the cluster
> >>>>>>>>>> to prevent back pressure and instability. In the case of
> >>>>>>>>>> Continuous Processing mode, a specific continuous trigger with a
> >>>>>>>>>> desired checkpoint interval, quote:
> >>>>>>>>>>
> >>>>>>>>>> "
> >>>>>>>>>> df.writeStream
> >>>>>>>>>>   .format("...")
> >>>>>>>>>>   .option("...")
> >>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
> >>>>>>>>>>   .start()
> >>>>>>>>>>
> >>>>>>>>>> This Trigger.RealTime signals that the query should run in the
> >>>>>>>>>> new ultra low-latency execution mode. A time interval can also be
> >>>>>>>>>> specified, e.g. "300 Seconds", to indicate how long each
> >>>>>>>>>> micro-batch should run for.
> >>>>>>>>>> "
> >>>>>>>>>>
> >>>>>>>>>> will inevitably depend on many factors.
> >>>>>>>>>> Not that simple.
> >>>>>>>>>>
> >>>>>>>>>> HTH
> >>>>>>>>>>
> >>>>>>>>>> Dr Mich Talebzadeh,
> >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>>>>
> >>>>>>>>>> view my Linkedin profile
> >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I want to start a discussion thread for the SPIP titled
> >>>>>>>>>>> "Real-Time Mode in Apache Spark Structured Streaming" that I've
> >>>>>>>>>>> been working on with Siying Dong, Indrajit Roy, Chao Sun,
> >>>>>>>>>>> Jungtaek Lim, and Michael Armbrust: [JIRA
> >>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
> >>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
> >>>>>>>>>>>
> >>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode"
> >>>>>>>>>>> in Spark Structured Streaming that significantly lowers
> >>>>>>>>>>> end-to-end latency for processing streams of data.
> >>>>>>>>>>>
> >>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is
> >>>>>>>>>>> to make Spark capable of handling streaming jobs that need
> >>>>>>>>>>> results almost immediately (within O(100) milliseconds). We want
> >>>>>>>>>>> to achieve this without changing the high-level
> >>>>>>>>>>> DataFrame/Dataset API that users already use, so existing
> >>>>>>>>>>> streaming queries can run in this new ultra-low-latency mode by
> >>>>>>>>>>> simply turning it on, without rewriting their logic.
> >>>>>>>>>>>
> >>>>>>>>>>> In short, we're trying to enable Spark to power real-time
> >>>>>>>>>>> applications (like instant anomaly alerts or live
> >>>>>>>>>>> personalization) that today cannot meet their latency
> >>>>>>>>>>> requirements with Spark's current streaming engine.
> >>>>>>>>>>>
> >>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions
> >>>>>>>>>>> on this approach!
> >>>>>
> >>>>> --
> >>>>> Best,
> >>>>> Yanbo
>
> --
> John Zhuge
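
[Editor's note: the latency argument in the thread above — that a micro-batch engine adds, on average, half the trigger interval of extra latency to every record, while a per-record engine adds only its fixed processing delay — can be illustrated with a small simulation. This is plain Python, not Spark; the function names and the chosen 1-second batch interval and 50 ms per-record delay are illustrative assumptions, not numbers from the SPIP.]

```python
import random

def added_latency_microbatch(arrivals, batch_interval):
    """Latency a batching engine adds: each record waits until the end of
    the micro-batch it falls into before its result can be emitted."""
    return [((t // batch_interval) + 1) * batch_interval - t for t in arrivals]

def added_latency_per_record(arrivals, processing_delay):
    """A per-record (real-time) engine adds only a fixed processing delay."""
    return [processing_delay for _ in arrivals]

random.seed(42)
# 10,000 records arriving at uniformly random times over a 60-second window.
arrivals = [random.uniform(0.0, 60.0) for _ in range(10_000)]

mb = added_latency_microbatch(arrivals, batch_interval=1.0)      # 1 s trigger
rt = added_latency_per_record(arrivals, processing_delay=0.05)   # 50 ms/record

# A record waits on average half the batch interval, and up to the full interval.
print(f"micro-batch: mean {sum(mb) / len(mb):.3f}s, max {max(mb):.3f}s")
print(f"per-record : mean {sum(rt) / len(rt):.3f}s, max {max(rt):.3f}s")
```

This is why shrinking the trigger interval, as Mich notes, trades latency against cluster headroom: halving the interval halves the added latency but doubles the scheduling overhead per unit time, which is the tension the proposed real-time mode is meant to escape.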