+1

On Thu, May 29, 2025 at 02:22, Yuming Wang <yumw...@apache.org> wrote:
> +1.
>
> On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>
>> +1
>> Sent from my iPhone
>>
>> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>
>> +1 Nice feature
>>
>> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
>>>
>>>> +1, LGTM.
>>>>
>>>> Kent
>>>>
>>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>
>>>>> +1. Super excited by this initiative!
>>>>>
>>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>>>>
>>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My point that "in a real-time application, there is no such thing as an answer that is late and correct; timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>>>>>
>>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>>>>>
>>>>>>>> For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade-execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>>>>>>
>>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>
>>>>>>>> view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Mich,
>>>>>>>>>
>>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Denny
>>>>>>>>>
>>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> just to add
>>>>>>>>>>
>>>>>>>>>> A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".
>>>>>>>>>>
>>>>>>>>>> However, I would put a stronger definition. In a real-time application, there is no such thing as an answer that is late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The current limitations in Spark Structured Streaming come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. As with Continuous Processing mode, the choice of a specific trigger with a desired checkpoint interval, quote:
>>>>>>>>>>>
>>>>>>>>>>> "
>>>>>>>>>>> df.writeStream
>>>>>>>>>>>   .format("...")
>>>>>>>>>>>   .option("...")
>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>>>>>>>   .start()
>>>>>>>>>>>
>>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>>>>>>> "
>>>>>>>>>>>
>>>>>>>>>>> will inevitably depend on many factors. Not that simple.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh
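>>>>>>>>>>>
>>>>>>>>>>> For reference, today's Continuous Processing mode is enabled the same way, by swapping the trigger. A minimal sketch with the stable API follows; the rate source, console sink, checkpoint path, and one-second interval are all illustrative choices:
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>>>>>>>
>>>>>>>>>>> val spark = SparkSession.builder
>>>>>>>>>>>   .appName("continuous-sketch")
>>>>>>>>>>>   .master("local[*]") // for local experimentation only
>>>>>>>>>>>   .getOrCreate()
>>>>>>>>>>>
>>>>>>>>>>> // The rate source emits rows continuously and is one of the few
>>>>>>>>>>> // sources supported by continuous processing.
>>>>>>>>>>> val df = spark.readStream
>>>>>>>>>>>   .format("rate")
>>>>>>>>>>>   .option("rowsPerSecond", "10")
>>>>>>>>>>>   .load()
>>>>>>>>>>>
>>>>>>>>>>> val query = df.writeStream
>>>>>>>>>>>   .format("console")
>>>>>>>>>>>   .option("checkpointLocation", "/tmp/continuous-cp") // illustrative path
>>>>>>>>>>>   .trigger(Trigger.Continuous("1 second")) // interval between checkpoints, not a batch interval
>>>>>>>>>>>   .start()
>>>>>>>>>>>
>>>>>>>>>>> query.awaitTermination()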
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>>>>>>
>>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>>>>>>>>
>>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>>>>>>>>
>>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>>>>>>>>
>>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
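>>>>>>>>>>>>
>>>>>>>>>>>> To make the compatibility point concrete, here is a hypothetical before/after sketch: the query logic is identical and only the trigger changes. Note that Trigger.RealTime is the API proposed in the SPIP and is not part of any released Spark version; the Kafka source, sink, and intervals below are illustrative:
>>>>>>>>>>>>
>>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>>>>>>>>
>>>>>>>>>>>> val spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()
>>>>>>>>>>>>
>>>>>>>>>>>> // An illustrative streaming source; any supported source would do.
>>>>>>>>>>>> val events = spark.readStream
>>>>>>>>>>>>   .format("kafka")
>>>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>>>>>>>>>>>   .option("subscribe", "events")
>>>>>>>>>>>>   .load()
>>>>>>>>>>>>
>>>>>>>>>>>> // Today: micro-batch execution with a processing-time trigger (stable API).
>>>>>>>>>>>> val batchQuery = events.writeStream
>>>>>>>>>>>>   .format("console")
>>>>>>>>>>>>   .option("checkpointLocation", "/tmp/cp-batch")
>>>>>>>>>>>>   .trigger(Trigger.ProcessingTime("1 minute"))
>>>>>>>>>>>>   .start()
>>>>>>>>>>>>
>>>>>>>>>>>> // With the proposal: the same query opts into real-time mode by
>>>>>>>>>>>> // changing only the trigger (proposed API, per the SPIP doc).
>>>>>>>>>>>> val realTimeQuery = events.writeStream
>>>>>>>>>>>>   .format("console")
>>>>>>>>>>>>   .option("checkpointLocation", "/tmp/cp-realtime")
>>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // from the SPIP; not yet in Spark
>>>>>>>>>>>>   .start()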