+1 Nice feature

On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:

> +1
>
> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>
>> +1, LGTM.
>>
>> Kent
>>
>> On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>
>>> +1. Super excited by this initiative!
>>>
>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate
>>>>> the need for separate streaming engines, reducing system complexity and
>>>>> operational cost. Excited to see this direction!
>>>>>
>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> My point, that "in a real-time application, there is no such thing as
>>>>>> an answer that is late but correct; timeliness is part of the
>>>>>> application, and if I get the right answer too slowly it becomes
>>>>>> useless or wrong", is actually fundamental to *why* we need this Spark
>>>>>> Structured Streaming proposal.
>>>>>>
>>>>>> The proposal is precisely about enabling Spark to power applications
>>>>>> where, as I define it, the *timeliness* of the answer is as critical
>>>>>> as its *correctness*. Spark's current streaming engine, primarily
>>>>>> operating on micro-batches, often delivers results that are technically
>>>>>> "correct" but arrive too late to be truly useful for certain high-stakes,
>>>>>> real-time scenarios. This makes them "useless or wrong" in a practical,
>>>>>> business-critical sense.
>>>>>>
>>>>>> Consider, for example, *real-time fraud detection* and *high-frequency
>>>>>> trading*: market data or trade-execution commands must be delivered
>>>>>> with minimal latency. Even a slight delay can mean missed opportunities
>>>>>> or significant financial losses, making a "correct" price update
>>>>>> useless if it is not near-instantaneous. This proposal aims to make
>>>>>> Spark viable for these demanding use cases, where a "late but correct"
>>>>>> answer is simply not good enough. As a corollary, timeliness is a
>>>>>> fundamental concept, so it has to be treated as such in the SPIP, not
>>>>>> as a mere comment.
>>>>>>
>>>>>> Hope this clarifies the connection in practical terms.
>>>>>> Dr Mich Talebzadeh,
>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Mich,
>>>>>>>
>>>>>>> Sorry, I may be missing something, but what does your definition
>>>>>>> have to do with the SPIP? Perhaps add comments directly to the SPIP
>>>>>>> to provide context, as the code snippet below is a direct copy from
>>>>>>> the SPIP itself.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Denny
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> just to add
>>>>>>>>
>>>>>>>> A stronger definition of real time: the engineering definition of
>>>>>>>> real time is roughly "fast enough to be interactive".
>>>>>>>>
>>>>>>>> However, I put forward a stronger definition. In a real-time
>>>>>>>> application, there is no such thing as an answer that is late but
>>>>>>>> correct; timeliness is part of the application, and if I get the
>>>>>>>> right answer too slowly it becomes useless or wrong.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>> GDPR
>>>>>>>>
>>>>>>>>    view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The current limitations in SSS come from micro-batching. If you are
>>>>>>>>> going to shrink the micro-batch interval, the reduction must be
>>>>>>>>> balanced against the available processing capacity of the cluster to
>>>>>>>>> prevent back pressure and instability. In the case of Continuous
>>>>>>>>> Processing mode, you specify a continuous trigger with a desired
>>>>>>>>> checkpoint interval.
>>>>>>>>> "
>>>>>>>>> df.writeStream
>>>>>>>>>    .format("...")
>>>>>>>>>    .option("...")
>>>>>>>>>    .trigger(Trigger.RealTime(“300 Seconds”))    // new trigger
>>>>>>>>> type to enable real-time Mode
>>>>>>>>>    .start()
>>>>>>>>> This Trigger.RealTime signals that the query should run in the new
>>>>>>>>> ultra low-latency execution mode.  A time interval can also be 
>>>>>>>>> specified,
>>>>>>>>> e.g. “300 Seconds”, to indicate how long each micro-batch should run 
>>>>>>>>> for.
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> The right interval will inevitably depend on many factors; it is not
>>>>>>>>> that simple.
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>>> GDPR
>>>>>>>>>
>>>>>>>>>    view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that I've been
>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, 
>>>>>>>>>> and
>>>>>>>>>> Michael Armbrust: [JIRA
>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>>>>>>>>> ].
>>>>>>>>>>
>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” in
>>>>>>>>>> Spark Structured Streaming that significantly lowers end-to-end 
>>>>>>>>>> latency for
>>>>>>>>>> processing streams of data.
>>>>>>>>>>
>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to
>>>>>>>>>> make Spark capable of handling streaming jobs that need results 
>>>>>>>>>> almost
>>>>>>>>>> immediately (within O(100) milliseconds). We want to achieve this 
>>>>>>>>>> without
>>>>>>>>>> changing the high-level DataFrame/Dataset API that users already use 
>>>>>>>>>> – so
>>>>>>>>>> existing streaming queries can run in this new ultra-low-latency 
>>>>>>>>>> mode by
>>>>>>>>>> simply turning it on, without rewriting their logic.
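>>>>>>>>>>
>>>>>>>>>> To illustrate "turning it on", here is a minimal sketch of an
>>>>>>>>>> existing query opting into the proposed mode. Trigger.RealTime is
>>>>>>>>>> the trigger proposed in the SPIP (not part of current Spark), and
>>>>>>>>>> the Kafka source options are placeholder assumptions:
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>>>>>>
>>>>>>>>>> val events = spark.readStream
>>>>>>>>>>   .format("kafka")
>>>>>>>>>>   .option("kafka.bootstrap.servers", "host:9092")
>>>>>>>>>>   .option("subscribe", "events")
>>>>>>>>>>   .load()
>>>>>>>>>>
>>>>>>>>>> events.writeStream
>>>>>>>>>>   .format("console")
>>>>>>>>>>   // the only change versus an existing micro-batch query:
>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))  // proposed API, per the SPIP
>>>>>>>>>>   .start()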
>>>>>>>>>>
>>>>>>>>>> In short, we’re trying to enable Spark to power real-time
>>>>>>>>>> applications (like instant anomaly alerts or live personalization) 
>>>>>>>>>> that
>>>>>>>>>> today cannot meet their latency requirements with Spark’s current 
>>>>>>>>>> streaming
>>>>>>>>>> engine.
>>>>>>>>>>
>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions
>>>>>>>>>> on this approach!
>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Best,
>>>> Yanbo
>>>>
>>>

-- 
John Zhuge
