Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

Kent Yao Wed, 28 May 2025 19:31:30 -0700

+1, LGTM.

Kent


On Thursday, May 29, 2025, Chao Sun <[email protected]> wrote:

> +1. Super excited by this initiative!
>
> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <[email protected]> wrote:
>
>> +1
>>
>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <[email protected]>
>> wrote:
>>
>>> +1
>>> By unifying batch and low-latency streaming in Spark, we can eliminate
>>> the need for separate streaming engines, reducing system complexity and
>>> operational cost. Excited to see this direction!
>>>
>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> My point that "in a real-time application or data, there is no such
>>>> thing as an answer that is late and correct. Timeliness is part of the
>>>> application: if I get the right answer too slowly, it becomes useless
>>>> or wrong" is actually fundamental to *why* we need this Spark
>>>> Structured Streaming proposal.
>>>>
>>>> The proposal is precisely about enabling Spark to power applications
>>>> where, as I define it, the *timeliness* of the answer is as critical
>>>> as its *correctness*. Spark's current streaming engine, primarily
>>>> operating on micro-batches, often delivers results that are technically
>>>> "correct" but arrive too late to be truly useful for certain high-stakes,
>>>> real-time scenarios. This makes them "useless or wrong" in a practical,
>>>> business-critical sense.
>>>>
>>>> For example, in *real-time fraud detection* and in *high-frequency
>>>> trading*, market data or trade execution commands must be delivered
>>>> with minimal latency. Even a slight delay can mean missed opportunities
>>>> or significant financial losses, making a "correct" price update useless
>>>> if it is not instantaneous. The proposal is aimed at these demanding use
>>>> cases, where a "late but correct" answer is simply not good enough. As a
>>>> corollary, timeliness is a fundamental concept, so it should be treated
>>>> as such in the SPIP, not as a comment.
>>>>
>>>> Hope this clarifies the connection in practical terms.
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <[email protected]> wrote:
>>>>
>>>>> Hey Mich,
>>>>>
>>>>> Sorry, I may be missing something here, but what does your definition
>>>>> have to do with the SPIP? Perhaps add comments directly to the SPIP to
>>>>> provide context, as the code snippet below is a direct copy from the
>>>>> SPIP itself.
>>>>>
>>>>> Thanks,
>>>>> Denny
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Just to add a stronger definition of real time.
>>>>>>
>>>>>> The engineering definition of real time is roughly "fast enough to be
>>>>>> interactive".
>>>>>>
>>>>>> However, I propose a stronger definition. In a real-time application or
>>>>>> data, there is no such thing as an answer that is late and correct.
>>>>>> Timeliness is part of the application: if I get the right answer too
>>>>>> slowly, it becomes useless or wrong.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> The current limitations in Spark Structured Streaming (SSS) come from
>>>>>>> micro-batching. If you are going to reduce micro-batching, this
>>>>>>> reduction must be balanced against the available processing capacity of
>>>>>>> the cluster to prevent back pressure and instability. In the case of
>>>>>>> Continuous Processing mode, a specific continuous trigger with a desired
>>>>>>> checkpoint interval, quote:
>>>>>>>
>>>>>>> "
>>>>>>> df.writeStream
>>>>>>>    .format("...")
>>>>>>>    .option("...")
>>>>>>>    // new trigger type to enable real-time Mode
>>>>>>>    .trigger(Trigger.RealTime("300 Seconds"))
>>>>>>>    .start()
>>>>>>>
>>>>>>> This Trigger.RealTime signals that the query should run in the new
>>>>>>> ultra low-latency execution mode. A time interval can also be
>>>>>>> specified, e.g. "300 Seconds", to indicate how long each micro-batch
>>>>>>> should run for.
>>>>>>> "
>>>>>>>
>>>>>>> will inevitably depend on many factors. It is not that simple.
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I want to start a discussion thread for the SPIP titled “Real-Time
>>>>>>>> Mode in Apache Spark Structured Streaming” that I've been working on
>>>>>>>> with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael
>>>>>>>> Armbrust:
>>>>>>>> [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>>
>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” in
>>>>>>>> Spark Structured Streaming that significantly lowers end-to-end
>>>>>>>> latency for processing streams of data.
>>>>>>>>
>>>>>>>> A key principle of this proposal is compatibility. Our goal is to
>>>>>>>> make Spark capable of handling streaming jobs that need results almost
>>>>>>>> immediately (within O(100) milliseconds). We want to achieve this
>>>>>>>> without changing the high-level DataFrame/Dataset API that users
>>>>>>>> already use – so existing streaming queries can run in this new
>>>>>>>> ultra-low-latency mode by simply turning it on, without rewriting
>>>>>>>> their logic.
>>>>>>>>
>>>>>>>> In short, we’re trying to enable Spark to power real-time
>>>>>>>> applications (like instant anomaly alerts or live personalization) that
>>>>>>>> today cannot meet their latency requirements with Spark’s current
>>>>>>>> streaming engine.
>>>>>>>>
>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>>>>>> this approach!
>>>>>>>>
>>>>>>>>
>>
>> --
>> Best,
>> Yanbo
>>
>
