+1

On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
+1, LGTM.

Kent

On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:

+1. Super excited by this initiative!

On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:

+1

On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:

+1
By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!

On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

My point that "in a real-time application or data, there is no such thing as an answer that is late and correct; timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.

The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.

For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal is about making Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.

Hope this clarifies the connection in practical terms.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

Hey Mich,

Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.

Thanks,
Denny

On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Just to add:

A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".

However, I put forward a stronger definition. In a real-time application or data, there is no such thing as an answer that is supposed to be late and correct.
The timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, choosing a specific continuous trigger with a desired checkpoint interval, quote:

"
    df.writeStream
      .format("...")
      .option("...")
      .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
      .start()

This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
"

will inevitably depend on many factors. Not that simple.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Hi all,

I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].

The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.

In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.

We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
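[For readers following the thread, a minimal end-to-end sketch of the API under discussion. Trigger.RealTime is the trigger proposed in the SPIP and does not exist in any released Spark version; the rate source, console sink, and interval value here are illustrative choices to keep the example self-contained.]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder()
      .appName("real-time-mode-sketch")
      .getOrCreate()

    // Any ordinary streaming DataFrame; the built-in rate source keeps this
    // sketch runnable without external infrastructure.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Today's micro-batch execution: results surface once per trigger interval.
    // events.writeStream.trigger(Trigger.ProcessingTime("10 seconds")) ...

    // Proposed real-time mode: the same query with only the trigger changed.
    // Trigger.RealTime is the SPIP's proposed API, not part of released Spark;
    // per the SPIP, the interval indicates how long each batch should run for.
    val query = events.writeStream
      .format("console")
      .trigger(Trigger.RealTime("300 Seconds"))
      .start()

    query.awaitTermination()

Everything except the trigger line is standard Structured Streaming code that runs today, which is the compatibility property the proposal emphasizes.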