+1. Super excited by this initiative!

On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
> +1
>
> --
> Best,
> Yanbo
>
> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> +1
>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>
>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My point that "in a real-time application or data feed, there is no such thing as an answer that is supposed to be late and correct; the timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>
>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>
>>> For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal is about making Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>
>>> Hope this clarifies the connection in practical terms.
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>
>>>> Hey Mich,
>>>>
>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>
>>>> Thanks,
>>>> Denny
>>>>
>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Just to add:
>>>>>
>>>>> A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".
>>>>>
>>>>> However, I put forward a stronger definition. In a real-time application or data feed, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> The current limitations in SSS (Spark Structured Streaming) come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, you specify a continuous trigger with a desired checkpoint interval. To quote the SPIP:
>>>>>>
>>>>>> "
>>>>>> df.writeStream
>>>>>>   .format("...")
>>>>>>   .option("...")
>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>>   .start()
>>>>>>
>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>> "
>>>>>>
>>>>>> How well this works will inevitably depend on many factors. Not that simple.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>
>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
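Mich's capacity point can be made concrete with today's micro-batch engine: shrinking the trigger interval only helps if the input per batch is bounded so the cluster can keep up. A minimal sketch, assuming a hypothetical Kafka topic "events" and broker address; maxOffsetsPerTrigger (an existing Kafka source option) caps how much each short-interval batch pulls:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ShortIntervalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("short-interval-microbatch")
      .getOrCreate()

    // Bound the work per micro-batch so a short trigger interval does not
    // outrun cluster capacity and build up back pressure.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .option("maxOffsetsPerTrigger", "10000")          // cap records per batch
      .load()

    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second")) // short micro-batch interval
      .start()
      .awaitTermination()
  }
}

Even with such caps, per-batch scheduling overhead puts a floor on latency, which is the gap the proposed Real-Time Mode targets.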
>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>
>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>>>
>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>>>
>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>>>
>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
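To illustrate the compatibility principle in the announcement above: a minimal sketch of what "simply turning it on" could look like, assuming Trigger.RealTime lands with the shape quoted earlier in the thread (a proposed API, not yet in Spark); the rate source and the filter are hypothetical stand-ins for a real query:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object RealTimeModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("realtime-mode-sketch")
      .getOrCreate()

    // An existing query: the DataFrame logic stays identical in both variants.
    val alerts = spark.readStream
      .format("rate")            // built-in test source
      .load()
      .filter("value % 100 = 0") // hypothetical stand-in for detection logic

    // Today: micro-batch execution.
    alerts.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()
      .awaitTermination()

    // With the SPIP: the same query opts in to real-time mode by swapping
    // only the trigger (proposed API, per the SPIP excerpt quoted above):
    //   .trigger(Trigger.RealTime("300 Seconds"))
  }
}

The point of the sketch is that the query body and sink are untouched; only the trigger selects the execution mode.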