Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

Denny Lee Wed, 28 May 2025 08:33:15 -0700

Hey Mich,

Sorry, I may be missing something, but what does your definition have to
do with the SPIP? Perhaps add comments directly to the SPIP to provide
context, as the code snippet below is a direct copy from the SPIP itself.


Thanks,
Denny




On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Just to add a stronger definition of real time.
>
> The engineering definition of real time is roughly "fast enough to be
> interactive".
>
> However, I would put forward a stronger definition. In a real-time
> application, or with real-time data, there is no such thing as an answer
> that is late and correct. Timeliness is part of the application: if I get
> the right answer too slowly, it becomes useless or wrong.
>
>
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> The current limitations in SSS (Spark Structured Streaming) come from
>> micro-batching. If you are going to shorten the micro-batches, this
>> reduction must be balanced against the available processing capacity of
>> the cluster to prevent back pressure and instability. In the case of
>> Continuous Processing mode, a specific continuous trigger with a desired
>> checkpoint interval, to quote the SPIP:
>>
>> "
>> df.writeStream
>>    .format("...")
>>    .option("...")
>>    // new trigger type to enable real-time Mode
>>    .trigger(Trigger.RealTime("300 Seconds"))
>>    .start()
>> This Trigger.RealTime signals that the query should run in the new ultra
>> low-latency execution mode.  A time interval can also be specified, e.g.
>> “300 Seconds”, to indicate how long each micro-batch should run for.
>> "
>>
>> will inevitably depend on many factors. It is not that simple.
>> HTH
>>
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I want to start a discussion thread for the SPIP titled “Real-Time Mode
>>> in Apache Spark Structured Streaming” that I've been working on with Siying
>>> Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA
>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>> ].
>>>
>>> The SPIP proposes a new execution mode called “Real-time Mode” in Spark
>>> Structured Streaming that significantly lowers end-to-end latency for
>>> processing streams of data.
>>>
>>> A key principle of this proposal is compatibility. Our goal is to make
>>> Spark capable of handling streaming jobs that need results almost
>>> immediately (within O(100) milliseconds). We want to achieve this without
>>> changing the high-level DataFrame/Dataset API that users already use – so
>>> existing streaming queries can run in this new ultra-low-latency mode by
>>> simply turning it on, without rewriting their logic.
>>>
>>> In short, we’re trying to enable Spark to power real-time applications
>>> (like instant anomaly alerts or live personalization) that today cannot
>>> meet their latency requirements with Spark’s current streaming engine.
>>>
>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this
>>> approach!
>>>
>>>
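
The back-pressure point raised earlier in the thread (shorter batches must
be balanced against the cluster's processing capacity) can be illustrated
with a toy model. This is only a sketch, not Spark code and not part of the
SPIP: the object BackPressureSketch, the function backlogAfterBatches, and
all of its parameters are hypothetical names, and the model assumes a
constant arrival rate and a fixed overhead per batch.

```scala
// Toy model only (no Spark dependency): how the backlog evolves when the
// trigger interval shrinks while a fixed per-batch overhead stays the same.
object BackPressureSketch {

  /** Backlog (in records) remaining after each of `batches` intervals.
    * All parameters are hypothetical and chosen for illustration.
    */
  def backlogAfterBatches(
      arrivalRatePerSec: Double, // records arriving per second
      capacityPerSec: Double,    // records the cluster can process per second
      fixedOverheadSec: Double,  // assumed fixed scheduling/checkpoint cost per batch
      intervalSec: Double,       // trigger interval
      batches: Int): Seq[Double] = {
    // Time left for real work in each interval once the overhead is paid.
    val usableSec = math.max(0.0, intervalSec - fixedOverheadSec)
    val processedPerBatch = capacityPerSec * usableSec
    val arrivedPerBatch = arrivalRatePerSec * intervalSec
    // scanLeft yields the initial backlog plus one value per batch;
    // .tail drops the initial 0.0 so the result has `batches` entries.
    (1 to batches).scanLeft(0.0) { (backlog, _) =>
      math.max(0.0, backlog + arrivedPerBatch - processedPerBatch)
    }.tail
  }
}
```

With, say, 1000 records/s arriving, a capacity of 1500 records/s and 0.05 s
of overhead per batch, a 1 s interval keeps the backlog at zero, while a
0.1 s interval makes it grow with every batch because the overhead consumes
half of each interval.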
