Hi,

My point that "in real time application or data, there is nothing as an
answer which is supposed to be late and correct. The timeliness is part of
the application. If I get the right answer too slowly it becomes useless or
wrong" is actually fundamental to *why* we need this Spark Structured
Streaming proposal.

The proposal is precisely about enabling Spark to power applications where,
as I define it, the *timeliness* of the answer is as critical as its
*correctness*. Spark's current streaming engine, primarily operating on
micro-batches, often delivers results that are technically "correct" but
arrive too late to be truly useful for certain high-stakes, real-time
scenarios. This makes them "useless or wrong" in a practical,
business-critical sense.

For example, in *real-time fraud detection* and in *high-frequency trading*,
market data or trade execution commands must be delivered with minimal
latency. Even a slight delay can mean missed opportunities or significant
financial losses, making a "correct" price update useless if it is not
instantaneous. The proposal should make Spark suitable for these demanding
use cases, where a "late but correct" answer is simply not good enough. As a
corollary, this is a fundamental concept, so it has to be treated as such in
the SPIP, not as a comment.
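
As a minimal sketch of this idea (an illustration only, not code from the
SPIP): a result is accepted only if it is delivered within a latency budget.
The names TimedResult, latency_ms and is_useful are hypothetical, and the
100 ms budget is an assumption loosely based on the O(100) milliseconds
target mentioned in the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedResult:
    """Hypothetical container for an answer and its timing."""
    value: float        # the computed answer, e.g. a fraud score
    event_time_ms: int  # when the input event occurred
    emit_time_ms: int   # when the answer was delivered


def latency_ms(result: TimedResult) -> int:
    """End-to-end delay between the event and the delivered answer."""
    return result.emit_time_ms - result.event_time_ms


def is_useful(result: TimedResult, deadline_ms: int) -> bool:
    """Timeliness is part of correctness: a late answer is rejected."""
    return latency_ms(result) <= deadline_ms


if __name__ == "__main__":
    deadline_ms = 100  # assumed latency budget, not a value from the SPIP
    on_time = TimedResult(value=0.97, event_time_ms=1_000, emit_time_ms=1_080)
    late = TimedResult(value=0.97, event_time_ms=1_000, emit_time_ms=6_000)
    print(is_useful(on_time, deadline_ms))  # True: 80 ms is within budget
    print(is_useful(late, deadline_ms))     # False: same value, but 5 s late
```

Both results carry the same value; only the delivery time differs, which is
exactly the "late but correct" case.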

Hope this clarifies the connection in practical terms.
Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

> Hey Mich,
>
> Sorry, I may be missing something here but what does your definition here
> have to do with the SPIP?   Perhaps add comments directly to the SPIP to
> provide context as the code snippet below is a direct copy from the SPIP
> itself.
>
> Thanks,
> Denny
>
>
>
>
> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Just to add
>>
>> A stronger definition of real time. The engineering definition of real
>> time is roughly "fast enough to be interactive".
>>
>> However, I propose a stronger definition. In real time application or data,
>> there is nothing as an answer which is supposed to be late and correct. The
>> timeliness is part of the application. If I get the right answer too slowly
>> it becomes useless or wrong.
>>
>>
>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> The current limitations in SSS come from micro-batching. If you are going
>>> to reduce the micro-batch interval, this reduction must be balanced against
>>> the available processing capacity of the cluster to prevent back pressure
>>> and instability. In the case of Continuous Processing mode, a specific
>>> continuous trigger with a desired checkpoint interval, quote
>>>
>>> "
>>> df.writeStream
>>>    .format("...")
>>>    .option("...")
>>>    .trigger(Trigger.RealTime("300 Seconds"))    // new trigger type to enable real-time Mode
>>>    .start()
>>> This Trigger.RealTime signals that the query should run in the new ultra
>>> low-latency execution mode.  A time interval can also be specified, e.g.
>>> “300 Seconds”, to indicate how long each micro-batch should run for.
>>> "
>>>
>>> will inevitably depend on many factors. It is not that simple.
>>> HTH
>>>
>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to start a discussion thread for the SPIP titled “Real-Time Mode
>>>> in Apache Spark Structured Streaming” that I've been working on with Siying
>>>> Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA
>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>>> ].
>>>>
>>>> The SPIP proposes a new execution mode called “Real-time Mode” in Spark
>>>> Structured Streaming that significantly lowers end-to-end latency for
>>>> processing streams of data.
>>>>
>>>> A key principle of this proposal is compatibility. Our goal is to make
>>>> Spark capable of handling streaming jobs that need results almost
>>>> immediately (within O(100) milliseconds). We want to achieve this without
>>>> changing the high-level DataFrame/Dataset API that users already use – so
>>>> existing streaming queries can run in this new ultra-low-latency mode by
>>>> simply turning it on, without rewriting their logic.
>>>>
>>>> In short, we’re trying to enable Spark to power real-time applications
>>>> (like instant anomaly alerts or live personalization) that today cannot
>>>> meet their latency requirements with Spark’s current streaming engine.
>>>>
>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>> this approach!
>>>>
>>>>
