+1

On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:

> +1
> By unifying batch and low-latency streaming in Spark, we can eliminate the
> need for separate streaming engines, reducing system complexity and
> operational cost. Excited to see this direction!
>
> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> My point about "in a real-time application, there is no such thing as an
>> answer that is late but correct; timeliness is part of the application, and
>> if I get the right answer too slowly it becomes useless or wrong" is
>> actually fundamental to *why* we need this Spark Structured Streaming
>> proposal.
>>
>> The proposal is precisely about enabling Spark to power applications
>> where, as I define it, the *timeliness* of the answer is as critical as
>> its *correctness*. Spark's current streaming engine, primarily operating
>> on micro-batches, often delivers results that are technically "correct" but
>> arrive too late to be truly useful for certain high-stakes, real-time
>> scenarios. This makes them "useless or wrong" in a practical,
>> business-critical sense.
>>
>> For example, in *real-time fraud detection* and *high-frequency trading*,
>> market data or trade execution commands must be delivered with minimal
>> latency. Even a slight delay can mean missed opportunities or significant
>> financial losses, making a "correct" price update useless if it is not
>> instantaneous. This proposal is about making Spark suitable for these
>> demanding use cases, where a "late but correct" answer is simply not good
>> enough. As a corollary, this is a fundamental concept, so it has to be
>> treated as such in the SPIP, not merely as a comment.
>>
>> Hope this clarifies the connection in practical terms
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>
>>> Hey Mich,
>>>
>>> Sorry, I may be missing something here, but what does your definition have
>>> to do with the SPIP? Perhaps add comments directly to the SPIP to provide
>>> context, as the code snippet below is a direct copy from the SPIP itself.
>>>
>>> Thanks,
>>> Denny
>>>
>>>
>>>
>>>
>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Just to add:
>>>>
>>>> A stronger definition of real time: the engineering definition of real
>>>> time is roughly "fast enough to be interactive."
>>>>
>>>> However, I would put a stronger definition. In a real-time application,
>>>> there is no such thing as an answer that is late but correct. Timeliness
>>>> is part of the application; if I get the right answer too slowly, it
>>>> becomes useless or wrong.
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh,
>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> The current limitations in Spark Structured Streaming come from
>>>>> micro-batching. If you are going to reduce the micro-batch interval, that
>>>>> reduction must be balanced against the available processing capacity of
>>>>> the cluster to prevent back pressure and instability. Similarly, in
>>>>> Continuous Processing mode, you specify a continuous trigger with a
>>>>> desired checkpoint interval. To quote:
>>>>> "
>>>>> df.writeStream
>>>>>    .format("...")
>>>>>    .option("...")
>>>>>    .trigger(Trigger.RealTime(“300 Seconds”))    // new trigger type to
>>>>> enable real-time Mode
>>>>>    .start()
>>>>> This Trigger.RealTime signals that the query should run in the new
>>>>> ultra low-latency execution mode.  A time interval can also be specified,
>>>>> e.g. “300 Seconds”, to indicate how long each micro-batch should run for.
>>>>> "
>>>>>
>>>>> The choice of such an interval will inevitably depend on many factors. It
>>>>> is not that simple.
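>>>>>
>>>>> For comparison, in today's Continuous Processing mode the checkpoint
>>>>> interval is set directly on the continuous trigger. A minimal sketch
>>>>> (the source, sink, path and interval values below are placeholders, not
>>>>> recommendations):
>>>>>
>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>
>>>>> // Continuous Processing mode: the argument to Trigger.Continuous is the
>>>>> // asynchronous checkpoint interval, not a micro-batch duration.
>>>>> val query = spark.readStream
>>>>>   .format("rate")        // placeholder source for illustration
>>>>>   .load()
>>>>>   .writeStream
>>>>>   .format("console")     // placeholder sink for illustration
>>>>>   .option("checkpointLocation", "/tmp/ckpt-demo")   // hypothetical path
>>>>>   .trigger(Trigger.Continuous("1 second"))
>>>>>   .start()
>>>>>
>>>>> How far that interval can be pushed down is exactly the kind of
>>>>> cluster-capacity question raised above.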
>>>>> HTH
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I want to start a discussion thread for the SPIP titled “Real-Time
>>>>>> Mode in Apache Spark Structured Streaming” that I've been working on with
>>>>>> Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: 
>>>>>> [
>>>>>> JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>
>>>>>> ].
>>>>>>
>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” in
>>>>>> Spark Structured Streaming that significantly lowers end-to-end latency 
>>>>>> for
>>>>>> processing streams of data.
>>>>>>
>>>>>> A key principle of this proposal is compatibility. Our goal is to
>>>>>> make Spark capable of handling streaming jobs that need results almost
>>>>>> immediately (within O(100) milliseconds). We want to achieve this without
>>>>>> changing the high-level DataFrame/Dataset API that users already use – so
>>>>>> existing streaming queries can run in this new ultra-low-latency mode by
>>>>>> simply turning it on, without rewriting their logic.
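>>>>>>
>>>>>> As a rough sketch of what "turning it on" could look like (the sink,
>>>>>> checkpoint path and the `events` streaming DataFrame are placeholders;
>>>>>> Trigger.RealTime is the trigger proposed in the SPIP doc, not an
>>>>>> existing Spark API):
>>>>>>
>>>>>> // Existing micro-batch query
>>>>>> val query = events.writeStream
>>>>>>   .format("...")                                   // unchanged sink
>>>>>>   .option("checkpointLocation", "/tmp/ckpt")       // hypothetical path
>>>>>>   .trigger(Trigger.ProcessingTime("10 seconds"))   // micro-batch trigger
>>>>>>   .start()
>>>>>>
>>>>>> // Same query opted into the proposed real-time mode: only the trigger
>>>>>> // changes; the DataFrame logic and sink configuration stay the same.
>>>>>> val rtQuery = events.writeStream
>>>>>>   .format("...")
>>>>>>   .option("checkpointLocation", "/tmp/ckpt")
>>>>>>   .trigger(Trigger.RealTime("300 Seconds"))        // proposed in the SPIP
>>>>>>   .start()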
>>>>>>
>>>>>> In short, we’re trying to enable Spark to power real-time
>>>>>> applications (like instant anomaly alerts or live personalization) that
>>>>>> today cannot meet their latency requirements with Spark’s current 
>>>>>> streaming
>>>>>> engine.
>>>>>>
>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>>>> this approach!
>>>>>>
>>>>>>

-- 
Best,
Yanbo
