+1. Super excited by this initiative!

On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
> +1
>
> --
> Best,
> Yanbo
>
> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> +1
>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>
>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My point that "in a real-time application or data feed, there is no such thing as an answer that is supposed to be late and correct; the timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>
>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>
>>> For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal is about making Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>
>>> Hope this clarifies the connection in practical terms.
>>>
>>> Dr Mich Talebzadeh,
>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>
>>>> Hey Mich,
>>>>
>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>
>>>> Thanks,
>>>> Denny
>>>>
>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Just to add:
>>>>>
>>>>> A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".
>>>>>
>>>>> However, I put forward a stronger definition. In a real-time application or data feed, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly, it becomes useless or wrong.
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> The current limitations in SSS (Spark Structured Streaming) come from micro-batching. If you are going to reduce the micro-batch interval, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, you specify a continuous trigger with a desired checkpoint interval. To quote the SPIP:
>>>>>>
>>>>>> "
>>>>>> df.writeStream
>>>>>>   .format("...")
>>>>>>   .option("...")
>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>>   .start()
>>>>>>
>>>>>> This Trigger.RealTime signals that the query should run in the new ultra-low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>> "
>>>>>>
>>>>>> How well this works will inevitably depend on many factors. Not that simple.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>
>>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
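Mich's capacity point can be made concrete with today's micro-batch engine: shrinking the trigger interval only helps if the input per batch is bounded so the cluster can keep up. A minimal sketch, assuming a hypothetical Kafka topic "events" and broker address; maxOffsetsPerTrigger (an existing Kafka source option) caps how much each short-interval batch pulls:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ShortIntervalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("short-interval-microbatch")
      .getOrCreate()

    // Bound the work per micro-batch so a short trigger interval does not
    // outrun cluster capacity and build up back pressure.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .option("maxOffsetsPerTrigger", "10000")          // cap records per batch
      .load()

    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second")) // short micro-batch interval
      .start()
      .awaitTermination()
  }
}

Even with such caps, per-batch scheduling overhead puts a floor on latency, which is the gap the proposed Real-Time Mode targets.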
>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>
>>>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>>>
>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>>>
>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>>>
>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
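To illustrate the compatibility principle in the announcement above: a minimal sketch of what "simply turning it on" could look like, assuming Trigger.RealTime lands with the shape quoted earlier in the thread (a proposed API, not yet in Spark); the rate source and the filter are hypothetical stand-ins for a real query:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object RealTimeModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("realtime-mode-sketch")
      .getOrCreate()

    // An existing query: the DataFrame logic stays identical in both variants.
    val alerts = spark.readStream
      .format("rate")            // built-in test source
      .load()
      .filter("value % 100 = 0") // hypothetical stand-in for detection logic

    // Today: micro-batch execution.
    alerts.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()
      .awaitTermination()

    // With the SPIP: the same query opts in to real-time mode by swapping
    // only the trigger (proposed API, per the SPIP excerpt quoted above):
    //   .trigger(Trigger.RealTime("300 Seconds"))
  }
}

The point of the sketch is that the query body and sink are untouched; only the trigger selects the execution mode.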