Thanks to everyone in the community for your interest in and support for this proposal. We've had extensive and constructive discussions both in this thread and in the SPIP document. These conversations have been positive and encouraging for moving in this direction. Special thanks to the SPIP authors for actively addressing questions and feedback.
Unless something has been overlooked, it seems that the discussions have largely settled. The next step will be to initiate a vote on the SPIP. To allow time for any final comments or feedback, I will hold off on starting the vote for another day or two.

On Fri, May 30, 2025 at 10:56 AM Denny Lee <denny.g....@gmail.com> wrote:

> +1 (non-binding)
>
> On Fri, May 30, 2025 at 9:17 AM xianjin <xian...@apache.org> wrote:
>
>> +1
>> Sent from my iPhone
>>
>> On May 29, 2025, at 12:53 PM, Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>
>> +1
>>
>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
>>
>>> +1, LGTM.
>>>
>>> Kent
>>>
>>> On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>
>>>> +1. Super excited by this initiative!
>>>>
>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>>>
>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My point about "in real time application or data, there is nothing as an answer which is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>>>>
>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*.
>>>>>>> Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>>>>
>>>>>>> For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. The proposal is about making Spark suitable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, it is a fundamental concept, so it has to be treated as such, not as a comment in the SPIP.
>>>>>>>
>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>
>>>>>>> view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Mich,
>>>>>>>>
>>>>>>>> Sorry, I may be missing something here, but what does your definition here have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Denny
>>>>>>>>
>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> just to add
>>>>>>>>>
>>>>>>>>> A stronger definition of real time.
>>>>>>>>> The engineering definition of real time is roughly "fast enough to be interactive".
>>>>>>>>>
>>>>>>>>> However, I put forward a stronger definition. In real time application or data, there is nothing as an answer which is supposed to be late and correct. The timeliness is part of the application. If I get the right answer too slowly it becomes useless or wrong.
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>>
>>>>>>>>> view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trigger with a desired checkpoint interval, quote:
>>>>>>>>>>
>>>>>>>>>> "
>>>>>>>>>> df.writeStream
>>>>>>>>>>   .format("...")
>>>>>>>>>>   .option("...")
>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>>>>>>>>   .start()
>>>>>>>>>>
>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>>>>>> "
>>>>>>>>>>
>>>>>>>>>> will inevitably depend on many factors.
>>>>>>>>>> Not that simple.
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>>>
>>>>>>>>>> view my Linkedin profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I want to start a discussion thread for the SPIP titled “Real-Time Mode in Apache Spark Structured Streaming” that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>>>>>
>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>>>>>>>
>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use – so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
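[Editorial aside: the latency floor that micro-batching imposes, which motivates the O(100) ms target above, can be illustrated with a toy model. This is plain Python, not Spark code, and the batch interval and processing times are assumed numbers chosen for illustration: a record arriving mid-interval must wait for the next batch boundary before processing even starts, so average end-to-end latency is roughly half the batch interval plus processing time, no matter how fast the batch itself runs.]

```python
# Toy model (not Spark code): average end-to-end latency under
# micro-batching vs. per-record processing. All numbers are assumptions.

def micro_batch_latency(arrival_ms: int, batch_interval_ms: int, process_ms: int) -> int:
    """A record waits for the next batch boundary, then the batch runs."""
    next_boundary = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return (next_boundary - arrival_ms) + process_ms

def per_record_latency(process_ms: int) -> int:
    """A record is processed as soon as it arrives."""
    return process_ms

arrivals = range(0, 10_000, 7)  # one record every 7 ms for 10 s
batch = [micro_batch_latency(t, 500, 20) for t in arrivals]  # 500 ms trigger
rt = [per_record_latency(20) for _ in arrivals]

avg_batch = sum(batch) / len(batch)
avg_rt = sum(rt) / len(rt)
print(f"avg micro-batch latency: {avg_batch:.0f} ms")  # ~ interval/2 + processing
print(f"avg per-record latency:  {avg_rt:.0f} ms")
```

In this sketch the ~250 ms average wait comes purely from the 500 ms trigger interval; shrinking the interval lowers the floor but increases per-batch scheduling overhead, which is the trade-off the proposed execution mode is meant to sidestep.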
>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark’s current streaming engine.
>>>>>>>>>>>
>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
>>>>>
>>>>> --
>>>>> Best,
>>>>> Yanbo
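[Editorial aside: the SPIP's compatibility claim — the query body stays the same and only the trigger changes — can be sketched in PySpark-flavoured pseudocode. This is not runnable: `spark`, the Kafka source, the filter condition, and the `realTime` trigger option are all placeholders, and the real-time trigger is only *proposed* in the SPIP, so the final API name may differ.]

```python
# Pseudocode sketch: same query logic, two execution modes.
# `realTime=` is the SPIP's *proposed* trigger, not an existing option.

events = (spark.readStream
          .format("kafka")
          .option("subscribe", "events")   # placeholder topic
          .load())

alerts = events.filter("risk_score > 0.9")  # identical logic in both modes

# Today: micro-batch execution (exists in Spark)
alerts.writeStream.format("console").trigger(processingTime="1 second").start()

# Proposed: real-time mode per the SPIP; only this line changes
alerts.writeStream.format("console").trigger(realTime="300 seconds").start()
```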