+1. On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
> +1
>
> Sent from my iPhone
>
> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>
> > +1 Nice feature
> >
> > On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
> >
> >> +1
> >>
> >> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31:
> >>
> >>> +1, LGTM.
> >>>
> >>> Kent
> >>>
> >>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
> >>>
> >>>> +1. Super excited by this initiative!
> >>>>
> >>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
> >>>>>
> >>>>>> +1
> >>>>>> By unifying batch and low-latency streaming in Spark, we can
> >>>>>> eliminate the need for separate streaming engines, reducing system
> >>>>>> complexity and operational cost. Excited to see this direction!
> >>>>>>
> >>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> My point about "in a real-time application or data, there is no such
> >>>>>>> thing as an answer which is supposed to be late and correct. The
> >>>>>>> timeliness is part of the application. If I get the right answer too
> >>>>>>> slowly it becomes useless or wrong" is actually fundamental to *why*
> >>>>>>> we need this Spark Structured Streaming proposal.
> >>>>>>>
> >>>>>>> The proposal is precisely about enabling Spark to power applications
> >>>>>>> where, as I define it, the *timeliness* of the answer is as critical
> >>>>>>> as its *correctness*. Spark's current streaming engine, primarily
> >>>>>>> operating on micro-batches, often delivers results that are
> >>>>>>> technically "correct" but arrive too late to be truly useful for
> >>>>>>> certain high-stakes, real-time scenarios. This makes them "useless or
> >>>>>>> wrong" in a practical, business-critical sense.
> >>>>>>>
> >>>>>>> For example, in *real-time fraud detection* and in *high-frequency
> >>>>>>> trading*, market data or trade execution commands must be delivered
> >>>>>>> with minimal latency. Even a slight delay can mean missed
> >>>>>>> opportunities or significant financial losses, making a "correct"
> >>>>>>> price update useless if it's not instantaneous. This proposal aims to
> >>>>>>> make Spark viable for these demanding use cases, where a "late but
> >>>>>>> correct" answer is simply not good enough. As a corollary, this is a
> >>>>>>> fundamental concept, so it has to be treated as such in the SPIP, not
> >>>>>>> as a comment.
> >>>>>>>
> >>>>>>> Hope this clarifies the connection in practical terms.
> >>>>>>>
> >>>>>>> Dr Mich Talebzadeh,
> >>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>
> >>>>>>> view my Linkedin profile
> >>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>
> >>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hey Mich,
> >>>>>>>>
> >>>>>>>> Sorry, I may be missing something here, but what does your
> >>>>>>>> definition have to do with the SPIP? Perhaps add comments directly
> >>>>>>>> to the SPIP to provide context, as the code snippet below is a
> >>>>>>>> direct copy from the SPIP itself.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Denny
> >>>>>>>>
> >>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Just to add:
> >>>>>>>>>
> >>>>>>>>> A stronger definition of real time. The engineering definition of
> >>>>>>>>> real time is roughly "fast enough to be interactive".
> >>>>>>>>>
> >>>>>>>>> However, I put forward a stronger definition. In a real-time
> >>>>>>>>> application or data, there is no such thing as an answer which is
> >>>>>>>>> supposed to be late and correct.
> >>>>>>>>> The timeliness is part of the application. If I get the right
> >>>>>>>>> answer too slowly, it becomes useless or wrong.
> >>>>>>>>>
> >>>>>>>>> Dr Mich Talebzadeh,
> >>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>>>
> >>>>>>>>> view my Linkedin profile
> >>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>>>
> >>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> The current limitations in SSS come from micro-batching. If you
> >>>>>>>>>> are going to reduce micro-batching, this reduction must be
> >>>>>>>>>> balanced against the available processing capacity of the cluster
> >>>>>>>>>> to prevent back pressure and instability. In the case of
> >>>>>>>>>> Continuous Processing mode, a specific continuous trigger with a
> >>>>>>>>>> desired checkpoint interval, quote:
> >>>>>>>>>>
> >>>>>>>>>> "
> >>>>>>>>>> df.writeStream
> >>>>>>>>>>   .format("...")
> >>>>>>>>>>   .option("...")
> >>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
> >>>>>>>>>>   .start()
> >>>>>>>>>>
> >>>>>>>>>> This Trigger.RealTime signals that the query should run in the
> >>>>>>>>>> new ultra low-latency execution mode. A time interval can also be
> >>>>>>>>>> specified, e.g. "300 Seconds", to indicate how long each
> >>>>>>>>>> micro-batch should run for.
> >>>>>>>>>> "
> >>>>>>>>>>
> >>>>>>>>>> will inevitably depend on many factors.
> >>>>>>>>>> Not that simple.
> >>>>>>>>>>
> >>>>>>>>>> HTH
> >>>>>>>>>>
> >>>>>>>>>> Dr Mich Talebzadeh,
> >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
> >>>>>>>>>>
> >>>>>>>>>> view my Linkedin profile
> >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> I want to start a discussion thread for the SPIP titled
> >>>>>>>>>>> "Real-Time Mode in Apache Spark Structured Streaming" that I've
> >>>>>>>>>>> been working on with Siying Dong, Indrajit Roy, Chao Sun,
> >>>>>>>>>>> Jungtaek Lim, and Michael Armbrust: [JIRA
> >>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
> >>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
> >>>>>>>>>>>
> >>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode"
> >>>>>>>>>>> in Spark Structured Streaming that significantly lowers
> >>>>>>>>>>> end-to-end latency for processing streams of data.
> >>>>>>>>>>>
> >>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is
> >>>>>>>>>>> to make Spark capable of handling streaming jobs that need
> >>>>>>>>>>> results almost immediately (within O(100) milliseconds). We want
> >>>>>>>>>>> to achieve this without changing the high-level
> >>>>>>>>>>> DataFrame/Dataset API that users already use, so existing
> >>>>>>>>>>> streaming queries can run in this new ultra-low-latency mode by
> >>>>>>>>>>> simply turning it on, without rewriting their logic.
> >>>>>>>>>>>
> >>>>>>>>>>> In short, we're trying to enable Spark to power real-time
> >>>>>>>>>>> applications (like instant anomaly alerts or live
> >>>>>>>>>>> personalization) that today cannot meet their latency
> >>>>>>>>>>> requirements with Spark's current streaming engine.
> >>>>>>>>>>>
> >>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions
> >>>>>>>>>>> on this approach!
> >>>>>
> >>>>> --
> >>>>> Best,
> >>>>> Yanbo
>
> --
> John Zhuge
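
[Editor's note: the latency argument in the thread above — that a micro-batch engine adds, on average, half the trigger interval of extra latency to every record, while a per-record engine adds only its fixed processing delay — can be illustrated with a small simulation. This is plain Python, not Spark; the function names and the chosen 1-second batch interval and 50 ms per-record delay are illustrative assumptions, not numbers from the SPIP.]

```python
import random

def added_latency_microbatch(arrivals, batch_interval):
    """Latency a batching engine adds: each record waits until the end of
    the micro-batch it falls into before its result can be emitted."""
    return [((t // batch_interval) + 1) * batch_interval - t for t in arrivals]

def added_latency_per_record(arrivals, processing_delay):
    """A per-record (real-time) engine adds only a fixed processing delay."""
    return [processing_delay for _ in arrivals]

random.seed(42)
# 10,000 records arriving at uniformly random times over a 60-second window.
arrivals = [random.uniform(0.0, 60.0) for _ in range(10_000)]

mb = added_latency_microbatch(arrivals, batch_interval=1.0)      # 1 s trigger
rt = added_latency_per_record(arrivals, processing_delay=0.05)   # 50 ms/record

# A record waits on average half the batch interval, and up to the full interval.
print(f"micro-batch: mean {sum(mb) / len(mb):.3f}s, max {max(mb):.3f}s")
print(f"per-record : mean {sum(rt) / len(rt):.3f}s, max {max(rt):.3f}s")
```

This is why shrinking the trigger interval, as Mich notes, trades latency against cluster headroom: halving the interval halves the added latency but doubles the scheduling overhead per unit time, which is the tension the proposed real-time mode is meant to escape.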