+1

On Thu, May 29, 2025 at 02:22, Yuming Wang <yumw...@apache.org> wrote:
> +1.
>
> On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote:
>
>> +1
>> Sent from my iPhone
>>
>> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote:
>>
>> +1 Nice feature
>>
>> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
>>>
>>>> +1, LGTM.
>>>>
>>>> Kent
>>>>
>>>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>>>
>>>>> +1. Super excited by this initiative!
>>>>>
>>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>> By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
>>>>>>>
>>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> My point that "in a real-time application, there is no such thing as an answer that is late and correct; timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.
>>>>>>>>
>>>>>>>> The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.
>>>>>>>>
>>>>>>>> For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade-execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal would make Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.
>>>>>>>>
>>>>>>>> Hope this clarifies the connection in practical terms.
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>
>>>>>>>> view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Mich,
>>>>>>>>>
>>>>>>>>> Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Denny
>>>>>>>>>
>>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> just to add
>>>>>>>>>>
>>>>>>>>>> A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".
>>>>>>>>>>
>>>>>>>>>> However, I would put a stronger definition. In a real-time application, there is no such thing as an answer that is late and correct. Timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.
>>>>>>>>>>
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The current limitations in Spark Structured Streaming come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. As with Continuous Processing mode, the choice of a specific trigger with a desired checkpoint interval, quote:
>>>>>>>>>>>
>>>>>>>>>>> "
>>>>>>>>>>> df.writeStream
>>>>>>>>>>>   .format("...")
>>>>>>>>>>>   .option("...")
>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>>>>>>>   .start()
>>>>>>>>>>>
>>>>>>>>>>> This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
>>>>>>>>>>> "
>>>>>>>>>>>
>>>>>>>>>>> will inevitably depend on many factors. Not that simple.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh
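>>>>>>>>>>>
>>>>>>>>>>> For reference, today's Continuous Processing mode is enabled the same way, by swapping the trigger. A minimal sketch with the stable API follows; the rate source, console sink, checkpoint path, and one-second interval are all illustrative choices:
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>>>>>>>
>>>>>>>>>>> val spark = SparkSession.builder
>>>>>>>>>>>   .appName("continuous-sketch")
>>>>>>>>>>>   .master("local[*]") // for local experimentation only
>>>>>>>>>>>   .getOrCreate()
>>>>>>>>>>>
>>>>>>>>>>> // The rate source emits rows continuously and is one of the few
>>>>>>>>>>> // sources supported by continuous processing.
>>>>>>>>>>> val df = spark.readStream
>>>>>>>>>>>   .format("rate")
>>>>>>>>>>>   .option("rowsPerSecond", "10")
>>>>>>>>>>>   .load()
>>>>>>>>>>>
>>>>>>>>>>> val query = df.writeStream
>>>>>>>>>>>   .format("console")
>>>>>>>>>>>   .option("checkpointLocation", "/tmp/continuous-cp") // illustrative path
>>>>>>>>>>>   .trigger(Trigger.Continuous("1 second")) // interval between checkpoints, not a batch interval
>>>>>>>>>>>   .start()
>>>>>>>>>>>
>>>>>>>>>>> query.awaitTermination()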
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>>>>>>
>>>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.
>>>>>>>>>>>>
>>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.
>>>>>>>>>>>>
>>>>>>>>>>>> In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
>>>>>>>>>>>>
>>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
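>>>>>>>>>>>>
>>>>>>>>>>>> To make the compatibility point concrete, here is a hypothetical before/after sketch: the query logic is identical and only the trigger changes. Note that Trigger.RealTime is the API proposed in the SPIP and is not part of any released Spark version; the Kafka source, sink, and intervals below are illustrative:
>>>>>>>>>>>>
>>>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>>>> import org.apache.spark.sql.streaming.Trigger
>>>>>>>>>>>>
>>>>>>>>>>>> val spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()
>>>>>>>>>>>>
>>>>>>>>>>>> // An illustrative streaming source; any supported source would do.
>>>>>>>>>>>> val events = spark.readStream
>>>>>>>>>>>>   .format("kafka")
>>>>>>>>>>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>>>>>>>>>>>   .option("subscribe", "events")
>>>>>>>>>>>>   .load()
>>>>>>>>>>>>
>>>>>>>>>>>> // Today: micro-batch execution with a processing-time trigger (stable API).
>>>>>>>>>>>> val batchQuery = events.writeStream
>>>>>>>>>>>>   .format("console")
>>>>>>>>>>>>   .option("checkpointLocation", "/tmp/cp-batch")
>>>>>>>>>>>>   .trigger(Trigger.ProcessingTime("1 minute"))
>>>>>>>>>>>>   .start()
>>>>>>>>>>>>
>>>>>>>>>>>> // With the proposal: the same query opts into real-time mode by
>>>>>>>>>>>> // changing only the trigger (proposed API, per the SPIP doc).
>>>>>>>>>>>> val realTimeQuery = events.writeStream
>>>>>>>>>>>>   .format("console")
>>>>>>>>>>>>   .option("checkpointLocation", "/tmp/cp-realtime")
>>>>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // from the SPIP; not yet in Spark
>>>>>>>>>>>>   .start()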