+1 (non-binding)

On Fri, May 30, 2025 at 9:17 AM xianjin <xian...@apache.org> wrote:
> +1
> Sent from my iPhone
>
> On May 29, 2025, at 12:53 PM, Yuanjian Li <xyliyuanj...@gmail.com> wrote:
>
> > +1
> >
> > On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
> >
>> +1, LGTM.
>>
>> Kent
>>
>> On Thursday, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:
>>
>>> +1. Super excited by this initiative!
>>>
>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>> By unifying batch and low-latency streaming in Spark, we can
>>>>> eliminate the need for separate streaming engines, reducing system
>>>>> complexity and operational cost. Excited to see this direction!
>>>>>
>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> My point that "in a real-time application, there is no such thing
>>>>>> as an answer that is late and correct; timeliness is part of the
>>>>>> application, and if I get the right answer too slowly it becomes
>>>>>> useless or wrong" is actually fundamental to *why* we need this
>>>>>> Spark Structured Streaming proposal.
>>>>>>
>>>>>> The proposal is precisely about enabling Spark to power
>>>>>> applications where, as I define it, the *timeliness* of the answer
>>>>>> is as critical as its *correctness*. Spark's current streaming
>>>>>> engine, primarily operating on micro-batches, often delivers
>>>>>> results that are technically "correct" but arrive too late to be
>>>>>> truly useful for certain high-stakes, real-time scenarios. This
>>>>>> makes them "useless or wrong" in a practical, business-critical
>>>>>> sense.
>>>>>>
>>>>>> For example, in *real-time fraud detection* and *high-frequency
>>>>>> trading*, market data or trade execution commands must be
>>>>>> delivered with minimal latency. Even a slight delay can mean
>>>>>> missed opportunities or significant financial losses, making a
>>>>>> "correct" price update useless if it is not instantaneous. This
>>>>>> proposal is about making Spark viable for these demanding use
>>>>>> cases, where a "late but correct" answer is simply not good
>>>>>> enough. As a corollary, this is a fundamental concept, so it has
>>>>>> to be treated as such in the SPIP, not as a comment.
>>>>>>
>>>>>> Hope this clarifies the connection in practical terms.
>>>>>>
>>>>>> Dr Mich Talebzadeh,
>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>
>>>>>> view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Mich,
>>>>>>>
>>>>>>> Sorry, I may be missing something here, but what does your
>>>>>>> definition have to do with the SPIP? Perhaps add comments
>>>>>>> directly to the SPIP to provide context, as the code snippet
>>>>>>> below is a direct copy from the SPIP itself.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Denny
>>>>>>>
>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh
>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to add:
>>>>>>>>
>>>>>>>> A stronger definition of real time. The engineering definition
>>>>>>>> of real time is roughly "fast enough to be interactive".
>>>>>>>>
>>>>>>>> However, I put forward a stronger definition: in a real-time
>>>>>>>> application, there is no such thing as an answer that is late
>>>>>>>> and correct.
>>>>>>>> Timeliness is part of the application; if I get the right
>>>>>>>> answer too slowly, it becomes useless or wrong.
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>
>>>>>>>> view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh
>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The current limitations in Spark Structured Streaming come from
>>>>>>>>> micro-batching. If you are going to reduce micro-batching, the
>>>>>>>>> reduction must be balanced against the available processing
>>>>>>>>> capacity of the cluster to prevent backpressure and
>>>>>>>>> instability. In the case of Continuous Processing mode, a
>>>>>>>>> specific continuous trigger with a desired checkpoint interval
>>>>>>>>> is required. To quote the SPIP:
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>> df.writeStream
>>>>>>>>>   .format("...")
>>>>>>>>>   .option("...")
>>>>>>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time mode
>>>>>>>>>   .start()
>>>>>>>>>
>>>>>>>>> This Trigger.RealTime signals that the query should run in the
>>>>>>>>> new ultra low-latency execution mode. A time interval can also
>>>>>>>>> be specified, e.g. "300 Seconds", to indicate how long each
>>>>>>>>> micro-batch should run for.
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> The right interval will inevitably depend on many factors. It
>>>>>>>>> is not that simple.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>>>>>>>>
>>>>>>>>> view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
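To put the trigger discussion above in context, here is a minimal sketch of how the existing trigger modes compare with the proposed one. Trigger.ProcessingTime and Trigger.Continuous are current Spark APIs; Trigger.RealTime exists only in the SPIP, and the "rate" source and "console" sink are stand-ins for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("TriggerModes").getOrCreate()

// Toy streaming source for illustration; any streaming DataFrame works here.
val df = spark.readStream.format("rate").load()

// Micro-batch mode (today's default path): a new batch is planned every interval.
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Continuous Processing mode (experimental since Spark 2.3): long-running
// tasks; the interval is the checkpoint interval, not a batch boundary.
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()

// Proposed Real-Time Mode (SPIP only -- Trigger.RealTime does not exist in
// current Spark): per the SPIP, the interval indicates how long each
// micro-batch should run for, while results are emitted with low latency.
df.writeStream
  .format("console")
  .trigger(Trigger.RealTime("300 Seconds"))
  .start()

Note that the interval argument means something different in each mode, which is part of why choosing it is not simple.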
>>>>>>>>>
>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng
>>>>>>>>> <jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>>>>>>>>>> "Real-Time Mode in Apache Spark Structured Streaming" that
>>>>>>>>>> I've been working on with Siying Dong, Indrajit Roy, Chao Sun,
>>>>>>>>>> Jungtaek Lim, and Michael Armbrust:
>>>>>>>>>> [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>>>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>>>>>>
>>>>>>>>>> The SPIP proposes a new execution mode called "Real-Time Mode"
>>>>>>>>>> in Spark Structured Streaming that significantly lowers
>>>>>>>>>> end-to-end latency for processing streams of data.
>>>>>>>>>>
>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is
>>>>>>>>>> to make Spark capable of handling streaming jobs that need
>>>>>>>>>> results almost immediately (within O(100) milliseconds). We
>>>>>>>>>> want to achieve this without changing the high-level
>>>>>>>>>> DataFrame/Dataset API that users already use, so existing
>>>>>>>>>> streaming queries can run in this new ultra-low-latency mode
>>>>>>>>>> by simply turning it on, without rewriting their logic.
>>>>>>>>>>
>>>>>>>>>> In short, we're trying to enable Spark to power real-time
>>>>>>>>>> applications (like instant anomaly alerts or live
>>>>>>>>>> personalization) that today cannot meet their latency
>>>>>>>>>> requirements with Spark's current streaming engine.
>>>>>>>>>>
>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>>>>>>>>> suggestions on this approach!
>>>>
>>>> --
>>>> Best,
>>>> Yanbo
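To make the compatibility claim in the announcement concrete, here is a hypothetical end-to-end sketch of an instant-alert query under the proposed API. The broker address, topic names, schema, and checkpoint path are invented for illustration, and Trigger.RealTime is not part of current Spark; everything else is the existing DataFrame API, unchanged.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("RealTimeAlerts").getOrCreate()
import spark.implicits._

// Hypothetical source: Kafka topic "sensor-readings" whose value is a
// CSV pair "deviceId,temperature". Broker and topics are placeholders.
val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "sensor-readings")
  .load()
  .select(split($"value".cast("string"), ",").as("fields"))
  .select(
    $"fields".getItem(0).as("deviceId"),
    $"fields".getItem(1).cast("double").as("temperature"))

// The query logic itself is identical in either execution mode.
val alerts = readings
  .filter($"temperature" > 100.0)
  .selectExpr("deviceId AS key", "CAST(temperature AS STRING) AS value")

alerts.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "temperature-alerts")
  .option("checkpointLocation", "/tmp/alerts-ckpt")
  // Today this line would be e.g. .trigger(Trigger.ProcessingTime("1 second"));
  // under the SPIP, the same query opts into the new mode instead:
  .trigger(Trigger.RealTime("300 Seconds")) // proposed API, not in current Spark
  .start()
  .awaitTermination()

The query body (parse, filter, write to Kafka) is exactly what one would write for micro-batch execution today; per the SPIP, only the trigger line changes.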