+1

On Wed, May 28, 2025 at 19:31, Kent Yao <y...@apache.org> wrote:
+1, LGTM.

Kent

On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote:

+1. Super excited by this initiative!

On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> wrote:

+1

On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> wrote:

+1
By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!

On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

My point that "in a real-time application or data, there is no such thing as an answer that is late and correct; timeliness is part of the application, and if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured Streaming proposal.

The proposal is precisely about enabling Spark to power applications where, as I define it, the *timeliness* of the answer is as critical as its *correctness*. Spark's current streaming engine, primarily operating on micro-batches, often delivers results that are technically "correct" but arrive too late to be truly useful for certain high-stakes, real-time scenarios. This makes them "useless or wrong" in a practical, business-critical sense.

For example, in *real-time fraud detection* and *high-frequency trading*, market data or trade execution commands must be delivered with minimal latency. Even a slight delay can mean missed opportunities or significant financial losses, making a "correct" price update useless if it is not instantaneous. This proposal is about making Spark viable for these demanding use cases, where a "late but correct" answer is simply not good enough. As a corollary, this is a fundamental concept, so it has to be treated as such in the SPIP, not as a comment.

Hope this clarifies the connection in practical terms.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:

Hey Mich,

Sorry, I may be missing something here, but what does your definition have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context, as the code snippet below is a direct copy from the SPIP itself.

Thanks,
Denny

On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Just to add:

A stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive".

However, I put forward a stronger definition. In a real-time application or data, there is no such thing as an answer that is supposed to be late and correct.
The timeliness is part of the application; if I get the right answer too slowly, it becomes useless or wrong.

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, choosing a specific continuous trigger with a desired checkpoint interval, quote:

"
    df.writeStream
      .format("...")
      .option("...")
      .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
      .start()

This Trigger.RealTime signals that the query should run in the new ultra low-latency execution mode. A time interval can also be specified, e.g. "300 Seconds", to indicate how long each micro-batch should run for.
"

will inevitably depend on many factors. Not that simple.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:

Hi all,

I want to start a discussion thread for the SPIP titled "Real-Time Mode in Apache Spark Structured Streaming" that I've been working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].

The SPIP proposes a new execution mode called "Real-time Mode" in Spark Structured Streaming that significantly lowers end-to-end latency for processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within O(100) milliseconds). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode by simply turning it on, without rewriting their logic.

In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.

We'd greatly appreciate your feedback, thoughts, and suggestions on this approach!
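[For readers following the thread, a minimal end-to-end sketch of the API under discussion. Trigger.RealTime is the trigger proposed in the SPIP and does not exist in any released Spark version; the rate source, console sink, and interval value here are illustrative choices to keep the example self-contained.]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder()
      .appName("real-time-mode-sketch")
      .getOrCreate()

    // Any ordinary streaming DataFrame; the built-in rate source keeps this
    // sketch runnable without external infrastructure.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Today's micro-batch execution: results surface once per trigger interval.
    // events.writeStream.trigger(Trigger.ProcessingTime("10 seconds")) ...

    // Proposed real-time mode: the same query with only the trigger changed.
    // Trigger.RealTime is the SPIP's proposed API, not part of released Spark;
    // per the SPIP, the interval indicates how long each batch should run for.
    val query = events.writeStream
      .format("console")
      .trigger(Trigger.RealTime("300 Seconds"))
      .start()

    query.awaitTermination()

Everything except the trigger line is standard Structured Streaming code that runs today, which is the compatibility property the proposal emphasizes.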