OK, fair points.

This SPIP (Structured Streaming, in this context) admittedly does not meet
the rigorous, academic definition of a soft real-time system, due to the
lack of explicit, guaranteed deadlines and internal mechanisms for handling
missed frames.
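
For contrast, here is a minimal Python sketch, purely illustrative and not anything that exists in the SPIP or in Spark, of the kind of explicit frame-deadline bookkeeping referred to above: each result is checked against a declared per-frame deadline, and misses are surfaced (degraded output, recorded frame IDs) rather than silently absorbed.

```python
# Hypothetical sketch (NOT part of the SPIP or Spark): the deadline
# bookkeeping a "strict" soft real-time system would carry. Each frame's
# result is classified against an explicit deadline; misses are recorded
# and flagged instead of being silently ignored.

from dataclasses import dataclass, field

@dataclass
class FrameDeadlineMonitor:
    frame_ms: float                       # explicit per-frame deadline
    missed: list = field(default_factory=list)

    def record(self, frame_id: int, latency_ms: float) -> str:
        """Classify one frame's result against the deadline."""
        if latency_ms <= self.frame_ms:
            return "on-time"
        # Soft real-time: keep running, but surface the miss explicitly.
        self.missed.append(frame_id)
        return "late (degraded)"

monitor = FrameDeadlineMonitor(frame_ms=100.0)
results = [monitor.record(i, lat) for i, lat in enumerate([40.0, 95.0, 130.0, 60.0])]
print(results)         # frame 2 exceeded the 100 ms deadline
print(monitor.missed)
```

The point of the sketch is only that a strict soft real-time system names its deadline up front and reacts to misses; the SPIP, as discussed below, specifies neither.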

Having said that, despite not being a "strict" soft real-time system, the
SPIP offers significant advantages that make it highly valuable for many
real-world applications:

   - Lower Latency: Much faster than traditional batch processing.
   - High Throughput & Scalability: Handles large volumes of data
   effectively.
   - Robust Fault Tolerance: Reliable operation even with failures.
   - Ease of Integration: Fits well into existing data architectures
   without significant rewrites.
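
The latency/throughput trade-off behind the first two points can be made concrete with back-of-the-envelope arithmetic; the per-batch overhead and per-record cost below are invented purely for illustration:

```python
# Back-of-the-envelope micro-batch model (illustrative numbers only):
# each batch pays a fixed scheduling overhead, so smaller batches lower
# latency but spend a larger fraction of their time on overhead, while
# larger batches amortize it at the cost of latency.

def batch_metrics(batch_size: int, overhead_ms: float = 50.0,
                  per_record_ms: float = 0.01) -> tuple:
    """Return (latency_ms, overhead_fraction) for one micro-batch."""
    processing = batch_size * per_record_ms
    latency = overhead_ms + processing   # time until the batch's results emit
    return latency, overhead_ms / latency

for size in (100, 10_000, 1_000_000):
    latency, frac = batch_metrics(size)
    print(f"batch={size:>9,}  latency={latency:>10.1f} ms  overhead={frac:.0%}")
```

With these made-up constants, a 100-record batch emits in ~51 ms but spends ~98% of its time on overhead, while a million-record batch amortizes overhead to under 1% at ~10 s latency, which is the tension any low-latency mode has to resolve.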

"Real-time Enough" for Many: In other words for a vast number of business
use cases, where delays are undesirable but not catastrophic (i.e., not
life-or-death or causing immediate, irreparable damage), the SPIP's
performance is sufficient. This addresses the practical reality of how
"real-time" is used in the industry.

Proposed Better Terminology -> One could adopt more precise terms such as
"near real-time streaming" or "interactive streaming" to accurately
describe the system's capabilities and bridge the gap between academic
rigor and practical industry usage. This IMO is a good suggestion to reduce
ambiguity.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Fri, 30 May 2025 at 20:39, Mark Hamstra <markhams...@gmail.com> wrote:

> A soft real-time system still defines an interval or frame within which
> results should be available, and often provides explicit warning or
> error-handling mechanisms when frame rates are missed. I see nothing like
> that in the SPIP. Instead, the length of the underlying microbatches is
> specified in the Trigger, but result reporting is just as quickly as
> possible with no reporting interval or frame rate specified and nothing
> that I can see happening if results take longer than the user is guessing
> or expecting. That's a low-latency, "we'll do it as fast as we can, but no
> promises or guarantees" system, not real-time.
>
> On Thu, May 29, 2025 at 11:57 PM Jerry Peng <jerry.boyang.p...@gmail.com>
> wrote:
>
>> Mark,
>>
>> For real-time systems there is a concept of "soft" real-time and "hard"
>> real-time systems.  These concepts exist in textbooks.  Here is a document
>> by Intel that explains it:
>>
>>
>> https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html
>>
>> "In a soft real-time system, computers or equipment will continue to
>> function after a missed deadline but may produce a lower-quality output.
>> For example, latency in online video games can impact player interactions,
>> but otherwise present no serious consequences."
>>
>> "Hard real-time systems have zero delay tolerance, and delayed signals
>> can result in total failure or present immediate danger to users. Flight
>> control systems and pacemakers are both examples where timeliness is not
>> only essential but the lack of it can result in a life-or-death situation."
>>
>> I don't think it is inaccurate or misleading to call this mode
>> real-time.  It is soft real-time.
>>
>> On Thu, May 29, 2025 at 11:44 PM Mark Hamstra <markhams...@gmail.com>
>> wrote:
>>
>>> Clarifying what is meant by "real-time" and explicitly differentiating
>>> it from actual real-time computing should be a bare minimum. I still don't
>>> like the use of marketing-speak "real-time" that isn't really real-time in
>>> engineering documents or API namespaces.
>>>
>>> On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com>
>>> wrote:
>>>
>>>> Mark,
>>>>
>>>> I thought we are simply discussing the naming of the mode?  Like I
>>>> mentioned, if you think simply calling this mode "real-time" mode may cause
>>>> confusion because "real-time" can mean other things in other fields, I can
>>>> clarify what we mean by "real-time" explicitly in the SPIP document and any
>>>> future documentation. That is not a problem and thank you for your 
>>>> feedback.
>>>>
>>>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com>
>>>> wrote:
>>>>
>>>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is
>>>>> an engineering document, not a marketing document. Technical clarity and
>>>>> accuracy should be non-negotiable.
>>>>>
>>>>>
>>>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <
>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> As an example of my point, if you go to the Apache Storm (another
>>>>>> stream processing engine) website:
>>>>>>
>>>>>> https://storm.apache.org/
>>>>>>
>>>>>> It describes Storm as:
>>>>>>
>>>>>> "Apache Storm is a free and open source distributed *realtime*
>>>>>> computation system"
>>>>>>
>>>>>> If you go to Apache Flink:
>>>>>>
>>>>>>
>>>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>>>>
>>>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>>>>
>>>>>> Thus, what the term "real-time" implies in this space should not be
>>>>>> confusing for folks in this area.
>>>>>>
>>>>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <
>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>
>>>>>>> Mich,
>>>>>>>
>>>>>>> If I understood your last email correctly, I think you also wanted
>>>>>>> to have a discussion about naming?  Why are we calling this new
>>>>>>> execution mode described in the SPIP "Real-time Mode"?  Here are my
>>>>>>> two cents.  Firstly, "continuous mode" is taken and we want another
>>>>>>> name to describe an execution mode that provides ultra low latency
>>>>>>> processing.  We could have called it "low latency mode", though I
>>>>>>> don't really like that naming since it implies the other execution
>>>>>>> modes are not low latency, which I don't believe is true.  This new
>>>>>>> proposed mode can simply deliver even lower latency.  Thus, we came
>>>>>>> up with the name "Real-time Mode".  Of course, we are talking about
>>>>>>> "soft" real-time here.  I think when we are talking about distributed
>>>>>>> stream processing systems in the space of big data analytics, it is
>>>>>>> reasonable to assume anything described in this space as "real-time"
>>>>>>> implies "soft" real-time.  Though if this is confusing or misleading,
>>>>>>> we can provide clear documentation on what "real-time" in real-time
>>>>>>> mode means and what it guarantees.  Just my thoughts.  I would love
>>>>>>> to hear other perspectives.
>>>>>>>
>>>>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think from what I have seen there are a good number of +1
>>>>>>>> responses as opposed to quantitative discussions (based on my
>>>>>>>> observations only). Given the objectives of the thread, we ought to
>>>>>>>> focus on what is meant by real time compared to continuous modes. To
>>>>>>>> be fair, it is a common point of confusion, and the terms are often
>>>>>>>> used interchangeably in general conversation, but in technical
>>>>>>>> contexts, especially with streaming data platforms, they have
>>>>>>>> specific and important differences.
>>>>>>>>
>>>>>>>> "Continuous Mode" refers to a processing strategy that aims for
>>>>>>>> true, uninterrupted, sub-millisecond latency processing.  Chiefly
>>>>>>>>
>>>>>>>>    - Event-at-a-Time (or very small  batch groups): The system
>>>>>>>>    processes individual events or extremely small groups of events ->
>>>>>>>>    microbatches as they flow through the pipeline.
>>>>>>>>    - Minimal Latency: The primary goal is to achieve the absolute
>>>>>>>>    lowest possible end-to-end latency, often in the order of 
>>>>>>>> milliseconds or
>>>>>>>>    even below
>>>>>>>>    - Most business use cases (say financial markets) can live with
>>>>>>>>    this as they do not rely on rdges
>>>>>>>>
>>>>>>>> Now what is meant by "Real-time Mode"
>>>>>>>>
>>>>>>>> This is where the nuance comes in. "Real-time" is a broader and
>>>>>>>> sometimes more subjective term. When the text introduces "Real-time
>>>>>>>> Mode" as distinct from "Continuous Mode," it suggests a specific
>>>>>>>> implementation that achieves real-time characteristics but might do
>>>>>>>> so differently or more robustly than a "continuous" mode attempt.
>>>>>>>> Going back to my earlier mention, in a real-time application there
>>>>>>>> is no such thing as an answer that is late and correct. The
>>>>>>>> timeliness is part of the application: if I get the right answer too
>>>>>>>> slowly, it becomes useless or wrong. This is what I call the "Late
>>>>>>>> and Correct is Useless" principle.
>>>>>>>>
>>>>>>>> In summary, "Real-time Mode" seems to describe an approach that
>>>>>>>> delivers low-latency processing with high reliability and ease of use,
>>>>>>>> leveraging established, battle-tested components.I invite the audience 
>>>>>>>> to
>>>>>>>> have a discussion on this.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>> GDPR
>>>>>>>>
>>>>>>>>    view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>>>>>> > +1
>>>>>>>>> >
>>>>>>>>> > Yuming Wang <yumw...@apache.org> 于2025年5月29日周四 02:22写道:
>>>>>>>>> >
>>>>>>>>> > > +1.
>>>>>>>>> > >
>>>>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com>
>>>>>>>>> wrote:
>>>>>>>>> > >
>>>>>>>>> > >> +1
>>>>>>>>> > >> Sent from my iPhone
>>>>>>>>> > >>
>>>>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >>
>>>>>>>>> > >> +1 Nice feature
>>>>>>>>> > >>
>>>>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <
>>>>>>>>> xyliyuanj...@gmail.com>
>>>>>>>>> > >> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >>> +1
>>>>>>>>> > >>>
>>>>>>>>> > >>> Kent Yao <y...@apache.org> 于2025年5月28日周三 19:31写道:
>>>>>>>>> > >>>
>>>>>>>>> > >>>> +1, LGTM.
>>>>>>>>> > >>>>
>>>>>>>>> > >>>> Kent
>>>>>>>>> > >>>>
>>>>>>>>> > >>>> 在 2025年5月29日星期四,Chao Sun <sunc...@apache.org> 写道:
>>>>>>>>> > >>>>
>>>>>>>>> > >>>>> +1. Super excited by this initiative!
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <
>>>>>>>>> yblia...@gmail.com>
>>>>>>>>> > >>>>> wrote:
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>>> +1
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
>>>>>>>>> huaxin.ga...@gmail.com>
>>>>>>>>> > >>>>>> wrote:
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>>> +1
>>>>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we
>>>>>>>>> can
>>>>>>>>> > >>>>>>> eliminate the need for separate streaming engines,
>>>>>>>>> reducing system
>>>>>>>>> > >>>>>>> complexity and operational cost. Excited to see this
>>>>>>>>> direction!
>>>>>>>>> > >>>>>>>
>>>>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>>>>>>>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>
>>>>>>>>> > >>>>>>>> Hi,
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> My point about "in real time application or data, there
>>>>>>>>> is nothing
>>>>>>>>> > >>>>>>>> as an answer which is supposed to be late and correct.
>>>>>>>>> The timeliness is
>>>>>>>>> > >>>>>>>> part of the application. if I get the right answer too
>>>>>>>>> slowly it becomes
>>>>>>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we
>>>>>>>>> need this
>>>>>>>>> > >>>>>>>> Spark Structured Streaming proposal.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>>>>>>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of
>>>>>>>>> the answer
>>>>>>>>> > >>>>>>>> is as critical as its *correctness*. Spark's current
>>>>>>>>> streaming
>>>>>>>>> > >>>>>>>> engine, primarily operating on micro-batches, often
>>>>>>>>> delivers results that
>>>>>>>>> > >>>>>>>> are technically "correct" but arrive too late to be
>>>>>>>>> truly useful for
>>>>>>>>> > >>>>>>>> certain high-stakes, real-time scenarios. This makes
>>>>>>>>> them "useless or
>>>>>>>>> > >>>>>>>> wrong" in a practical, business-critical sense.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> For example, *in real-time fraud detection* and in
>>>>>>>>> > >>>>>>>> *high-frequency trading*, market data or trade execution
>>>>>>>>> > >>>>>>>> commands must be delivered with minimal latency. Even a
>>>>>>>>> > >>>>>>>> slight delay can mean missed opportunities or significant
>>>>>>>>> > >>>>>>>> financial losses, making a "correct" price update useless
>>>>>>>>> > >>>>>>>> if it's not instantaneous. This makes Spark suitable for
>>>>>>>>> > >>>>>>>> these demanding use cases, where a "late but correct"
>>>>>>>>> > >>>>>>>> answer is simply not good enough. As a corollary, it is a
>>>>>>>>> > >>>>>>>> fundamental concept, so it has to be treated as such in
>>>>>>>>> > >>>>>>>> the SPIP, not as a comment.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms
>>>>>>>>> > >>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>>>>>>>> Analysis |
>>>>>>>>> > >>>>>>>> GDPR
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>    view my Linkedin profile
>>>>>>>>> > >>>>>>>> <
>>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <
>>>>>>>>> denny.g....@gmail.com>
>>>>>>>>> > >>>>>>>> wrote:
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>> Hey Mich,
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> Sorry, I may be missing something here but what does
>>>>>>>>> your
>>>>>>>>> > >>>>>>>>> definition here have to do with the SPIP?   Perhaps
>>>>>>>>> add comments directly
>>>>>>>>> > >>>>>>>>> to the SPIP to provide context as the code snippet
>>>>>>>>> below is a direct copy
>>>>>>>>> > >>>>>>>>> from the SPIP itself.
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> Thanks,
>>>>>>>>> > >>>>>>>>> Denny
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>> just to add
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering
>>>>>>>>> definition of
>>>>>>>>> > >>>>>>>>>> real time is roughly fast enough to be interactive
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> However, I put a stronger definition. In real time
>>>>>>>>> application or
>>>>>>>>> > >>>>>>>>>> data, there is nothing as an answer which is supposed
>>>>>>>>> to be late and
>>>>>>>>> > >>>>>>>>>> correct. The timeliness is part of the application. If
>>>>>>>>> > >>>>>>>>>> I get the right answer too slowly, it becomes useless
>>>>>>>>> > >>>>>>>>>> or wrong.
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>>>>>>>>> Analysis |
>>>>>>>>> > >>>>>>>>>> GDPR
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>    view my Linkedin profile
>>>>>>>>> > >>>>>>>>>> <
>>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>>>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from
>>>>>>>>> micro-batching. If you
>>>>>>>>> > >>>>>>>>>>> are going to reduce micro-batching, this reduction
>>>>>>>>> must be balanced against
>>>>>>>>> > >>>>>>>>>>> the available processing capacity of the cluster to
>>>>>>>>> prevent back pressure
>>>>>>>>> > >>>>>>>>>>> and instability. In the case of Continuous
>>>>>>>>> Processing mode, a
>>>>>>>>> > >>>>>>>>>>> specific continuous trigger with a desired
>>>>>>>>> checkpoint interval quote
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> "
>>>>>>>>> > >>>>>>>>>>> df.writeStream
>>>>>>>>> > >>>>>>>>>>>    .format("...")
>>>>>>>>> > >>>>>>>>>>>    .option("...")
>>>>>>>>> > >>>>>>>>>>>    .trigger(Trigger.RealTime("300 Seconds"))    //
>>>>>>>>> new trigger
>>>>>>>>> > >>>>>>>>>>> type to enable real-time Mode
>>>>>>>>> > >>>>>>>>>>>    .start()
>>>>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should
>>>>>>>>> run in the
>>>>>>>>> > >>>>>>>>>>> new ultra low-latency execution mode.  A time
>>>>>>>>> interval can also be
>>>>>>>>> > >>>>>>>>>>> specified, e.g. "300 Seconds", to indicate how long
>>>>>>>>> each micro-batch should
>>>>>>>>> > >>>>>>>>>>> run for.
>>>>>>>>> > >>>>>>>>>>> "
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors. Not that
>>>>>>>>> > >>>>>>>>>>> simple.
>>>>>>>>> > >>>>>>>>>>> HTH
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> > >>>>>>>>>>> Architect | Data Science | Financial Crime |
>>>>>>>>> Forensic Analysis |
>>>>>>>>> > >>>>>>>>>>> GDPR
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>    view my Linkedin profile
>>>>>>>>> > >>>>>>>>>>> <
>>>>>>>>> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> Hi all,
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP
>>>>>>>>> titled
>>>>>>>>> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured
>>>>>>>>> Streaming” that I've been
>>>>>>>>> > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao
>>>>>>>>> Sun, Jungtaek Lim, and
>>>>>>>>> > >>>>>>>>>>>> Michael Armbrust: [JIRA
>>>>>>>>> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>>>>>> [Doc
>>>>>>>>> > >>>>>>>>>>>> <
>>>>>>>>> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing
>>>>>>>>> >
>>>>>>>>> > >>>>>>>>>>>> ].
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called
>>>>>>>>> “Real-time Mode”
>>>>>>>>> > >>>>>>>>>>>> in Spark Structured Streaming that significantly
>>>>>>>>> lowers end-to-end latency
>>>>>>>>> > >>>>>>>>>>>> for processing streams of data.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility.
>>>>>>>>> Our goal is
>>>>>>>>> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs
>>>>>>>>> that need results almost
>>>>>>>>> > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want
>>>>>>>>> to achieve this without
>>>>>>>>> > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that
>>>>>>>>> users already use – so
>>>>>>>>> > >>>>>>>>>>>> existing streaming queries can run in this new
>>>>>>>>> ultra-low-latency mode by
>>>>>>>>> > >>>>>>>>>>>> simply turning it on, without rewriting their logic.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power
>>>>>>>>> real-time
>>>>>>>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>>>>>>>>> personalization) that
>>>>>>>>> > >>>>>>>>>>>> today cannot meet their latency requirements with
>>>>>>>>> Spark’s current streaming
>>>>>>>>> > >>>>>>>>>>>> engine.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>>>>>>>> > >>>>>>>>>>>> suggestions on this approach!
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> --
>>>>>>>>> > >>>>>> Best,
>>>>>>>>> > >>>>>> Yanbo
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>
>>>>>>>>> > >>
>>>>>>>>> > >> --
>>>>>>>>> > >> John Zhuge
>>>>>>>>> > >>
>>>>>>>>> > >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>>
