Hi Xintong,

Thanks for sharing your thoughts!

1. Regarding Record-ordered and State-ordered of processElement.
>
> I understand that while State-ordered likely provides better performance,
> Record-ordered is sometimes required for correctness. The question is how
> should a user choose between these two modes? My concern is that such a
> decision may require users to have in-depth knowledge about the Flink
> internals, and may lead to correctness issues if State-ordered is chosen
> improperly.
>
> I'd suggest not to expose such a knob, at least in the first version. That
> means always use Record-ordered for custom operators / UDFs, and keep
> State-ordered for internal usages (built-in operators) only.
>

Indeed, users may not be able to choose the mode properly. I agree we should
keep this option for internal use only.
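
To make the correctness point concrete, here is a minimal sketch of a keyed
deduplication UDF written against the FLIP-424-style asynchronous state API.
The asyncValue()/asyncUpdate() methods and the callback style are taken from
the draft FLIPs and may change; the sketch only illustrates a same-key
read-modify-write that would break under State-ordered execution:

    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Sketch only: ValueState here refers to the async state interface proposed
    // in FLIP-424 (asyncValue()/asyncUpdate() returning a StateFuture), not the
    // existing ValueState.
    public class DedupFunction extends KeyedProcessFunction<String, String, String> {

        private transient ValueState<Boolean> seen; // initialized in open(), omitted here

        @Override
        public void processElement(String record, Context ctx, Collector<String> out) {
            // Read-modify-write on the current key. Under State-ordered execution,
            // two records with the same key may both observe seen == null before
            // either asyncUpdate() lands, so both would be emitted -- a correctness
            // bug that Record-ordered execution avoids by serializing same-key records.
            seen.asyncValue().thenAccept(flag -> {
                if (flag == null) {
                    seen.asyncUpdate(true);
                    out.collect(record);
                }
            });
        }
    }

Such cases are hard for users to spot in advance, which supports keeping the
choice internal for now.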


2. Regarding Strictly-ordered and Out-of-order of Watermarks.
>
> I'm not entirely sure about Strictly-ordered being the default, or even
> being supported. From my understanding, a Watermark(T) only suggests that
> all records with event time before T have arrived, and it has nothing to do
> with whether records with event time after T have arrived or not. From that
> perspective, preventing certain records from arriving before a Watermark is
> never supported. I also cannot come up with any use case where
> Strictly-ordered is necessary. This implies the same issue as 1): how does
> the user choose between the two modes?
>
> I'd suggest not exposing the knob to users and only supporting Out-of-order,
> until we see a concrete use case where Strictly-ordered is needed.
>

The semantics of watermarks do not define any ordering between a watermark and
the records that follow it. In most cases this is inconsequential, although it
may affect users who have implicitly relied on ordered execution. I'd be fine
with initially supporting only out-of-order processing, and we can consider
exposing the 'Strictly-ordered' mode once a concrete use case calls for it.
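
For illustration, the implicit ordering assumption mentioned above might look
like the following hand-written pattern (plain DataStream API, existing
interfaces): onTimer clears a per-key buffer and assumes that no record which
followed the watermark in the input stream has been merged into the buffer
yet. This is not a case taken from the FLIP, just a sketch of the kind of job
that Out-of-order processing could affect:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class BufferedSumFunction extends KeyedProcessFunction<String, Long, Long> {

        private transient ListState<Long> buffer; // initialized in open(), omitted here

        @Override
        public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
            buffer.add(value);
            // Assumes event-time timestamps are assigned; the timer fires once the
            // watermark passes this record's timestamp.
            ctx.timerService().registerEventTimeTimer(ctx.timestamp());
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
            long sum = 0;
            for (Long v : buffer.get()) {
                sum += v;
            }
            out.collect(sum);
            // Implicitly assumes nothing that followed the watermark in the stream
            // has been added to the buffer yet; Out-of-order execution may process
            // such a record before this timer fires, and its value would be lost here.
            buffer.clear();
        }
    }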


My philosophies behind not exposing the two config options are:
> - There are already too many options in Flink that users barely know how to
> use. I think Flink should try as much as possible to decide its own
> behavior, rather than throwing all the decisions to the users.
> - It's much harder to take back knobs than to introduce them. Therefore,
> options should be introduced only if concrete use cases are identified.
>

I agree we should keep the set of configurable items minimal, especially for
the MVP. Since we can still refine the functionality before the framework
transitions from @Experimental to @PublicEvolving, it makes sense to hold back
user-facing options until their necessity is established.


Best,
Zakelly

On Tue, Mar 19, 2024 at 12:06 PM Xintong Song <tonysong...@gmail.com> wrote:

> Sorry for joining the discussion late.
>
> I have two questions about FLIP-425.
>
> 1. Regarding Record-ordered and State-ordered of processElement.
>
> I understand that while State-ordered likely provides better performance,
> Record-ordered is sometimes required for correctness. The question is how
> should a user choose between these two modes? My concern is that such a
> decision may require users to have in-depth knowledge about the Flink
> internals, and may lead to correctness issues if State-ordered is chosen
> improperly.
>
> I'd suggest not to expose such a knob, at least in the first version. That
> means always use Record-ordered for custom operators / UDFs, and keep
> State-ordered for internal usages (built-in operators) only.
>
> 2. Regarding Strictly-ordered and Out-of-order of Watermarks.
>
> I'm not entirely sure about Strictly-ordered being the default, or even
> being supported. From my understanding, a Watermark(T) only suggests that
> all records with event time before T have arrived, and it has nothing to do
> with whether records with event time after T have arrived or not. From that
> perspective, preventing certain records from arriving before a Watermark is
> never supported. I also cannot come up with any use case where
> Strictly-ordered is necessary. This implies the same issue as 1): how does
> the user choose between the two modes?
>
> I'd suggest not exposing the knob to users and only supporting Out-of-order,
> until we see a concrete use case where Strictly-ordered is needed.
>
>
> My philosophies behind not exposing the two config options are:
> - There are already too many options in Flink that users barely know how to
> use. I think Flink should try as much as possible to decide its own
> behavior, rather than throwing all the decisions to the users.
> - It's much harder to take back knobs than to introduce them. Therefore,
> options should be introduced only if concrete use cases are identified.
>
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Fri, Mar 8, 2024 at 2:45 AM Jing Ge <j...@ververica.com.invalid> wrote:
>
> > +1 for Gyula's suggestion. I just finished FLIP-423 which introduced the
> > intention of the big change and high level architecture. Great content
> btw!
> > The only public interface change for this FLIP is one new config to use
> > ForSt. It makes sense to have one dedicated discussion thread for each
> > concrete system design.
> >
> > @Zakelly The links in your mail do not work except the last one, because
> > the FLIP-xxx has been included into the url like
> > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425
> .
> >
> > NIT fix:
> >
> > FLIP-424:
> https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
> >
> > FLIP-425:
> https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
> >
> > FLIP-426:
> https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
> >
> > FLIP-427:
> https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> >
> > FLIP-428:
> https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
> >
> > Best regards,
> > Jing
> >
> >
> >
> >
> > On Thu, Mar 7, 2024 at 10:14 AM Zakelly Lan <zakelly....@gmail.com>
> wrote:
> >
> > > Hi everyone,
> > >
> > > Thank you all for a lively discussion here, and it is a good time to
> move
> > > forward to more detailed discussions. Thus we open several threads for
> > > sub-FLIPs:
> > >
> > > FLIP-424:
> > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
> > > FLIP-425
> > > <
> > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425
> >:
> > > https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
> > > FLIP-426
> > > <
> > https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0hFLIP-426
> >:
> > > https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
> > > FLIP-427
> > > <
> > https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrfFLIP-427
> >:
> > > https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> > > FLIP-428
> > > <
> > https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ftFLIP-428
> >:
> > > https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
> > >
> > > If you want to talk about the overall architecture, roadmap, milestones
> > or
> > > something related with multiple FLIPs, please post it here. Otherwise
> you
> > > can discuss some details in separate mails. Let's try to avoid repeated
> > > discussion in different threads. I will sync important messages here if
> > > there are any in the above threads.
> > >
> > > And reply to @Jeyhun: We will ensure the content between those FLIPs is
> > > consistent.
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Thu, Mar 7, 2024 at 2:16 PM Yuan Mei <yuanmei.w...@gmail.com>
> wrote:
> > >
> > > > I have been a bit busy these few weeks and sorry for responding late.
> > > >
> > > > The original thinking of keeping discussion within one thread is for
> > > easier
> > > > tracking and to avoid repeated discussion in different threads.
> > > >
> > > > For details, It might be good to start in different threads if
> needed.
> > > >
> > > > We will think of a way to better organize the discussion.
> > > >
> > > > Best
> > > > Yuan
> > > >
> > > >
> > > > On Thu, Mar 7, 2024 at 4:38 AM Jeyhun Karimov <je.kari...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > + 1 for the suggestion.
> > > > > Maybe we can start the discussion with the FLIPs with minimum
> dependencies
> > > > (from
> > > > > the other new/proposed FLIPs).
> > > > > Based on our discussion on a particular FLIP, the subsequent (or
> its
> > > > > dependent) FLIP(s) can be updated accordingly?
> > > > >
> > > > > Regards,
> > > > > Jeyhun
> > > > >
> > > > > On Wed, Mar 6, 2024 at 5:34 PM Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey all!
> > > > > >
> > > > > > This is a massive improvement / work. I just started going
> through
> > > the
> > > > > > Flips and have a more or less meta comment.
> > > > > >
> > > > > > While it's good to keep the overall architecture discussion
> here, I
> > > > think
> > > > > > we should still have separate discussions for each FLIP where we
> > can
> > > > > > discuss interface details etc. With so much content if we start
> > > adding
> > > > > > minor comments here that will lead to nowhere but those
> discussions
> > > are
> > > > > > still important and we should have them in separate threads (one
> > for
> > > > each
> > > > > > FLIP)
> > > > > >
> > > > > > What do you think?
> > > > > > Gyula
> > > > > >
> > > > > > On Wed, Mar 6, 2024 at 8:50 AM Yanfei Lei <fredia...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi team,
> > > > > > >
> > > > > > > Thanks for your discussion. Regarding FLIP-425, we have
> > > supplemented
> > > > > > > several updates to answer high-frequency questions:
> > > > > > >
> > > > > > > 1. We captured a flame graph of the Hashmap state backend in
> > > > > > > "Synchronous execution with asynchronous APIs"[1], which
> reveals
> > > that
> > > > > > > the framework overhead (including reference counting,
> > > future-related
> > > > > > > code and so on) consumes about 9% of the keyed operator CPU
> time.
> > > > > > > 2. We added a set of comparative experiments for watermark
> > > > processing,
> > > > > > > the performance of Out-Of-Order mode is 70% better than
> > > > > > > strictly-ordered mode under ~140MB state size. Instructions on
> > how
> > > to
> > > > > > > run this test have also been added[2].
> > > > > > > 3. Regarding the order of StreamRecord, whether it has state
> > access
> > > > or
> > > > > > > not. We supplemented a new *Strict order of
> 'processElement'*[3].
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-SynchronousexecutionwithasynchronousAPIs
> > > > > > > [2]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-Strictly-orderedmodevs.Out-of-ordermodeforwatermarkprocessing
> > > > > > > [3]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-ElementOrder
> > > > > > >
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Yanfei
> > > > > > >
> > > > > > > Yunfeng Zhou <flink.zhouyunf...@gmail.com> 于2024年3月5日周二
> 09:25写道:
> > > > > > > >
> > > > > > > > Hi Zakelly,
> > > > > > > >
> > > > > > > > > 5. I'm not very sure ... revisiting this later since it is
> > not
> > > > > > > important.
> > > > > > > >
> > > > > > > > It seems that we still have some details to confirm about
> this
> > > > > > > > question. Let's postpone this to after the critical parts of
> > the
> > > > > > > > design are settled.
> > > > > > > >
> > > > > > > > > 8. Yes, we had considered ... metrics should be like
> > > afterwards.
> > > > > > > >
> > > > > > > > Oh sorry I missed FLIP-431. I'm fine with discussing this
> topic
> > > in
> > > > > > > milestone 2.
> > > > > > > >
> > > > > > > > Looking forward to the detailed design about the strict mode
> > > > between
> > > > > > > > same-key records and the benchmark results about the epoch
> > > > mechanism.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Yunfeng
> > > > > > > >
> > > > > > > > On Mon, Mar 4, 2024 at 7:59 PM Zakelly Lan <
> > > zakelly....@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Yunfeng,
> > > > > > > > >
> > > > > > > > > For 1:
> > > > > > > > > I had a discussion with Lincoln Lee, and I realize it is a
> > > common
> > > > > > case
> > > > > > > the same-key record should be blocked before the
> > `processElement`.
> > > It
> > > > > is
> > > > > > > easier for users to understand. Thus I will introduce a strict
> > mode
> > > > for
> > > > > > > this and make it default. My rough idea is just like yours, by
> > > > invoking
> > > > > > > some method of AEC instance before `processElement`. The
> detailed
> > > > > design
> > > > > > > will be described in FLIP later.
> > > > > > > > >
> > > > > > > > > For 2:
> > > > > > > > > I agree with you. We could throw exceptions for now and
> > > optimize
> > > > > this
> > > > > > > later.
> > > > > > > > >
> > > > > > > > > For 5:
> > > > > > > > >>
> > > > > > > > >> It might be better to move the default values to the
> > Proposed
> > > > > > Changes
> > > > > > > > >> section instead of making them public for now, as there
> will
> > > be
> > > > > > > > >> compatibility issues once we want to dynamically adjust
> the
> > > > > > thresholds
> > > > > > > > >> and timeouts in future.
> > > > > > > > >
> > > > > > > > > Agreed. The whole framework is under experiment until we
> > think
> > > it
> > > > > is
> > > > > > > complete in 2.0 or later. The default value should be better
> > > > determined
> > > > > > > with more testing results and production experience.
> > > > > > > > >
> > > > > > > > >> The configuration execution.async-state.enabled seems
> > > > unnecessary,
> > > > > > as
> > > > > > > > >> the infrastructure may automatically get this information
> > from
> > > > the
> > > > > > > > >> detailed state backend configurations. We may revisit this
> > > part
> > > > > > after
> > > > > > > > >> the core designs have reached an agreement. WDYT?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not very sure if there is any use case where users
> write
> > > > their
> > > > > > > code using async APIs but run their job in a synchronous way.
> The
> > > > first
> > > > > > two
> > > > > > > scenarios that come to me are for benchmarking or for a small
> > > state,
> > > > > > while
> > > > > > > they don't want to rewrite their code. Actually it is easy to
> > > > support,
> > > > > so
> > > > > > > I'd suggest providing it. But I'm fine with revisiting this
> later
> > > > since
> > > > > > it
> > > > > > > is not important. WDYT?
> > > > > > > > >
> > > > > > > > > For 8:
> > > > > > > > > Yes, we had considered the I/O metrics group especially the
> > > > > > > back-pressure, idle and task busy per second. In the current
> plan
> > > we
> > > > > can
> > > > > > do
> > > > > > > state access during back-pressure, meaning that those metrics
> for
> > > > input
> > > > > > > would better be redefined. I suggest we discuss these existing
> > > > metrics
> > > > > as
> > > > > > > well as some new metrics that should be introduced in FLIP-431
> > > later
> > > > in
> > > > > > > milestone 2, since we have basically finished the framework
> thus
> > we
> > > > > will
> > > > > > > have a better view of what metrics should be like afterwards.
> > WDYT?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Zakelly
> > > > > > > > >
> > > > > > > > > On Mon, Mar 4, 2024 at 6:49 PM Yunfeng Zhou <
> > > > > > > flink.zhouyunf...@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >> Hi Zakelly,
> > > > > > > > >>
> > > > > > > > >> Thanks for the responses!
> > > > > > > > >>
> > > > > > > > >> > 1. I will discuss this with some expert SQL developers.
> > ...
> > > > mode
> > > > > > > for StreamRecord processing.
> > > > > > > > >>
> > > > > > > > >> In DataStream API there should also be use cases when the
> > > order
> > > > of
> > > > > > > > >> output is strictly required. I agree with it that SQL
> > experts
> > > > may
> > > > > > help
> > > > > > > > >> provide more concrete use cases that can accelerate our
> > > > > discussion,
> > > > > > > > >> but please allow me to search for DataStream use cases
> that
> > > can
> > > > > > prove
> > > > > > > > >> the necessity of this strict order preservation mode, if
> > > answers
> > > > > > from
> > > > > > > > >> SQL experts are shown to be negative.
> > > > > > > > >>
> > > > > > > > >> For your convenience, my current rough idea is that we can
> > > add a
> > > > > > > > >> module between the Input(s) and processElement() module in
> > > Fig 2
> > > > > of
> > > > > > > > >> FLIP-425. The module will be responsible for caching
> records
> > > > whose
> > > > > > > > >> keys collide with in-flight records, and AEC will only be
> > > > > > responsible
> > > > > > > > >> for handling async state calls, without knowing the record
> > > each
> > > > > call
> > > > > > > > >> belongs to. We may revisit this topic once the necessity
> of
> > > the
> > > > > > strict
> > > > > > > > >> order mode is clarified.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> > 2. The amount of parallel StateRequests ... instead of
> > > > invoking
> > > > > > > yield
> > > > > > > > >>
> > > > > > > > >> Your suggestions generally appeal to me. I think we may
> let
> > > > > > > > >> corresponding Flink jobs fail with OOM for now, since the
> > > > majority
> > > > > > of
> > > > > > > > >> a StateRequest should just be references to existing Java
> > > > objects,
> > > > > > > > >> which only occupies very small memory space and can hardly
> > > cause
> > > > > OOM
> > > > > > > > >> in common cases. We can monitor the pending StateRequests
> > and
> > > if
> > > > > > there
> > > > > > > > >> is really a risk of OOM in extreme cases, we can throw
> > > > Exceptions
> > > > > > with
> > > > > > > > >> proper messages notifying users what to do, like
> increasing
> > > > memory
> > > > > > > > >> through configurations.
> > > > > > > > >>
> > > > > > > > >> Your suggestions to adjust threshold adaptively or to use
> > the
> > > > > > blocking
> > > > > > > > >> buffer sounds good, and in my opinion we can postpone them
> > to
> > > > > future
> > > > > > > > >> FLIPs since they seem to only benefit users in rare cases.
> > > Given
> > > > > > that
> > > > > > > > >> FLIP-423~428 has already been a big enough design, it
> might
> > be
> > > > > > better
> > > > > > > > >> to focus on the most critical design for now and postpone
> > > > > > > > >> optimizations like this. WDYT?
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> > 5. Sure, we will introduce new configs as well as their
> > > > default
> > > > > > > value.
> > > > > > > > >>
> > > > > > > > >> Thanks for adding the default values and the values
> > themselves
> > > > > LGTM.
> > > > > > > > >> It might be better to move the default values to the
> > Proposed
> > > > > > Changes
> > > > > > > > >> section instead of making them public for now, as there
> will
> > > be
> > > > > > > > >> compatibility issues once we want to dynamically adjust
> the
> > > > > > thresholds
> > > > > > > > >> and timeouts in future.
> > > > > > > > >>
> > > > > > > > >> The configuration execution.async-state.enabled seems
> > > > unnecessary,
> > > > > > as
> > > > > > > > >> the infrastructure may automatically get this information
> > from
> > > > the
> > > > > > > > >> detailed state backend configurations. We may revisit this
> > > part
> > > > > > after
> > > > > > > > >> the core designs have reached an agreement. WDYT?
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> Besides, inspired by Jeyhun's comments, it comes to me
> that
> > > > > > > > >>
> > > > > > > > >> 8. Should this FLIP introduce metrics that measure the
> time
> > a
> > > > > Flink
> > > > > > > > >> job is back-pressured by State IOs? Under the current
> > design,
> > > > this
> > > > > > > > >> metric could measure the time when the blocking buffer is
> > full
> > > > and
> > > > > > > > >> yield() cannot get callbacks to process, which means the
> > > > operator
> > > > > is
> > > > > > > > >> fully waiting for state responses.
> > > > > > > > >>
> > > > > > > > >> Best regards,
> > > > > > > > >> Yunfeng
> > > > > > > > >>
> > > > > > > > >> On Mon, Mar 4, 2024 at 12:33 PM Zakelly Lan <
> > > > > zakelly....@gmail.com>
> > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > Hi Yunfeng,
> > > > > > > > >> >
> > > > > > > > >> > Thanks for your detailed comments!
> > > > > > > > >> >
> > > > > > > > >> >> 1. Why do we need a close() method on StateIterator?
> This
> > > > > method
> > > > > > > seems
> > > > > > > > >> >> unused in the usage example codes.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > The `close()` is introduced to release internal
> resources,
> > > but
> > > > > it
> > > > > > > does not seem to require the user to call it. I removed this.
> > > > > > > > >> >
> > > > > > > > >> >> 2. In FutureUtils.combineAll()'s JavaDoc, it is stated
> > that
> > > > "No
> > > > > > > null
> > > > > > > > >> >> entries are allowed". It might be better to further
> > explain
> > > > > what
> > > > > > > will
> > > > > > > > >> >> happen if a null value is passed, ignoring the value in
> > the
> > > > > > > returned
> > > > > > > > >> >> Collection or throwing exceptions. Given that
> > > > > > > > >> >> FutureUtils.emptyFuture() can be returned in the
> example
> > > > code,
> > > > > I
> > > > > > > > >> >> suppose the former one might be correct.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > The statement "No null entries are allowed" refers to
> the
> > > > > > > parameters, it means some arrayList like [null, StateFuture1,
> > > > > > StateFuture2]
> > > > > > > passed in are not allowed, and an Exception will be thrown.
> > > > > > > > >> >
> > > > > > > > >> >> 1. According to Fig 2 of this FLIP, ... . This
> situation
> > > > should
> > > > > > be
> > > > > > > > >> >> avoided and the order of same-key records should be
> > > strictly
> > > > > > > > >> >> preserved.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > I will discuss this with some expert SQL developers. And
> > if
> > > it
> > > > > is
> > > > > > > valid and common, I suggest a strict order preservation mode
> for
> > > > > > > StreamRecord processing. WDYT?
> > > > > > > > >> >
> > > > > > > > >> >> 2. The FLIP says that StateRequests submitted by
> > Callbacks
> > > > will
> > > > > > not
> > > > > > > > >> >> invoke further yield() methods. Given that yield() is
> > used
> > > > when
> > > > > > > there
> > > > > > > > >> >> is "too much" in-flight data, does it mean
> StateRequests
> > > > > > submitted
> > > > > > > by
> > > > > > > > >> >> Callbacks will never be "too much"? What if the total
> > > number
> > > > of
> > > > > > > > >> >> StateRequests exceed the capacity of Flink operator's
> > > memory
> > > > > > space?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > The amount of parallel StateRequests for one
> StreamRecord
> > > > cannot
> > > > > > be
> > > > > > > determined since the code is written by users. So the in-flight
> > > > > requests
> > > > > > > may be "too much", and may cause OOM. Users should re-configure
> > > their
> > > > > > job,
> > > > > > > controlling the amount of on-going StreamRecord. And I suggest
> > two
> > > > ways
> > > > > > to
> > > > > > > avoid this:
> > > > > > > > >> >
> > > > > > > > >> > Adaptively adjust the count of on-going StreamRecord
> > > according
> > > > > to
> > > > > > > historical StateRequests amount.
> > > > > > > > >> > Also control the max StateRequests that can be executed
> in
> > > > > > parallel
> > > > > > > for each StreamRecord, and if it exceeds, put the new
> > StateRequest
> > > in
> > > > > the
> > > > > > > blocking buffer waiting for execution (instead of invoking
> > > yield()).
> > > > > > > > >> >
> > > > > > > > >> > WDYT?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >> 3.1 I'm concerned that the out-of-order execution mode,
> > > along
> > > > > > with
> > > > > > > the
> > > > > > > > >> >> epoch mechanism, would bring more complexity to the
> > > execution
> > > > > > model
> > > > > > > > >> >> than the performance improvement it promises. Could we
> > add
> > > > some
> > > > > > > > >> >> benchmark results proving the benefit of this mode?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Agreed, will do.
> > > > > > > > >> >
> > > > > > > > >> >> 3.2 The FLIP might need to add a public API section
> > > > describing
> > > > > > how
> > > > > > > > >> >> users or developers can switch between these two
> > execution
> > > > > modes.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Good point. We will add a Public API section.
> > > > > > > > >> >
> > > > > > > > >> >> 3.3 Apart from the watermark and checkpoint mentioned
> in
> > > this
> > > > > > FLIP,
> > > > > > > > >> >> there are also more other events that might appear in
> the
> > > > > stream
> > > > > > of
> > > > > > > > >> >> data records. It might be better to generalize the
> > > execution
> > > > > mode
> > > > > > > > >> >> mechanism to handle all possible events.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Yes, I missed this point. Thanks for the reminder.
> > > > > > > > >> >
> > > > > > > > >> >> 4. It might be better to treat callback-handling as a
> > > > > > > > >> >> MailboxDefaultAction, instead of Mails, to avoid the
> > > overhead
> > > > > of
> > > > > > > > >> >> repeatedly creating Mail objects.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >  I thought the intermediated wrapper for callback can
> not
> > be
> > > > > > > omitted, since there will be some context switch before each
> > > > execution.
> > > > > > The
> > > > > > > MailboxDefaultAction in most cases is processInput right? While
> > the
> > > > > > > callback should be executed with higher priority. I'd suggest
> not
> > > > > > changing
> > > > > > > the basic logic of Mailbox and the default action since it is
> > very
> > > > > > critical
> > > > > > > for performance. But yes, we will try our best to avoid
> creating
> > > > > > > intermediated objects.
> > > > > > > > >> >
> > > > > > > > >> >> 5. Could this FLIP provide the current default values
> for
> > > > > things
> > > > > > > like
> > > > > > > > >> >> active buffer size thresholds and timeouts? These could
> > > help
> > > > > with
> > > > > > > > >> >> memory consumption and latency analysis.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Sure, we will introduce new configs as well as their
> > default
> > > > > > value.
> > > > > > > > >> >
> > > > > > > > >> >> 6. Why do we need to record the hashcode of a record in
> > its
> > > > > > > > >> >> RecordContext? It seems not used.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > The context switch before each callback execution
> involves
> > > > > > > setCurrentKey, where the hashCode is re-calculated. We cache it
> > for
> > > > > > > accelerating.
> > > > > > > > >> >
> > > > > > > > >> >> 7. In "timers can be stored on the JVM heap or
> RocksDB",
> > > the
> > > > > link
> > > > > > > > >> >> points to a document in flink-1.15. It might be better
> to
> > > > > verify
> > > > > > > the
> > > > > > > > >> >> referenced content is still valid in the latest Flink
> and
> > > > > update
> > > > > > > the
> > > > > > > > >> >> link accordingly. Same for other references if any.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Thanks for the reminder! Will check.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Thanks a lot & Best,
> > > > > > > > >> > Zakelly
> > > > > > > > >> >
> > > > > > > > >> > On Sat, Mar 2, 2024 at 6:18 AM Jeyhun Karimov <
> > > > > > je.kari...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >> >>
> > > > > > > > >> >> Hi,
> > > > > > > > >> >>
> > > > > > > > >> >> Thanks for the great proposals. I have a few comments:
> > > > > > > > >> >>
> > > > > > > > >> >> - Backpressure Handling. Flink's original backpressure
> > > > handling
> > > > > > is
> > > > > > > quite
> > > > > > > > >> >> robust and the semantics is quite "simple" (simple is
> > > > > beautiful).
> > > > > > > > >> >> This mechanism has proven to perform better/robust than
> > the
> > > > > other
> > > > > > > open
> > > > > > > > >> >> source streaming systems, where they were relying on
> some
> > > > > > loopback
> > > > > > > > >> >> information.
> > > > > > > > >> >> Now that the proposal also relies on loopback (yield in
> > > this
> > > > > > > case), it is
> > > > > > > > >> >> not clear how well the new backpressure handling
> proposed
> > > in
> > > > > > > FLIP-425 is
> > > > > > > > >> >> robust and handle fluctuating workloads.
> > > > > > > > >> >>
> > > > > > > > >> >> - Watermark/Timer Handling: Similar arguments apply for
> > > > > watermark
> > > > > > > and timer
> > > > > > > > >> >> handling. IMHO, we need more benchmarks showing the
> > > overhead
> > > > > > > > >> >> of epoch management with different parameters (e.g.,
> > window
> > > > > size,
> > > > > > > watermark
> > > > > > > > >> >> strategy, etc)
> > > > > > > > >> >>
> > > > > > > > >> >> - DFS consistency guarantees. The proposal in FLIP-427
> is
> > > > > > > DFS-agnostic.
> > > > > > > > >> >> However, different cloud providers have different
> storage
> > > > > > > consistency
> > > > > > > > >> >> models.
> > > > > > > > >> >> How do we want to deal with them?
> > > > > > > > >> >>
> > > > > > > > >> >>  Regards,
> > > > > > > > >> >> Jeyhun
> > > > > > > > >> >>
> > > > > > > > >> >>
> > > > > > > > >> >>
> > > > > > > > >> >>
> > > > > > > > >> >> On Fri, Mar 1, 2024 at 6:08 PM Zakelly Lan <
> > > > > > zakelly....@gmail.com>
> > > > > > > wrote:
> > > > > > > > >> >>
> > > > > > > > >> >> > Thanks Piotr for sharing your thoughts!
> > > > > > > > >> >> >
> > > > > > > > >> >> > I guess it depends how we would like to treat the
> local
> > > > > disks.
> > > > > > > I've always
> > > > > > > > >> >> > > thought about them that almost always eventually
> all
> > > > state
> > > > > > > from the DFS
> > > > > > > > >> >> > > should end up cached in the local disks.
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > OK I got it. In our proposal we treat local disk as
> an
> > > > > optional
> > > > > > > cache, so
> > > > > > > > >> >> > the basic design will handle the case with state
> > residing
> > > > in
> > > > > > DFS
> > > > > > > only. It
> > > > > > > > >> >> > is a more 'cloud-native' approach that does not rely
> on
> > > any
> > > > > > > local storage
> > > > > > > > >> >> > assumptions, which allow users to dynamically adjust
> > the
> > > > > > > capacity or I/O
> > > > > > > > >> >> > bound of remote storage to gain performance or save
> the
> > > > cost,
> > > > > > > even without
> > > > > > > > >> >> > a job restart.
> > > > > > > > >> >> >
> > > > > > > > >> >> > In
> > > > > > > > >> >> > > the currently proposed more fine grained solution,
> > you
> > > > > make a
> > > > > > > single
> > > > > > > > >> >> > > request to DFS per each state access.
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > > > >> >> > Ah that's not accurate. Actually we buffer the state
> > > > requests
> > > > > > > and process
> > > > > > > > >> >> > them in batch, multiple requests will correspond to
> one
> > > DFS
> > > > > > > access (One
> > > > > > > > >> >> > block access for multiple keys performed by RocksDB).
> > > > > > > > >> >> >
> > > > > > > > >> >> > In that benchmark you mentioned, are you requesting
> the
> > > > state
> > > > > > > > >> >> > > asynchronously from local disks into memory? If the
> > > > benefit
> > > > > > > comes from
> > > > > > > > >> >> > > parallel I/O, then I would expect the benefit to
> > > > > > > disappear/shrink when
> > > > > > > > >> >> > > running multiple subtasks on the same machine, as
> > they
> > > > > would
> > > > > > > be making
> > > > > > > > >> >> > > their own parallel requests, right? Also enabling
> > > > > > > checkpointing would
> > > > > > > > >> >> > > further cut into the available I/O budget.
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > That's an interesting topic. Our proposal is
> > specifically
> > > > > aimed
> > > > > > > at the
> > > > > > > > >> >> > scenario where the machine I/O is not fully loaded
> but
> > > the
> > > > > I/O
> > > > > > > latency has
> > > > > > > > >> >> > indeed become a bottleneck for each subtask. While
> the
> > > > > > > distributed file
> > > > > > > > >> >> > system is a prime example of a scenario characterized
> > by
> > > > > > > abundant and
> > > > > > > > >> >> > easily scalable I/O bandwidth coupled with higher I/O
> > > > > latency.
> > > > > > > You may
> > > > > > > > >> >> > expect to increase the parallelism of a job to
> enhance
> > > the
> > > > > > > performance as
> > > > > > > > >> >> > well, but that also brings in more waste of CPU's and
> > > > memory
> > > > > > for
> > > > > > > building
> > > > > > > > >> >> > up more subtasks. This is one drawback for the
> > > > > > > computation-storage tightly
> > > > > > > > >> >> > coupled nodes. While in our proposal, the parallel
> I/O
> > > with
> > > > > all
> > > > > > > the
> > > > > > > > >> >> > callbacks still running in one task, pre-allocated
> > > > > > computational
> > > > > > > resources
> > > > > > > > >> >> > are better utilized. It is a much more lightweight
> way
> > to
> > > > > > > perform parallel
> > > > > > > > >> >> > I/O.
> > > > > > > > >> >> >
> > > > > > > > >> >> > Just with what granularity those async requests
> should
> > be
> > > > > made.
> > > > > > > > >> >> > > Making state access asynchronous is definitely the
> > > right
> > > > > way
> > > > > > > to go!
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > I think the current proposal is based on such core
> > ideas:
> > > > > > > > >> >> >
> > > > > > > > >> >> >    - A pure cloud-native disaggregated state.
> > > > > > > > >> >> >    - Fully utilize the given resources and try not to
> > > waste
> > > > > > them
> > > > > > > (including
> > > > > > > > >> >> >    I/O).
> > > > > > > > >> >> >    - The ability to scale isolated resources (I/O or
> > CPU
> > > or
> > > > > > > memory)
> > > > > > > > >> >> >    independently.
> > > > > > > > >> >> >
> > > > > > > > >> >> > We think a fine-grained granularity is more inline
> with
> > > > those
> > > > > > > ideas,
> > > > > > > > >> >> > especially without local disk assumptions and without
> > any
> > > > > waste
> > > > > > > of I/O by
> > > > > > > > >> >> > prefetching. Please note that it is not a replacement
> > of
> > > > the
> > > > > > > original local
> > > > > > > > >> >> > state with synchronous execution. Instead this is a
> > > > solution
> > > > > > > embracing the
> > > > > > > > >> >> > cloud-native era, providing much more scalability and
> > > > > resource
> > > > > > > efficiency
> > > > > > > > >> >> > when handling a *huge state*.
> > > > > > > > >> >> >
> > > > > > > > >> >> > What also worries me a lot in this fine grained model
> > is
> > > > the
> > > > > > > effect on the
> > > > > > > > >> >> > > checkpointing times.
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > Your concerns are very reasonable. Faster
> checkpointing
> > > is
> > > > > > > always a core
> > > > > > > > >> >> > advantage of disaggregated state, but only for the
> > async
> > > > > phase.
> > > > > > > There will
> > > > > > > > >> >> > be some complexity introduced by in-flight requests,
> > but
> > > > I'd
> > > > > > > suggest a
> > > > > > > > >> >> > checkpoint containing those in-flight state requests
> as
> > > > part
> > > > > of
> > > > > > > the state,
> > > > > > > > >> >> > to accelerate the sync phase by skipping the buffer
> > > > draining.
> > > > > > > This makes
> > > > > > > > >> >> > the buffer size have little impact on checkpoint
> time.
> > > And
> > > > > all
> > > > > > > the changes
> > > > > > > > >> >> > keep within the execution model we proposed while the
> > > > > > checkpoint
> > > > > > > barrier
> > > > > > > > >> >> > alignment or handling will not be touched in our
> > > proposal,
> > > > > so I
> > > > > > > guess
> > > > > > > > >> >> > the complexity is relatively controllable. I have
> faith
> > > in
> > > > > that
> > > > > > > :)
> > > > > > > > >> >> >
> > > > > > > > >> >> > Also regarding the overheads, it would be great if
> you
> > > > could
> > > > > > > provide
> > > > > > > > >> >> > > profiling results of the benchmarks that you
> > conducted
> > > to
> > > > > > > verify the
> > > > > > > > >> >> > > results. Or maybe if you could describe the steps
> to
> > > > > > reproduce
> > > > > > > the
> > > > > > > > >> >> > results?
> > > > > > > > >> >> > > Especially "Hashmap (sync)" vs "Hashmap with async
> > > API".
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > > > >> >> > Yes we could profile the benchmarks. And for the
> > > comparison
> > > > > of
> > > > > > > "Hashmap
> > > > > > > > >> >> > (sync)" vs "Hashmap with async API", we conduct a
> > > Wordcount
> > > > > job
> > > > > > > written
> > > > > > > > >> >> > with async APIs but disabling the async execution by
> > > > directly
> > > > > > > completing
> > > > > > > > >> >> > the future using sync state access. This evaluates
> the
> > > > > overhead
> > > > > > > of newly
> > > > > > > > >> >> > introduced modules like 'AEC' in sync execution (even
> > > > though
> > > > > > > they are not
> > > > > > > > >> >> > designed for it). The code will be provided later.
> For
> > > > other
> > > > > > > results of our
> > > > > > > > >> >> > PoC[1], you can follow the instructions here[2] to
> > > > reproduce.
> > > > > > > Since the
> > > > > > > > >> >> > compilation may take some effort, we will directly
> > > provide
> > > > > the
> > > > > > > jar for
> > > > > > > > >> >> > testing next week.
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > And @Yunfeng Zhou, I have noticed your mail but it
> is a
> > > bit
> > > > > > late
> > > > > > > in my
> > > > > > > > >> >> > local time and the next few days are weekends. So I
> > will
> > > > > reply
> > > > > > > to you
> > > > > > > > >> >> > later. Thanks for your response!
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > [1]
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-PoCResults
> > > > > > > > >> >> > [2]
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-Appendix:HowtorunthePoC
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > Best,
> > > > > > > > >> >> > Zakelly
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > On Fri, Mar 1, 2024 at 6:38 PM Yunfeng Zhou <
> > > > > > > flink.zhouyunf...@gmail.com>
> > > > > > > > >> >> > wrote:
> > > > > > > > >> >> >
> > > > > > > > >> >> > > Hi,
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > Thanks for proposing this design! I just read
> > FLIP-424
> > > > and
> > > > > > > FLIP-425
> > > > > > > > >> >> > > and have some questions about the proposed changes.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > For Async API (FLIP-424)
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 1. Why do we need a close() method on
> StateIterator?
> > > This
> > > > > > > method seems
> > > > > > > > >> >> > > unused in the usage example codes.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 2. In FutureUtils.combineAll()'s JavaDoc, it is
> > stated
> > > > that
> > > > > > > "No null
> > > > > > > > >> >> > > entries are allowed". It might be better to further
> > > > explain
> > > > > > > what will
> > > > > > > > >> >> > > happen if a null value is passed, ignoring the
> value
> > in
> > > > the
> > > > > > > returned
> > > > > > > > >> >> > > Collection or throwing exceptions. Given that
> > > > > > > > >> >> > > FutureUtils.emptyFuture() can be returned in the
> > > example
> > > > > > code,
> > > > > > > I
> > > > > > > > >> >> > > suppose the former one might be correct.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > For Async Execution (FLIP-425)
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 1. According to Fig 2 of this FLIP, if a recordB
> has
> > > its
> > > > > key
> > > > > > > collide
> > > > > > > > >> >> > > with an ongoing recordA, its processElement()
> method
> > > can
> > > > > > still
> > > > > > > be
> > > > > > > > >> >> > > triggered immediately, and then it might be moved
> to
> > > the
> > > > > > > blocking
> > > > > > > > >> >> > > buffer in AEC if it involves state operations. This
> > > means
> > > > > > that
> > > > > > > > >> >> > > recordB's output will precede recordA's output in
> > > > > downstream
> > > > > > > > >> >> > > operators, if recordA involves state operations
> while
> > > > > recordB
> > > > > > > does
> > > > > > > > >> >> > > not. This will harm the correctness of Flink jobs
> in
> > > some
> > > > > use
> > > > > > > cases.
> > > > > > > > >> >> > > For example, in dim table join cases, recordA could
> > be
> > > a
> > > > > > delete
> > > > > > > > >> >> > > operation that involves state access, while recordB
> > > could
> > > > > be
> > > > > > > an insert
> > > > > > > > >> >> > > operation that needs to visit external storage
> > without
> > > > > state
> > > > > > > access.
> > > > > > > > >> >> > > If recordB's output precedes recordA's, then an
> entry
> > > > that
> > > > > is
> > > > > > > supposed
> > > > > > > > >> >> > > to finally exist with recordB's value in the sink
> > table
> > > > > might
> > > > > > > actually
> > > > > > > > >> >> > > be deleted according to recordA's command. This
> > > situation
> > > > > > > should be
> > > > > > > > >> >> > > avoided and the order of same-key records should be
> > > > > strictly
> > > > > > > > >> >> > > preserved.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 2. The FLIP says that StateRequests submitted by
> > > > Callbacks
> > > > > > > will not
> > > > > > > > >> >> > > invoke further yield() methods. Given that yield()
> is
> > > > used
> > > > > > > when there
> > > > > > > > >> >> > > is "too much" in-flight data, does it mean
> > > StateRequests
> > > > > > > submitted by
> > > > > > > > >> >> > > Callbacks will never be "too much"? What if the
> total
> > > > > number
> > > > > > of
> > > > > > > > >> >> > > StateRequests exceed the capacity of Flink
> operator's
> > > > > memory
> > > > > > > space?
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 3. In the "Watermark" section, this FLIP provided
> an
> > > > > > > out-of-order
> > > > > > > > >> >> > > execution mode apart from the default
> > strictly-ordered
> > > > > mode,
> > > > > > > which can
> > > > > > > > >> >> > > optimize performance by allowing more concurrent
> > > > > executions.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 3.1 I'm concerned that the out-of-order execution
> > mode,
> > > > > along
> > > > > > > with the
> > > > > > > > >> >> > > epoch mechanism, would bring more complexity to the
> > > > > execution
> > > > > > > model
> > > > > > > > >> >> > > than the performance improvement it promises. Could
> > we
> > > > add
> > > > > > some
> > > > > > > > >> >> > > benchmark results proving the benefit of this mode?
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 3.2 The FLIP might need to add a public API section
> > > > > > describing
> > > > > > > how
> > > > > > > > >> >> > > users or developers can switch between these two
> > > > execution
> > > > > > > modes.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 3.3 Apart from the watermark and checkpoint
> mentioned
> > > in
> > > > > this
> > > > > > > FLIP,
> > > > > > > > >> >> > > there are also more other events that might appear
> in
> > > the
> > > > > > > stream of
> > > > > > > > >> >> > > data records. It might be better to generalize the
> > > > > execution
> > > > > > > mode
> > > > > > > > >> >> > > mechanism to handle all possible events.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 4. It might be better to treat callback-handling
> as a
> > > > > > > > >> >> > > MailboxDefaultAction, instead of Mails, to avoid
> the
> > > > > overhead
> > > > > > > of
> > > > > > > > >> >> > > repeatedly creating Mail objects.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 5. Could this FLIP provide the current default
> values
> > > for
> > > > > > > things like
> > > > > > > > >> >> > > active buffer size thresholds and timeouts? These
> > could
> > > > > help
> > > > > > > with
> > > > > > > > >> >> > > memory consumption and latency analysis.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 6. Why do we need to record the hashcode of a
> record
> > in
> > > > its
> > > > > > > > >> >> > > RecordContext? It seems not used.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 7. In "timers can be stored on the JVM heap or
> > > RocksDB",
> > > > > the
> > > > > > > link
> > > > > > > > >> >> > > points to a document in flink-1.15. It might be
> > better
> > > to
> > > > > > > verify the
> > > > > > > > >> >> > > referenced content is still valid in the latest
> Flink
> > > and
> > > > > > > update the
> > > > > > > > >> >> > > link accordingly. Same for other references if any.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > Best,
> > > > > > > > >> >> > > Yunfeng Zhou
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > On Thu, Feb 29, 2024 at 2:17 PM Yuan Mei <
> > > > > > > yuanmei.w...@gmail.com> wrote:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Hi Devs,
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > This is a joint work of Yuan Mei, Zakelly Lan,
> > > Jinzhong
> > > > > Li,
> > > > > > > Hangxiang
> > > > > > > > >> >> > Yu,
> > > > > > > > >> >> > > > Yanfei Lei and Feng Wang. We'd like to start a
> > > > discussion
> > > > > > > about
> > > > > > > > >> >> > > introducing
> > > > > > > > >> >> > > > Disaggregated State Storage and Management in
> Flink
> > > > 2.0.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > The past decade has witnessed a dramatic shift in
> > > > Flink's
> > > > > > > deployment
> > > > > > > > >> >> > > mode,
> > > > > > > > >> >> > > > workload patterns, and hardware improvements.
> We've
> > > > moved
> > > > > > > from the
> > > > > > > > >> >> > > > map-reduce era where workers are
> > computation-storage
> > > > > > tightly
> > > > > > > coupled
> > > > > > > > >> >> > > nodes
> > > > > > > > >> >> > > > to a cloud-native world where containerized
> > > deployments
> > > > > on
> > > > > > > Kubernetes
> > > > > > > > >> >> > > > become standard. To enable Flink's Cloud-Native
> > > future,
> > > > > we
> > > > > > > introduce
> > > > > > > > >> >> > > > Disaggregated State Storage and Management that
> > uses
> > > > DFS
> > > > > as
> > > > > > > primary
> > > > > > > > >> >> > > storage
> > > > > > > > >> >> > > > in Flink 2.0, as promised in the Flink 2.0
> Roadmap.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Design Details can be found in FLIP-423[1].
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > This new architecture is aimed to solve the
> > following
> > > > > > > challenges
> > > > > > > > >> >> > brought
> > > > > > > > >> >> > > in
> > > > > > > > >> >> > > > the cloud-native era for Flink.
> > > > > > > > >> >> > > > 1. Local Disk Constraints in containerization
> > > > > > > > >> >> > > > 2. Spiky Resource Usage caused by compaction in
> the
> > > > > current
> > > > > > > state model
> > > > > > > > >> >> > > > 3. Fast Rescaling for jobs with large states
> > > (hundreds
> > > > of
> > > > > > > Terabytes)
> > > > > > > > >> >> > > > 4. Light and Fast Checkpoint in a native way
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > More specifically, we want to reach a consensus
> on
> > > the
> > > > > > > following issues
> > > > > > > > >> >> > > in
> > > > > > > > >> >> > > > this discussion:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > 1. Overall design
> > > > > > > > >> >> > > > 2. Proposed Changes
> > > > > > > > >> >> > > > 3. Design details to achieve Milestone1
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > In M1, we aim to achieve an end-to-end baseline
> > > version
> > > > > > > using DFS as
> > > > > > > > >> >> > > > primary storage and complete core
> functionalities,
> > > > > > including:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > - Asynchronous State APIs (FLIP-424)[2]:
> Introduce
> > > new
> > > > > APIs
> > > > > > > for
> > > > > > > > >> >> > > > asynchronous state access.
> > > > > > > > >> >> > > > - Asynchronous Execution Model (FLIP-425)[3]:
> > > > Implement a
> > > > > > > non-blocking
> > > > > > > > >> >> > > > execution model leveraging the asynchronous APIs
> > > > > introduced
> > > > > > > in
> > > > > > > > >> >> > FLIP-424.
> > > > > > > > >> >> > > > - Grouping Remote State Access (FLIP-426)[4]:
> > Enable
> > > > > > > retrieval of
> > > > > > > > >> >> > remote
> > > > > > > > >> >> > > > state data in batches to avoid unnecessary
> > round-trip
> > > > > costs
> > > > > > > for remote
> > > > > > > > >> >> > > > access
> > > > > > > > >> >> > > > - Disaggregated State Store (FLIP-427)[5]:
> > Introduce
> > > > the
> > > > > > > initial
> > > > > > > > >> >> > version
> > > > > > > > >> >> > > of
> > > > > > > > >> >> > > > the ForSt disaggregated state store.
> > > > > > > > >> >> > > > - Fault Tolerance/Rescale Integration
> > (FLIP-428)[6]:
> > > > > > > Integrate
> > > > > > > > >> >> > > > checkpointing mechanisms with the disaggregated
> > state
> > > > > store
> > > > > > > for fault
> > > > > > > > >> >> > > > tolerance and fast rescaling.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > We will vote on each FLIP in separate threads to
> > make
> > > > > sure
> > > > > > > each FLIP
> > > > > > > > >> >> > > > reaches a consensus. But we want to keep the
> > > discussion
> > > > > > > within a
> > > > > > > > >> >> > focused
> > > > > > > > >> >> > > > thread (this thread) for easier tracking of
> > contexts
> > > to
> > > > > > avoid
> > > > > > > > >> >> > duplicated
> > > > > > > > >> >> > > > questions/discussions and also to think of the
> > > > > > > problem/solution in a
> > > > > > > > >> >> > full
> > > > > > > > >> >> > > > picture.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Looking forward to your feedback
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Best,
> > > > > > > > >> >> > > > Yuan, Zakelly, Jinzhong, Hangxiang, Yanfei and
> Feng
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > [1] https://cwiki.apache.org/confluence/x/R4p3EQ
> > > > > > > > >> >> > > > [2] https://cwiki.apache.org/confluence/x/SYp3EQ
> > > > > > > > >> >> > > > [3] https://cwiki.apache.org/confluence/x/S4p3EQ
> > > > > > > > >> >> > > > [4] https://cwiki.apache.org/confluence/x/TYp3EQ
> > > > > > > > >> >> > > > [5] https://cwiki.apache.org/confluence/x/T4p3EQ
> > > > > > > > >> >> > > > [6] https://cwiki.apache.org/confluence/x/UYp3EQ
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
