Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter with an Asynchronous Core

Kartikey Pant Wed, 12 Nov 2025 23:29:19 -0800

Hi Li and David,

Thank you both for the great questions.


Li - Thanks for the support. To your question: Yes, the circuit breaker is
*100% per-reporter*. A faulty sink (one that throws errors) will trip its
own breaker. A slow sink (one that just blocks) will simply be isolated to
a thread in the worker pool. In either case, it will not block or affect
dispatching to any other configured reporters. This isolation is the core
of the new design.
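
To illustrate the per-reporter isolation, here is a minimal sketch, not the
FLIP's actual implementation. The class name PerReporterBreakers, the
consecutive-failure threshold, and the method names are all hypothetical; a
real breaker would also need a reset / half-open policy, which is omitted
here. The only point it shows is that breaker state is keyed by reporter
name, so one reporter's failures never touch another's.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one failure counter per reporter name, no shared state.
class PerReporterBreakers {
    private final int failureThreshold; // assumed policy: N consecutive failures
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    PerReporterBreakers(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // A breaker is open once its own reporter reached the failure threshold.
    boolean isOpen(String reporter) {
        AtomicInteger count = failures.get(reporter);
        return count != null && count.get() >= failureThreshold;
    }

    // Runs the send unless this reporter's breaker is open.
    // Returns true only if the send ran and completed without throwing.
    boolean dispatch(String reporter, Runnable send) {
        if (isOpen(reporter)) {
            return false; // skipped; other reporters are unaffected
        }
        AtomicInteger count = failures.computeIfAbsent(reporter, k -> new AtomicInteger());
        try {
            send.run();
            count.set(0); // a success resets the consecutive-failure count
            return true;
        } catch (RuntimeException e) {
            count.incrementAndGet();
            return false;
        }
    }

    public static void main(String[] args) {
        PerReporterBreakers breakers = new PerReporterBreakers(2);
        for (int i = 0; i < 3; i++) {
            breakers.dispatch("faulty", () -> { throw new IllegalStateException("sink down"); });
            breakers.dispatch("healthy", () -> { });
        }
        System.out.println("faulty open:  " + breakers.isOpen("faulty"));
        System.out.println("healthy open: " + breakers.isOpen("healthy"));
    }
}
```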

David - Thank you for the detailed operational questions.


   1. *On Benchmarks*: The primary benefit is JobManager stability. So far
   the evidence is qualitative: the addEvent() call changes from a blocking
   I/O operation (in sync mode) to a non-blocking, near-constant-time
   queue.offer() (in async-queued mode). The integration tests are
   specifically designed to validate that the caller thread (i.e., the
   JobManager) remains unblocked, even when one reporter is slow. For the
   quantitative impact of async vs. sync, I will add a benchmark once the
   implementation is in place.
   2. *On Dropped Events*: You are correct. This is the explicit tradeoff:
   the async dispatcher prioritizes JM stability over guaranteed event
   delivery. The sync dispatcher remains the default for users who need
   guaranteed, blocking delivery. The new metrics (events.droppedCount,
   events.reporter.<name>.latency, events.reporter.<name>.failureCount) are
   the diagnostic tools for operators to identify and resolve the source of
   backpressure.
   3. *On the Queue (Sync vs. Async):* In sync mode, there is no queue; the
   JM thread is the "queue" and blocks. In async mode, we use a bounded
   in-memory queue, configurable via events.dispatcher.queue.size (default:
   1024). We also provide the events.dispatcher.queue.currentUsage gauge,
   which allows operators to set up their own "90% full" alerts as you
   suggested.
   4. *On Using Kafka*: In my opinion, this would be overkill for the
   dispatcher. The goal is lightweight thread isolation. If a user needs
   Kafka's durability, they should implement a KafkaEventReporter. This FLIP
   makes that pattern safe, as the reporter's blocking calls will be isolated
   to the worker pool and won't affect the JM or other reporters.
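
To make points 1-3 concrete, here is a minimal sketch of the async-queued
path under stated assumptions. The class BoundedEventQueue and its methods
are hypothetical names for illustration only; the metric and config names in
the comments are the ones mentioned above. I am also assuming here that the
usage gauge reports a fraction between 0.0 and 1.0, which may not match the
final implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a bounded, drop-on-full event queue.
class BoundedEventQueue<E> {
    private final int capacity;
    private final BlockingQueue<E> queue;
    private final AtomicLong droppedCount = new AtomicLong(); // cf. events.droppedCount

    BoundedEventQueue(int capacity) { // cf. events.dispatcher.queue.size
        this.capacity = capacity;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called on the caller (JobManager) thread. offer() never blocks:
    // if the queue is full the event is dropped and counted.
    boolean addEvent(E event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            droppedCount.incrementAndGet();
        }
        return accepted;
    }

    // Called by a worker thread; returns null when the queue is empty.
    E poll() {
        return queue.poll();
    }

    long droppedCount() {
        return droppedCount.get();
    }

    // Assumed gauge semantics (cf. events.dispatcher.queue.currentUsage):
    // fraction of the queue in use, so an alert can fire at >= 0.9.
    double currentUsage() {
        return (double) queue.size() / capacity;
    }

    public static void main(String[] args) {
        BoundedEventQueue<String> q = new BoundedEventQueue<>(10);
        for (int i = 0; i < 12; i++) {
            q.addEvent("event-" + i);
        }
        System.out.println("dropped: " + q.droppedCount());
        System.out.println("usage:   " + q.currentUsage());
        if (q.currentUsage() >= 0.9) {
            System.out.println("WARN: dispatcher queue is at least 90% full");
        }
    }
}
```

The design choice is that a full queue costs the operator some events (visible
in the dropped counter) rather than costing the JobManager a blocked thread.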

Thanks,
Kartikey


On Mon, Nov 10, 2025 at 5:44 PM David Radley <[email protected]>
wrote:

> Hi Li,
> Some comments on the Flip:
>
>   * Have you created a benchmark for this to showcase the benefits of the
>     single thread and the asynchronous approach? I assume we can get more
>     throughput; it would be good to quantify the improvement.
>   * You mention counting dropped events in back pressure situations, so
>     this means that the metrics will be incomplete. It would be good to see
>     a picture of this happening and what the user can do in this situation.
>   * You talk about the dispatch queue becoming full. Can you detail what you
>     mean here and how it is different in the sync and async cases? How big
>     is the queue and can it be increased? Can we put out warnings around
>     90% full?
>   * Would a Kafka queue be overkill / possible to solve this full queue
>     problem? The separation between production and consumption that Kafka
>     brings seems applicable here.
>
> Kind regards, David.
>
> From: Li Wang <[email protected]>
> Date: Monday, 10 November 2025 at 11:45
> To: [email protected] <[email protected]>
> Subject: [EXTERNAL] Re: [DISCUSS] FLIP-XXX: Hardening the Event Reporter
> with an Asynchronous Core
>
> Hi all,
>
> Just want to check on this thread.
>
> I think FLIP-545 is important work. The new async dispatcher will help
> Flink stability a lot when we use custom reporters - it helps unblock the
> JobManager.
>
> Kartikey, I have a quick question on the circuit breaker logic. Is the
> state managed per-reporter (so each reporter is isolated), or will one
> faulty reporter potentially stop dispatches for all reporters? This is a
> key detail for our setup.
>
> Ready to see the [VOTE] start soon. Thank you for the FLIP.
>
> Thanks,
> Li
>
> On Wed, Oct 1, 2025 at 12:32 PM Kartikey Pant <[email protected]>
> wrote:
>
> > Hi all,
> >
> > Circling back on this thread.
> >
> > Thanks to the great feedback from the earlier discussion, the proposal has
> > been updated to use a more flexible, interface-based design. The final FLIP
> > is available on the Cwiki [1] (thanks, Piotr, for creating the page).
> >
> > My intention is to move this to a formal vote next week.
> >
> > Before I do, please raise any blocking concerns by this Friday, October
> > 3rd. If there are no blocking issues, I will start the [VOTE] thread on
> > Monday.
> >
> > Thanks,
> > Kartikey
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
> >
> > On Tue, Sep 2, 2025 at 5:00 PM Piotr Nowojski <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Here you go:
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-545%3A+Hardening+the+Event+Reporter+with+an+Asynchronous+Core
> > >
> > > Best,
> > > Piotrek
> > >
> > > On Mon, 1 Sep 2025 at 19:37 Kartikey Pant <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks, Aleksandr, for the great suggestion on using an
> > > > interface-based strategy. It's a much cleaner approach that ensures
> > > > backward compatibility while keeping the design extensible.
> > > >
> > > > Based on this feedback, I've updated the FLIP document. The design now
> > > > uses an EventDispatcher interface, controlled by a single
> > > > events.dispatcher.type config key, allowing users to opt in to the new
> > > > asynchronous behavior.
> > > >
> > > > I believe the proposal has now stabilized. As I don't have Confluence
> > > > write access, could a committer please help assign an official FLIP
> > > > number to this:
> > > >
> > > > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?tab=t.0
> > > >
> > > > Best,
> > > > Kartikey Pant
> > > >
> > > >
> > > > On Tue, Aug 26, 2025 at 11:13 PM Aleksandr Iushmanov
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi Kartikey,
> > > > >
> > > > > Thank you for looking into this.
> > > > >
> > > > > I might not be very familiar with the naming conventions in Flink,
> > > > > so please bear with me if my suggestion doesn't make complete sense.
> > > > > I suggest introducing a feature flag, something like:
> > > > >
> > > > > > events.reporter.<name>.dispatcher.type
> > > > >
> > > > > which would default to *sync* to make this change backwards
> > > > > compatible.
> > > > >
> > > > > Also, are there any reasons why we would not want to introduce an
> > > > > interface with two implementations?
> > > > > 1. sync: for the existing behaviour.
> > > > > 2. memory-queue: for the proposed implementation with the queue.
> > > > >
> > > > > This way:
> > > > >
> > > > >    - we don't break anything by default
> > > > >    - we can change the default in future releases once it has been
> > > > >    proven to be stable
> > > > >    - we keep the door open for other implementations (e.g. file-based
> > > > >    queue or spillover to logs).
> > > > >
> > > > >
> > > > > I look forward to hearing your thoughts on it.
> > > > >
> > > > > Kind regards,
> > > > > Aleksandr Iushmanov
> > > > >
> > > > >
> > > > > On Fri, 22 Aug 2025 at 09:54, Kartikey Pant <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Aleksandr,
> > > > > >
> > > > > > Thanks for the great feedback. Your points on guaranteed delivery and
> > > > > > the *FileEventsReporter* are spot on, and I agree with your reasoning.
> > > > > > I'll update the FLIP to incorporate them, as it will make the proposal
> > > > > > much stronger.
> > > > > >
> > > > > > Regarding the delivery guarantee, I'll add a new configuration key,
> > > > > > *events.reporter.<name>.delivery.guarantee*, to allow a choice between
> > > > > > two modes. The default will be best-effort for the asynchronous,
> > > > > > non-blocking dispatch. I'll also add a guaranteed mode for a
> > > > > > synchronous, blocking dispatch that bypasses the queue, perfect for
> > > > > > the critical autoscaling use case you mentioned.
> > > > > >
> > > > > > On your question about the *FileEventsReporter*, you're right that a
> > > > > > local file append is cheap. The async core isn't really designed for
> > > > > > the *FileEventsReporter* specifically, but for the general case where
> > > > > > reporters write to network sinks (e.g., *OpenTelemetry*) where latency
> > > > > > and backpressure are real concerns. The file reporter is just meant to
> > > > > > be a simple, built-in option for users.
> > > > > >
> > > > > > I'll get these changes into the design doc shortly and will follow up
> > > > > > on this thread once it's updated. Thanks again for helping improve the
> > > > > > FLIP.
> > > > > >
> > > > > > Best,
> > > > > > Kartikey
> > > > > >
> > > > > > On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Kartikey,
> > > > > > >
> > > > > > > I like the idea and I agree with general direction, thank you for
> > > > > > > putting it together!
> > > > > > >
> > > > > > > I have one concern about making this modification "forced", imho
> > > > > > > there should be a room for "guaranteed important events delivery"
> > > > > > > from the operations point of view. If Flink job is
> > > > > > > struggling/backpressured it may make sense to emit some events at
> > > > > > > priority that would be used for external triggers like
> > > > > > > "autoscaling" or external dynamic configuration tuning.
> > > > > > >
> > > > > > > Imho, interfaces should either allow to choose "sync" vs "non
> > > > > > > guaranteed async" delivery for different events (or event
> > > > > > > reporters). With proposal "as is" it won't be possible to "ensure"
> > > > > > > that important messages have been delivered and can be actioned by
> > > > > > > external monitoring system. Could we make "queue / async" behaviour
> > > > > > > opt-in?
> > > > > > >
> > > > > > > Second question I had was around FileEventReporter implementation,
> > > > > > > at a glance, "append to file" is a fairly cheap operation, do you
> > > > > > > have a concern that amount of events is large enough to have
> > > > > > > significant bottleneck on disk IO and requires memory queue?
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Aleksandr Iushmanov
> > > > > > >
> > > > > > >
> > > > > > > On 2025/08/19 06:56:36 Kartikey Pant wrote:
> > > > > > >  > Hi everyone,
> > > > > > >  >
> > > > > > >  > I'd like to propose a new FLIP that builds directly on the excellent
> > > > > > >  > foundation laid by FLIP-481 (Introduce Event Reporting). For anyone
> > > > > > >  > needing context, the original proposal is available here:
> > > > > > >  >
> > > > > > >  > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
> > > > > > >  >
> > > > > > >  > Now that the community has this powerful API, the logical next step is
> > > > > > >  > to ensure it's fully robust for large-scale production environments
> > > > > > >  > where users will be writing their own diverse, custom reporters.
> > > > > > >  >
> > > > > > >  > This proposal focuses on one key enhancement: introducing a resilient,
> > > > > > >  > asynchronous dispatch core. The goal is to decouple event generation
> > > > > > >  > from the reporter's execution, ensuring that a slow or experimental
> > > > > > >  > sink can never impact Flink's core stability.
> > > > > > >  >
> > > > > > >  > I've drafted a detailed design document that I hope can form the basis
> > > > > > >  > of this new FLIP:
> > > > > > >  >
> > > > > > >  > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
> > > > > > >  >
> > > > > > >  > I'm keen to get the community's initial feedback on this direction
> > > > > > >  > before moving forward with the formal process.
> > > > > > >  >
> > > > > > >  > Thanks,
> > > > > > >  > Kartikey Pant
> > > > > > >  >
