Hi Aleksandr,

Thanks for the great feedback. Your points on guaranteed delivery and the
*FileEventsReporter* are spot on, and I agree with your reasoning. I'll
update the FLIP to incorporate them, as it will make the proposal much
stronger.

Regarding the delivery guarantee, I'll add a new configuration key,
*events.reporter.<name>.delivery.guarantee*, that selects one of two modes
per reporter. The default will be best-effort: asynchronous, non-blocking
dispatch through the queue. The alternative will be a guaranteed mode:
synchronous, blocking dispatch that bypasses the queue, which fits the
critical autoscaling use case you mentioned.
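
To make the two modes concrete, here is a rough sketch of how the dispatch
path could branch on the configured guarantee. It is only an illustration:
the class names (DeliverySketch, Dispatcher, DeliveryGuarantee), the config
values "best-effort" and "guaranteed", and the use of a plain String as the
event type are all assumptions of mine and not part of FLIP-481 or the
design doc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these types exist in Flink today. */
final class DeliverySketch {

    /** Assumed values for events.reporter.<name>.delivery.guarantee. */
    enum DeliveryGuarantee {
        BEST_EFFORT,
        GUARANTEED;

        /** Parses e.g. "best-effort" or "guaranteed"; a missing value means the default. */
        static DeliveryGuarantee fromConfig(String value) {
            if (value == null || value.trim().isEmpty()) {
                return BEST_EFFORT;
            }
            return valueOf(value.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
        }
    }

    /** Routes events either straight to the reporter or through a bounded queue. */
    static final class Dispatcher {
        private final DeliveryGuarantee guarantee;
        private final Consumer<String> reporter;
        private final BlockingQueue<String> queue;

        Dispatcher(DeliveryGuarantee guarantee, Consumer<String> reporter, int capacity) {
            this.guarantee = guarantee;
            this.reporter = reporter;
            this.queue = new ArrayBlockingQueue<>(capacity);
        }

        /** Returns true if the event was delivered or enqueued, false if it was dropped. */
        boolean dispatch(String event) {
            if (guarantee == DeliveryGuarantee.GUARANTEED) {
                // Synchronous path: the caller blocks until the reporter returns.
                reporter.accept(event);
                return true;
            }
            // Best-effort path: never blocks; offer() returns false when the queue is full.
            return queue.offer(event);
        }

        /** Hands queued events to the reporter; a background thread would call this. */
        int drain() {
            List<String> batch = new ArrayList<>();
            queue.drainTo(batch);
            batch.forEach(reporter);
            return batch.size();
        }
    }
}
```

In this sketch a best-effort reporter drops events once its queue is full,
while a guaranteed reporter pays for delivery with caller latency, which is
the trade-off the new key would let users choose per reporter.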

On your question about the *FileEventsReporter*, you're right that a local
file append is cheap. The async core isn't aimed at the
*FileEventsReporter* specifically; it targets the general case where
reporters write to network sinks (e.g., *OpenTelemetry*), where latency and
backpressure are real concerns. The file reporter is just meant to be a
simple, built-in option for users.

I'll get these changes into the design doc shortly and will follow up on
this thread once it's updated. Thanks again for helping improve the FLIP.

Best,
Kartikey

On Thu, Aug 21, 2025 at 11:19 PM Aleksandr Iushmanov <izeren...@gmail.com>
wrote:

> Hi Kartikey,
>
> I like the idea and I agree with general direction, thank you for
> putting it together!
>
> I have one concern about making this modification "forced", imho there
> should be a room for "guaranteed important events delivery" from the
> operations point of view. If Flink job is struggling/backpressured it
> may make sense to emit some events at priority that would be used for
> external triggers like "autoscaling" or external dynamic configuration
> tuning.
>
> Imho, interfaces should either allow to choose "sync" vs "non guaranteed
> async" delivery for different events (or event reporters). With proposal
> "as is" it won't be possible to "ensure" that important messages have
> been delivered and can be actioned by external monitoring system. Could
> we make "queue / async" behaviour opt-in?
>
> Second question I had was around FileEventReporter implementation, at a
> glance, "append to file" is a fairly cheap operation, do you have a
> concern that amount of events is large enough to have significant
> bottleneck on disk IO and requires memory queue?
>
> Kind regards,
>
> Aleksandr Iushmanov
>
>
> On 2025/08/19 06:56:36 Kartikey Pant wrote:
>  > Hi everyone,
>  >
>  > I'd like to propose a new FLIP that builds directly on the excellent
>  > foundation laid by FLIP-481 (Introduce Event Reporting). For anyone
>  > needing context, the original proposal is available here:
>  >
>  > https://cwiki.apache.org/confluence/display/FLINK/FLIP-481%3A+Introduce+Event+Reporting
>  >
>  > Now that the community has this powerful API, the logical next step is
>  > to ensure it's fully robust for large-scale production environments
>  > where users will be writing their own diverse, custom reporters.
>  >
>  > This proposal focuses on one key enhancement: introducing a resilient,
>  > asynchronous dispatch core. The goal is to decouple event generation
>  > from the reporter's execution, ensuring that a slow or experimental
>  > sink can never impact Flink's core stability.
>  >
>  > I've drafted a detailed design document that I hope can form the basis
>  > of this new FLIP:
>  >
>  > https://docs.google.com/document/d/1CCu7Js0ATOAgqRMS-kWj_0v0G_jt2r9IfMB2Oty7KJo/edit?usp=sharing
>  >
>  > I'm keen to get the community's initial feedback on this direction
>  > before moving forward with the formal process.
>  >
>  > Thanks,
>  > Kartikey Pant
>  >
>
