Sorry for the late response. I prefer to use `table.exec.udf-metric-enabled` as the option name.
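For illustration only, a minimal sketch of how an option with that key could be declared with Flink's ConfigOptions builder; the holder class, default value, and description below are assumptions, not part of the proposal:

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical holder class; only the option key comes from this thread.
public class UdfMetricOptions {

    // Disabled by default so existing jobs pay no overhead unless they opt in
    // (the default value and description are assumptions for illustration).
    public static final ConfigOption<Boolean> UDF_METRIC_ENABLED =
            ConfigOptions.key("table.exec.udf-metric-enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription(
                            "Whether to register per-UDF metrics such as "
                                    + "processing time and exception count.");
}
```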
Best,
Shengkai

Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Aug 13, 2025 at 23:54:

Hi Shengkai, Alan, Xuyang, and all,

Since there have been no further objections, I'll proceed to start the VOTE on this proposal shortly.

Thanks,
Weiqing

On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:

Hi Shengkai, Alan and Xuyang,

Just checking in - do you have any concerns or feedback?

If there are no further objections from anyone, I'll mark the FLIP as ready for voting.

Best,
Weiqing

On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:

Hi Xuyang,

Thank you for reviewing the proposal!

I'm planning to use *udf.metrics.process-time* and *udf.metrics.exception-count*. These follow the naming convention used in Flink (e.g., RocksDB native metrics <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>). I've added these names to the proposal doc.

Alternatively, I also considered *metrics.udf.process-time.enabled* and *metrics.udf.exception-count.enabled*.

Happy to hear any feedback on which style might be more appropriate.

Best,
Weiqing

On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:

Hi, Weiqing.

Thanks for driving this improvement. I just have one question: I notice that a new configuration option is introduced in this FLIP. Could you please include its full name (similar to the other names in MetricOptions)?

--
Best!
Xuyang

On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:

Hi Alan,

Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE work.

Yes, I've updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE. For async UDFs, the plan is to instrument both the invokeAsync() call and the async callback handler, so that we measure the full end-to-end latency until the result or error is returned from the future.

Let me know if you have any further questions or suggestions.

Best,
Weiqing
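(For illustration only, a minimal sketch of the kind of wrapping described in the reply above: time an async UDF invocation from the call until its future completes, and count failures. The helper class, method name, and metric objects are assumptions, not part of the FLIP.)

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Histogram;

// Hypothetical helper for async UDFs: records end-to-end latency from the
// moment the UDF is invoked until the associated future completes, and
// increments a counter when it completes exceptionally.
public final class AsyncUdfInstrumentation {

    private AsyncUdfInstrumentation() {}

    public static <T> CompletableFuture<T> timed(
            CompletableFuture<T> udfFuture, Histogram latencyNanos, Counter exceptionCount) {
        final long start = System.nanoTime();
        return udfFuture.whenComplete(
                (result, error) -> {
                    // Record the full latency regardless of the outcome.
                    latencyNanos.update(System.nanoTime() - start);
                    if (error != null) {
                        exceptionCount.inc();
                    }
                });
    }
}
```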
On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg <asheinb...@confluent.io.invalid> wrote:

Hi Weiqing,

From your doc, the entry point for UDF calls in the codegen is ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, and that could be instrumented with metrics. This works well for synchronous calls, but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE (https://github.com/apache/flink/pull/26567)? Timing metrics would only account for the time it takes to call invokeAsync, not for the result to complete (with a result or error from the future object).

There are appropriate places that can handle the async callbacks, but they are in other locations. Will you be able to support those as well?

Thanks,
Alan

On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:

I just have some questions:

1. The current metrics hierarchy shows that the UDF metric group belongs to the TaskMetricGroup. I think it would be better for the UDF metric group to belong to the OperatorMetricGroup instead, because a UDF might be used by multiple operators.
2. What are the naming conventions for UDF metrics? Could you provide an example? Does the metric name contain the UDF name?
3. Why is the UDFExceptionCount metric introduced? If a UDF throws an exception, the job fails immediately. Why do we need to track this value?

Best,
Shengkai

Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025 at 12:59:

Hi all,

I'd like to initiate a discussion about adding UDF metrics.

*Motivation*

User-defined functions (UDFs) are essential for custom logic in Flink jobs but often act as black boxes, making debugging and performance tuning difficult. When issues like high latency or frequent exceptions occur, it's hard to pinpoint the root cause inside UDFs.

Flink currently lacks built-in metrics for key UDF aspects such as per-record processing time or exception count. This limits observability and complicates:

- Debugging production issues
- Performance tuning and resource allocation
- Supplying reliable signals to autoscaling systems

Introducing standard, opt-in UDF metrics will improve platform observability and overall health.

Here's the proposal document: Link <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>

Your feedback and ideas are welcome to refine this feature.

Thanks,
Weiqing
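(Closing illustration, not part of the proposal: a minimal sketch of how per-UDF metrics might be registered under an operator-scoped metric group, along the lines of points 1 and 2 above. The group layout, metric identifiers, and helper class are assumptions.)

```java
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.MetricGroup;

// Hypothetical registration helper: nests the metrics under
// <operator scope>.udf.<udfName> so that each operator using the same UDF
// gets its own values and the metric scope carries the UDF name.
public final class UdfMetricsRegistration {

    private UdfMetricsRegistration() {}

    public static Counter registerExceptionCount(MetricGroup operatorGroup, String udfName) {
        return operatorGroup.addGroup("udf").addGroup(udfName).counter("exceptionCount");
    }

    public static Histogram registerProcessTime(
            MetricGroup operatorGroup, String udfName, Histogram histogramImpl) {
        // The Histogram implementation is supplied by the caller; Flink's
        // MetricGroup#histogram only registers it under the given name.
        return operatorGroup
                .addGroup("udf")
                .addGroup(udfName)
                .histogram("processTime", histogramImpl);
    }
}
```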