Hi Shengkai, Alan and Xuyang,

Just checking in - do you have any concerns or feedback?

If there are no further objections from anyone, I’ll mark the FLIP as ready
for voting.


Best,
Weiqing


On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com>
wrote:

> Hi Xuyang,
>
> Thank you for reviewing the proposal!
>
> I’m planning to use: *udf.metrics.process-time* and
> *udf.metrics.exception-count*. These follow the naming convention used in
> Flink (e.g., RocksDB native metrics
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> I’ve added these names to the proposal doc.
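>
> For illustration, here is a rough sketch of how these options could be
> declared, following the ConfigOptions pattern used in MetricOptions (the
> field names, defaults, and descriptions below are my assumptions, not
> final):
>
>     import org.apache.flink.configuration.ConfigOption;
>     import org.apache.flink.configuration.ConfigOptions;
>
>     // Sketch only: opt-in flags for the proposed UDF metrics.
>     public static final ConfigOption<Boolean> UDF_PROCESS_TIME_METRIC =
>             ConfigOptions.key("udf.metrics.process-time")
>                     .booleanType()
>                     .defaultValue(false)
>                     .withDescription(
>                             "Enables the per-record processing time metric for UDFs.");
>
>     public static final ConfigOption<Boolean> UDF_EXCEPTION_COUNT_METRIC =
>             ConfigOptions.key("udf.metrics.exception-count")
>                     .booleanType()
>                     .defaultValue(false)
>                     .withDescription(
>                             "Enables the exception count metric for UDFs.");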
>
> Alternatively, I also considered: *metrics.udf.process-time.enabled* and
> *metrics.udf.exception-count.enabled*.
>
> Happy to hear any feedback on which style might be more appropriate.
>
>
> Best,
> Weiqing
>
> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
>
>> Hi, Weiqing.
>>
>> Thanks for driving this improvement. I just have one question: I noticed
>> that a new configuration option is introduced in this FLIP, and I wonder
>> what its name is. Could you please include the full name of this
>> configuration (similar to the other names in MetricOptions)?
>>
>>
>>
>>
>> --
>>
>>     Best!
>>     Xuyang
>>
>>
>>
>>
>>
>> On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>> >Hi Alan,
>> >
>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE work.
>> >
>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
>> >For async UDFs, the plan is to instrument both the invokeAsync() call and
>> >the async callback handler to measure the full end-to-end latency until the
>> >result or error is returned from the future.
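>> >
>> >As a rough sketch of the idea (assuming `udf` is the async function
>> >instance and the two metrics are already registered; these names are
>> >placeholders, not the final implementation):
>> >
>> >  // Measure from dispatch until the future completes, so the histogram
>> >  // covers the full async round trip rather than just the invocation.
>> >  long startNanos = System.nanoTime();
>> >  CompletableFuture<Object> resultFuture = new CompletableFuture<>();
>> >  udf.eval(resultFuture, args);  // AsyncScalarFunction-style entry point
>> >  resultFuture.whenComplete((result, error) -> {
>> >      processTimeNanos.update(System.nanoTime() - startNanos);
>> >      if (error != null) {
>> >          exceptionCount.inc();
>> >      }
>> >  });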
>> >
>> >Let me know if you have any further questions or suggestions.
>> >
>> >Best,
>> >Weiqing
>> >
>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>> ><asheinb...@confluent.io.invalid> wrote:
>> >
>> >> Hi Weiqing,
>> >>
>> >> From your doc, the entrypoint for UDF calls in the codegen is
>> >> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
>> >> could be instrumented with metrics. This works well for synchronous calls,
>> >> but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE
>> >> (https://github.com/apache/flink/pull/26567)? Timing metrics would only
>> >> account for what it takes to call invokeAsync, not for the result to
>> >> complete (with a result or error from the future object).
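>> >>
>> >> For example, a naive timer around the generated call site would only
>> >> capture the dispatch cost for an async function (an illustrative
>> >> fragment, not actual codegen output):
>> >>
>> >>   long start = System.nanoTime();
>> >>   udf.invokeAsync(args, resultFuture);            // returns immediately
>> >>   processTime.update(System.nanoTime() - start);  // misses the async completion
>> >>
>> >> whereas the completion or failure of resultFuture happens later, on a
>> >> different code path.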
>> >>
>> >> There are appropriate places which can handle the async callbacks, but
>> >> they are in other locations. Will you be able to support those as well?
>> >>
>> >> Thanks,
>> >> Alan
>> >>
>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com>
>> wrote:
>> >>
>> >> > I just have some questions:
>> >> >
>> >> > 1. The current metrics hierarchy shows that the UDF metric group belongs
>> >> > to the TaskMetricGroup. I think it would be better for the UDF metric
>> >> > group to belong to the OperatorMetricGroup instead, because a UDF might
>> >> > be used by multiple operators.
>> >> > 2. What are the naming conventions for UDF metrics? Could you provide
>> >> > an example? Does the metric name contain the UDF name?
>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
>> >> > exception, the job fails immediately. Why do we need to track this
>> >> > value?
>> >> >
>> >> > Best
>> >> > Shengkai
>> >> >
>> >> >
>> >> > Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025 at 12:59:
>> >> >
>> >> > > Hi all,
>> >> > >
>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>> >> > >
>> >> > > *Motivation*
>> >> > >
>> >> > > User-defined functions (UDFs) are essential for custom logic in Flink
>> >> > > jobs but often act as black boxes, making debugging and performance
>> >> > > tuning difficult. When issues like high latency or frequent exceptions
>> >> > > occur, it's hard to pinpoint the root cause inside UDFs.
>> >> > >
>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
>> >> > > per-record processing time or exception count. This limits
>> >> > > observability and complicates:
>> >> > >
>> >> > >    - Debugging production issues
>> >> > >    - Performance tuning and resource allocation
>> >> > >    - Supplying reliable signals to autoscaling systems
>> >> > >
>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>> >> > > observability and overall health.
>> >> > > Here’s the proposal document:
>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>> >> > >
>> >> > > Your feedback and ideas are welcome to refine this feature.
>> >> > >
>> >> > >
>> >> > > Thanks,
>> >> > > Weiqing
>> >> > >
>> >> >
>> >>
>>
>
