Hi Shengkai, Alan, Xuyang, and all,

Since there have been no further objections, I’ll start the VOTE on this
proposal shortly.

Thanks,
Weiqing

On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com>
wrote:

> Hi Shengkai, Alan and Xuyang,
>
> Just checking in - do you have any concerns or feedback?
>
> If there are no further objections from anyone, I’ll mark the FLIP as
> ready for voting.
>
>
> Best,
> Weiqing
>
>
> On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> Hi Xuyang,
>>
>> Thank you for reviewing the proposal!
>>
>> I’m planning to use: *udf.metrics.process-time* and
>> *udf.metrics.exception-count*. These follow the naming convention used
>> in Flink (e.g., RocksDB native metrics
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
>> I’ve added these names to the proposal doc.
>>
>> Alternatively, I also considered *metrics.udf.process-time.enabled* and
>> *metrics.udf.exception-count.enabled*.
>>
>> Happy to hear any feedback on which style might be more appropriate.
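
For concreteness, here is a hypothetical sketch of how the first naming style might look in a Flink configuration file. The option names are taken from this thread; whether they are boolean on/off toggles, and what their defaults would be, is not settled here and is purely an assumption:

```yaml
# Hypothetical flink-conf.yaml fragment using the first naming style.
# Value type (assumed boolean switches) and defaults are not final.
udf.metrics.process-time: true
udf.metrics.exception-count: true
```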
>>
>>
>> Best,
>> Weiqing
>>
>> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
>>
>>> Hi, Weiqing.
>>>
>>> Thanks for driving this improvement. I just have one question: I noticed
>>> that a new configuration option is introduced in this FLIP, but I don't
>>> see its name. Could you please include the full name of this configuration
>>> option (similar to the other names in MetricOptions)?
>>>
>>>
>>>
>>>
>>> --
>>>
>>>     Best!
>>>     Xuyang
>>>
>>>
>>>
>>>
>>>
>>> On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>>> >Hi Alan,
>>> >
>>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
>>> work.
>>> >
>>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and
>>> ASYNC_TABLE.
>>> >For async UDFs, the plan is to instrument both the invokeAsync() call
>>> and
>>> >the async callback handler to measure the full end-to-end latency until
>>> the
>>> >result or error is returned from the future.
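
The end-to-end timing approach described above could be sketched as follows. This is a minimal illustration, not the actual Flink instrumentation: the class and counter names are hypothetical stand-ins, and the real implementation would report through the operator's metric group rather than static fields.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncUdfTiming {
    // Hypothetical stand-ins for the proposed UDF metric counters.
    public static final AtomicLong processTimeNanos = new AtomicLong();
    public static final AtomicLong exceptionCount = new AtomicLong();

    // Wrap the future returned by an async UDF so the timer covers the
    // full lifetime of the call, not just the invokeAsync() submission.
    public static <T> CompletableFuture<T> timed(CompletableFuture<T> future) {
        final long start = System.nanoTime();
        return future.whenComplete((result, error) -> {
            processTimeNanos.addAndGet(System.nanoTime() - start);
            if (error != null) {
                exceptionCount.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        // Simulate an async UDF call that completes successfully; join()
        // ensures the whenComplete callback has run before we read counters.
        timed(CompletableFuture.supplyAsync(() -> 42)).join();
        System.out.println("recorded nanos > 0: " + (processTimeNanos.get() > 0));
        System.out.println("exceptions: " + exceptionCount.get());
    }
}
```

Because `whenComplete` fires on both normal and exceptional completion, a single wrapper captures latency and exception counts regardless of how the future resolves.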
>>> >
>>> >Let me know if you have any further questions or suggestions.
>>> >
>>> >Best,
>>> >Weiqing
>>> >
>>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>>> ><asheinb...@confluent.io.invalid> wrote:
>>> >
>>> >> Hi Weiqing,
>>> >>
>>> >> From your doc, the entrypoint for UDF calls in the codegen is
>>> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen,
>>> which
>>> >> could be instrumented with metrics.  This works well for synchronous
>>> calls,
>>> >> but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE (
>>> >> https://github.com/apache/flink/pull/26567)?  Timing metrics would
>>> only
>>> >> account for what it takes to call invokeAsync, not for the result to
>>> >> complete (with a result or error from the future object).
>>> >>
>>> >> There are appropriate places which can handle the async callbacks,
>>> but they
>>> >> are in other locations.  Will you be able to support those as well?
>>> >>
>>> >> Thanks,
>>> >> Alan
>>> >>
>>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com>
>>> wrote:
>>> >>
>>> >> > I just have some questions:
>>> >> >
>>> >> > 1. The current metrics hierarchy shows that the UDF metric group
>>> belongs
>>> >> to
>>> >> > the TaskMetricGroup. I think it would be better for the UDF metric
>>> group
>>> >> to
>>> >> > belong to the OperatorMetricGroup instead, because a UDF might be
>>> used by
>>> >> > multiple operators.
>>> >> > 2. What are the naming conventions for UDF metrics? Could you
>>> provide an
>>> >> > example? Does the metric name contain the UDF name?
>>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws
>>> an
>>> >> > exception, the job fails immediately. Why do we need to track this
>>> value?
>>> >> >
>>> >> > Best
>>> >> > Shengkai
>>> >> >
>>> >> >
>>> >> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>>> >> >
>>> >> > > Hi all,
>>> >> > >
>>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>>> >> > >
>>> >> > > *Motivation*
>>> >> > >
>>> >> > > User-defined functions (UDFs) are essential for custom logic in
>>> Flink
>>> >> > jobs
>>> >> > > but often act as black boxes, making debugging and performance
>>> tuning
>>> >> > > difficult. When issues like high latency or frequent exceptions
>>> occur,
>>> >> > it's
>>> >> > > hard to pinpoint the root cause inside UDFs.
>>> >> > >
>>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
>>> >> > > per-record processing time or exception count. This limits
>>> >> observability
>>> >> > > and complicates:
>>> >> > >
>>> >> > >    - Debugging production issues
>>> >> > >    - Performance tuning and resource allocation
>>> >> > >    - Supplying reliable signals to autoscaling systems
>>> >> > >
>>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>>> >> > > observability and overall health.
>>> >> > > Here’s the proposal document:
>>> >> > > https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
>>> >> > >
>>> >> > > Your feedback and ideas are welcome to refine this feature.
>>> >> > >
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Weiqing
>>> >> > >
>>> >> >
>>> >>
>>>
>>
