Hi Xuyang,

Thank you for reviewing the proposal!

I’m planning to use *udf.metrics.process-time* and
*udf.metrics.exception-count*. These follow the naming convention used in
Flink (e.g., RocksDB native metrics
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
I’ve added these names to the proposal doc.

Alternatively, I also considered *metrics.udf.process-time.enabled* and
*metrics.udf.exception-count.enabled*. Happy to hear any feedback on which
style might be more appropriate.

Best,
Weiqing

On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:

> Hi, Weiqing.
>
> Thanks for driving this improvement. I just have one question. I notice a
> new configuration is introduced in this FLIP, and I wonder what its name
> is. Could you please include the full name of this configuration (similar
> to the other names in MetricOptions)?
>
> --
>
> Best!
> Xuyang
>
> At 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
> >Hi Alan,
> >
> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
> >work.
> >
> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
> >For async UDFs, the plan is to instrument both the invokeAsync() call and
> >the async callback handler to measure the full end-to-end latency until
> >the result or error is returned from the future.
> >
> >Let me know if you have any further questions or suggestions.
> >
> >Best,
> >Weiqing
> >
> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> ><asheinb...@confluent.io.invalid> wrote:
> >
> >> Hi Weiqing,
> >>
> >> From your doc, the entrypoint for UDF calls in the codegen is
> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen, which
> >> could be instrumented with metrics. This works well for synchronous
> >> calls, but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE
> >> (https://github.com/apache/flink/pull/26567)?
> >> Timing metrics would only account for what it takes to call
> >> invokeAsync, not for the result to complete (with a result or error
> >> from the future object).
> >>
> >> There are appropriate places which can handle the async callbacks, but
> >> they are in other locations. Will you be able to support those as well?
> >>
> >> Thanks,
> >> Alan
> >>
> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
> >>
> >> > I just have some questions:
> >> >
> >> > 1. The current metrics hierarchy shows that the UDF metric group
> >> > belongs to the TaskMetricGroup. I think it would be better for the UDF
> >> > metric group to belong to the OperatorMetricGroup instead, because a
> >> > UDF might be used by multiple operators.
> >> > 2. What are the naming conventions for UDF metrics? Could you provide
> >> > an example? Does the metric name contain the UDF name?
> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> >> > exception, the job fails immediately. Why do we need to track this
> >> > value?
> >> >
> >> > Best
> >> > Shengkai
> >> >
> >> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > I’d like to initiate a discussion about adding UDF metrics.
> >> > >
> >> > > *Motivation*
> >> > >
> >> > > User-defined functions (UDFs) are essential for custom logic in Flink
> >> > > jobs but often act as black boxes, making debugging and performance
> >> > > tuning difficult. When issues like high latency or frequent
> >> > > exceptions occur, it's hard to pinpoint the root cause inside UDFs.
> >> > >
> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
> >> > > per-record processing time or exception count.
> >> > > This limits observability and complicates:
> >> > >
> >> > > - Debugging production issues
> >> > > - Performance tuning and resource allocation
> >> > > - Supplying reliable signals to autoscaling systems
> >> > >
> >> > > Introducing standard, opt-in UDF metrics will improve platform
> >> > > observability and overall health.
> >> > > Here’s the proposal document: Link
> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> >> > >
> >> > > Your feedback and ideas are welcome to refine this feature.
> >> > >
> >> > > Thanks,
> >> > > Weiqing
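
For reference, the async point raised in this thread (timing until the
future completes rather than until invokeAsync() returns) could be sketched
roughly as follows. This is a minimal illustration that uses only
java.util.concurrent and assumes nothing about Flink's codegen. The class
AsyncUdfTimingSketch, its timed() helper, and the AtomicLong fields are
hypothetical stand-ins for whatever metric group the proposal ends up
using; they are not Flink APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Flink or the proposal.
final class AsyncUdfTimingSketch {
    // Plain counters standing in for real metric objects (assumption).
    final AtomicLong completedCalls = new AtomicLong();
    final AtomicLong exceptionCount = new AtomicLong();
    final AtomicLong totalNanos = new AtomicLong();

    /**
     * Invokes an async function and records elapsed time when the returned
     * future completes (with a result or an error), not when the call returns.
     */
    <I, O> CompletableFuture<O> timed(Function<I, CompletableFuture<O>> asyncCall, I input) {
        long start = System.nanoTime();
        CompletableFuture<O> future;
        try {
            future = asyncCall.apply(input);
        } catch (RuntimeException e) {
            // The call itself threw before returning a future; treat it as a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(e);
        }
        // The callback runs once the underlying future is done, so the
        // measurement covers the full end-to-end latency.
        return future.whenComplete((result, error) -> {
            totalNanos.addAndGet(System.nanoTime() - start);
            completedCalls.incrementAndGet();
            if (error != null) {
                exceptionCount.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        AsyncUdfTimingSketch metrics = new AsyncUdfTimingSketch();
        String out = metrics.<String, String>timed(
                s -> CompletableFuture.supplyAsync(() -> s.toUpperCase()), "flink").join();
        System.out.println(out + " calls=" + metrics.completedCalls.get()
                + " exceptions=" + metrics.exceptionCount.get());
    }
}
```

Wrapping at the future level keeps the synchronous and asynchronous failure
paths in one place: an exception thrown by the call and an exceptionally
completed future both increment the same counter.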