Hi Shengkai, Alan, and Xuyang,

Just checking in - do you have any concerns or feedback?
If there are no further objections from anyone, I’ll mark the FLIP as
ready for voting.

Best,
Weiqing

On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:

> Hi Xuyang,
>
> Thank you for reviewing the proposal!
>
> I’m planning to use *udf.metrics.process-time* and
> *udf.metrics.exception-count*. These follow the naming convention used
> in Flink (e.g., RocksDB native metrics
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> I’ve added these names to the proposal doc.
>
> Alternatively, I also considered *metrics.udf.process-time.enabled* and
> *metrics.udf.exception-count.enabled*.
>
> Happy to hear any feedback on which style might be more appropriate.
>
> Best,
> Weiqing
>
> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
>
>> Hi, Weiqing.
>>
>> Thanks for driving this improvement. I have just one question: I
>> noticed that a new configuration option is introduced in this FLIP,
>> and I wonder what its name is. Could you please include the option's
>> full name (similar to the other names in MetricOptions)?
>>
>> Best!
>> Xuyang
>>
>> On 2025-07-13 12:03:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>>
>> >Hi Alan,
>> >
>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
>> >work.
>> >
>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and
>> >ASYNC_TABLE. For async UDFs, the plan is to instrument both the
>> >invokeAsync() call and the async callback handler to measure the full
>> >end-to-end latency until the result or error is returned from the
>> >future.
>> >
>> >Let me know if you have any further questions or suggestions.
>> >Best,
>> >Weiqing
>> >
>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>> ><asheinb...@confluent.io.invalid> wrote:
>> >
>> >> Hi Weiqing,
>> >>
>> >> From your doc, the entrypoint for UDF calls in the codegen is
>> >> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen,
>> >> which could be instrumented with metrics. This works well for
>> >> synchronous calls, but what about ASYNC_SCALAR and the
>> >> soon-to-be-merged ASYNC_TABLE
>> >> (https://github.com/apache/flink/pull/26567)? Timing metrics would
>> >> only account for the time it takes to call invokeAsync(), not for
>> >> the result to complete (with a result or an error from the future
>> >> object).
>> >>
>> >> There are appropriate places which can handle the async callbacks,
>> >> but they are in other locations. Will you be able to support those
>> >> as well?
>> >>
>> >> Thanks,
>> >> Alan
>> >>
>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com>
>> >> wrote:
>> >>
>> >> > I just have some questions:
>> >> >
>> >> > 1. The current metrics hierarchy shows that the UDF metric group
>> >> > belongs to the TaskMetricGroup. I think it would be better for the
>> >> > UDF metric group to belong to the OperatorMetricGroup instead,
>> >> > because a UDF might be used by multiple operators.
>> >> > 2. What are the naming conventions for UDF metrics? Could you
>> >> > provide an example? Do the metric names contain the UDF name?
>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws
>> >> > an exception, the job fails immediately. Why do we need to track
>> >> > this value?
>> >> >
>> >> > Best,
>> >> > Shengkai
>> >> >
>> >> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang
>> >> > <yangweiqing...@gmail.com> wrote:
>> >> >
>> >> > > Hi all,
>> >> > >
>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>> >> > > *Motivation*
>> >> > >
>> >> > > User-defined functions (UDFs) are essential for custom logic in
>> >> > > Flink jobs but often act as black boxes, making debugging and
>> >> > > performance tuning difficult. When issues like high latency or
>> >> > > frequent exceptions occur, it's hard to pinpoint the root cause
>> >> > > inside UDFs.
>> >> > >
>> >> > > Flink currently lacks built-in metrics for key UDF aspects such
>> >> > > as per-record processing time or exception count. This limits
>> >> > > observability and complicates:
>> >> > >
>> >> > > - Debugging production issues
>> >> > > - Performance tuning and resource allocation
>> >> > > - Supplying reliable signals to autoscaling systems
>> >> > >
>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>> >> > > observability and overall health.
>> >> > >
>> >> > > Here’s the proposal document: Link
>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>> >> > >
>> >> > > Your feedback and ideas are welcome to refine this feature.
>> >> > >
>> >> > > Thanks,
>> >> > > Weiqing
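For anyone following along, the two measurement strategies discussed in this thread (wrapping a synchronous eval() call vs. stopping the clock in the future's completion callback so an async UDF's metric covers the full end-to-end latency, not just the invokeAsync() call) can be sketched with plain JDK types. This is only an illustration, not the FLIP's implementation: the `UdfMetrics` holder and the `timedEval`/`timedInvokeAsync` helpers are hypothetical stand-ins for whatever Flink's codegen and metric groups actually provide.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical stand-in for a Flink metric group; fields mirror the proposed metrics. */
class UdfMetrics {
    final AtomicLong processTimeNanos = new AtomicLong(); // would back udf.metrics.process-time
    final AtomicLong exceptionCount = new AtomicLong();   // would back udf.metrics.exception-count
}

class UdfTimingSketch {

    /** Synchronous case: time the eval() call itself and count any thrown exception. */
    static <T> T timedEval(UdfMetrics m, Supplier<T> eval) {
        long start = System.nanoTime();
        try {
            return eval.get();
        } catch (RuntimeException e) {
            m.exceptionCount.incrementAndGet();
            throw e;
        } finally {
            m.processTimeNanos.addAndGet(System.nanoTime() - start);
        }
    }

    /**
     * Async case: start the clock before invokeAsync() and stop it in the
     * future's completion callback, so the metric covers the full latency
     * until a result or error arrives, not just the invokeAsync() call itself.
     */
    static <T> CompletableFuture<T> timedInvokeAsync(
            UdfMetrics m, Supplier<CompletableFuture<T>> invokeAsync) {
        long start = System.nanoTime();
        return invokeAsync.get().whenComplete((result, error) -> {
            m.processTimeNanos.addAndGet(System.nanoTime() - start);
            if (error != null) {
                m.exceptionCount.incrementAndGet();
            }
        });
    }
}
```

One design note: whenComplete runs on whichever thread completes the future, so the metric update must either be thread-safe (as the AtomicLong counters above are) or be routed back onto the operator's thread.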