Hi Shengkai, Alan, Xuyang, and all,

Since there have been no further objections, I’ll proceed to start the VOTE on this proposal shortly.
Thanks,
Weiqing

On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:

> Hi Shengkai, Alan and Xuyang,
>
> Just checking in - do you have any concerns or feedback?
>
> If there are no further objections from anyone, I’ll mark the FLIP as
> ready for voting.
>
> Best,
> Weiqing
>
> On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:
>
>> Hi Xuyang,
>>
>> Thank you for reviewing the proposal!
>>
>> I’m planning to use *udf.metrics.process-time* and
>> *udf.metrics.exception-count*. These follow the naming convention used
>> in Flink (e.g., RocksDB native metrics
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
>> I’ve added these names to the proposal doc.
>>
>> Alternatively, I also considered *metrics.udf.process-time.enabled* and
>> *metrics.udf.exception-count.enabled*.
>>
>> Happy to hear any feedback on which style might be more appropriate.
>>
>> Best,
>> Weiqing
>>
>> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
>>
>>> Hi, Weiqing.
>>>
>>> Thanks for driving this improvement. I just have one question: I notice
>>> a new configuration is introduced in this FLIP. Could you please include
>>> the full name of this configuration in the proposal (similar to the
>>> other names in MetricOptions)?
>>>
>>> --
>>>
>>> Best!
>>> Xuyang
>>>
>>> On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>>> >Hi Alan,
>>> >
>>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
>>> >work.
>>> >
>>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and
>>> >ASYNC_TABLE. For async UDFs, the plan is to instrument both the
>>> >invokeAsync() call and the async callback handler, measuring the full
>>> >end-to-end latency until the result or error is returned from the
>>> >future.
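To make the two option-key styles under discussion concrete, here is a minimal sketch. The class and constant names are illustrative only and not part of the FLIP; in Flink these keys would be defined as boolean `ConfigOption`s (e.g., alongside MetricOptions).

```java
// Illustrative sketch of the two naming styles discussed in the thread.
// The class name is hypothetical; only the key strings come from the thread.
final class UdfMetricOptionKeys {
    // Style 1: keys modeled on RocksDB native metrics, as proposed.
    static final String PROCESS_TIME = "udf.metrics.process-time";
    static final String EXCEPTION_COUNT = "udf.metrics.exception-count";

    // Style 2: the ".enabled"-suffixed alternative that was also considered.
    static final String PROCESS_TIME_ENABLED = "metrics.udf.process-time.enabled";
    static final String EXCEPTION_COUNT_ENABLED = "metrics.udf.exception-count.enabled";
}
```

Style 1 groups all UDF metric switches under one `udf.metrics.*` prefix, while style 2 follows the `*.enabled` toggle convention; either works, but mixing them would be confusing, so one style should be chosen for both options.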
>>> >
>>> >Let me know if you have any further questions or suggestions.
>>> >
>>> >Best,
>>> >Weiqing
>>> >
>>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>>> ><asheinb...@confluent.io.invalid> wrote:
>>> >
>>> >> Hi Weiqing,
>>> >>
>>> >> From your doc, the entry point for UDF calls in the codegen is
>>> >> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen,
>>> >> which could be instrumented with metrics. This works well for
>>> >> synchronous calls, but what about ASYNC_SCALAR and the
>>> >> soon-to-be-merged ASYNC_TABLE
>>> >> (https://github.com/apache/flink/pull/26567)? Timing metrics would
>>> >> only account for the time it takes to call invokeAsync, not for the
>>> >> call to complete (with a result or error from the future object).
>>> >>
>>> >> There are appropriate places which can handle the async callbacks,
>>> >> but they are in other locations. Will you be able to support those as
>>> >> well?
>>> >>
>>> >> Thanks,
>>> >> Alan
>>> >>
>>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
>>> >>
>>> >> > I just have some questions:
>>> >> >
>>> >> > 1. The current metrics hierarchy shows that the UDF metric group
>>> >> > belongs to the TaskMetricGroup. I think it would be better for the
>>> >> > UDF metric group to belong to the OperatorMetricGroup instead,
>>> >> > because a UDF might be used by multiple operators.
>>> >> > 2. What are the naming conventions for UDF metrics? Could you
>>> >> > provide an example? Does the metric name contain the UDF name?
>>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws
>>> >> > an exception, the job fails immediately. Why do we need to track
>>> >> > this value?
>>> >> >
>>> >> > Best,
>>> >> > Shengkai
>>> >> >
>>> >> > On Wed, Jul 9, 2025 at 12:59 Weiqing Yang <yangweiqing...@gmail.com> wrote:
>>> >> >
>>> >> > > Hi all,
>>> >> > >
>>> >> > > I’d like to initiate a discussion about adding UDF metrics.
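The end-to-end async timing described above (starting the clock at invokeAsync() and stopping it when the future completes with a result or error, per Alan's point) can be sketched in plain Java. The class is hypothetical; the AtomicLong fields stand in for the Flink Histogram/Counter metrics that an actual implementation would register.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: time an async UDF call from submission until the
// returned future completes, instead of timing only the invokeAsync() call.
// AtomicLong fields stand in for real Flink metrics.
final class AsyncUdfTimer {
    final AtomicLong lastLatencyNanos = new AtomicLong();
    final AtomicLong exceptionCount = new AtomicLong();

    <T> CompletableFuture<T> timed(CompletableFuture<T> future) {
        final long start = System.nanoTime();
        // whenComplete fires on success or failure, so the recorded latency
        // covers the full end-to-end path and failures bump the exception count.
        return future.whenComplete((result, error) -> {
            lastLatencyNanos.set(System.nanoTime() - start);
            if (error != null) {
                exceptionCount.incrementAndGet();
            }
        });
    }
}
```

The wrapper would sit where the async callbacks are already handled (outside BridgingSqlFunctionCallGen itself), which is why the synchronous instrumentation point alone is not sufficient.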
>>> >> > >
>>> >> > > *Motivation*
>>> >> > >
>>> >> > > User-defined functions (UDFs) are essential for custom logic in
>>> >> > > Flink jobs but often act as black boxes, making debugging and
>>> >> > > performance tuning difficult. When issues like high latency or
>>> >> > > frequent exceptions occur, it's hard to pinpoint the root cause
>>> >> > > inside UDFs.
>>> >> > >
>>> >> > > Flink currently lacks built-in metrics for key UDF aspects such
>>> >> > > as per-record processing time or exception count. This limits
>>> >> > > observability and complicates:
>>> >> > >
>>> >> > > - Debugging production issues
>>> >> > > - Performance tuning and resource allocation
>>> >> > > - Supplying reliable signals to autoscaling systems
>>> >> > >
>>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>>> >> > > observability and overall health.
>>> >> > >
>>> >> > > Here’s the proposal document:
>>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
>>> >> > >
>>> >> > > Your feedback and ideas are welcome to refine this feature.
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Weiqing
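One way to make Shengkai's first two questions concrete (scoping UDF metrics under the operator rather than the task, and including the UDF name in the metric identifier) is sketched below. The helper and the resulting identifier layout are assumptions for discussion, not something the FLIP has settled; in Flink the scope would be built with MetricGroup.addGroup(...) on the OperatorMetricGroup, while here a plain string join illustrates the resulting identifier.

```java
// Hypothetical sketch of a per-UDF metric identifier scoped under an
// operator, so the same UDF used by multiple operators gets separate
// metrics. In Flink this would correspond to something like
// operatorMetricGroup.addGroup("udf").addGroup(udfName).counter(metric).
final class UdfMetricScope {
    static String metricIdentifier(String operatorScope, String udfName, String metric) {
        return String.join(".", operatorScope, "udf", udfName, metric);
    }
}
```

For example, a `processTime` metric for a UDF `my_udf` inside an operator scoped as `MyCalc` would be reported as `MyCalc.udf.my_udf.processTime`, answering both where the group lives and whether the UDF name is part of the metric name.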