Thanks, Shengkai. I’ve updated the proposal doc with the recommended configuration name. Please let me know if you have any additional feedback.
Best,
Weiqing

On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <fskm...@gmail.com> wrote:

> Sorry for the late response. I prefer to use
> `table.exec.udf-metric-enabled` as the option name.
>
> Best,
> Shengkai
>
> Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Aug 13, 2025 at 23:54:
>
> > Hi Shengkai, Alan, Xuyang, and all,
> >
> > Since there have been no further objections, I’ll proceed to start the
> > VOTE on this proposal shortly.
> >
> > Thanks,
> > Weiqing
> >
> > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com>
> > wrote:
> >
> > > Hi Shengkai, Alan and Xuyang,
> > >
> > > Just checking in - do you have any concerns or feedback?
> > >
> > > If there are no further objections from anyone, I’ll mark the FLIP as
> > > ready for voting.
> > >
> > > Best,
> > > Weiqing
> > >
> > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com>
> > > wrote:
> > >
> > >> Hi Xuyang,
> > >>
> > >> Thank you for reviewing the proposal!
> > >>
> > >> I’m planning to use: *udf.metrics.process-time* and
> > >> *udf.metrics.exception-count*. These follow the naming convention used
> > >> in Flink (e.g., RocksDB native metrics
> > >> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> > >> I’ve added these names to the proposal doc.
> > >>
> > >> Alternatively, I also considered: *metrics.udf.process-time.enabled*
> > >> and *metrics.udf.exception-count.enabled*.
> > >>
> > >> Happy to hear any feedback on which style might be more appropriate.
> > >>
> > >> Best,
> > >> Weiqing
> > >>
> > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
> > >>
> > >>> Hi, Weiqing.
> > >>>
> > >>> Thanks for driving this improvement. I have just one question: I
> > >>> notice that a new configuration option is introduced in this FLIP,
> > >>> and I wonder what its name is. Could you please include the full
> > >>> name of this configuration (similar to the other names in
> > >>> MetricOptions)?
> > >>>
> > >>> --
> > >>>
> > >>> Best!
> > >>> Xuyang
> > >>>
> > >>> At 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
> > >>> >Hi Alan,
> > >>> >
> > >>> >Thanks for reviewing the proposal and for highlighting the
> > >>> >ASYNC_TABLE work.
> > >>> >
> > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and
> > >>> >ASYNC_TABLE. For async UDFs, the plan is to instrument both the
> > >>> >invokeAsync() call and the async callback handler to measure the
> > >>> >full end-to-end latency until the result or error is returned from
> > >>> >the future.
> > >>> >
> > >>> >Let me know if you have any further questions or suggestions.
> > >>> >
> > >>> >Best,
> > >>> >Weiqing
> > >>> >
> > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> > >>> ><asheinb...@confluent.io.invalid> wrote:
> > >>> >
> > >>> >> Hi Weiqing,
> > >>> >>
> > >>> >> From your doc, the entrypoint for UDF calls in the codegen is
> > >>> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen,
> > >>> >> which could be instrumented with metrics. This works well for
> > >>> >> synchronous calls, but what about ASYNC_SCALAR and the soon to be
> > >>> >> merged ASYNC_TABLE (https://github.com/apache/flink/pull/26567)?
> > >>> >> Timing metrics would only account for what it takes to call
> > >>> >> invokeAsync, not for the result to complete (with a result or
> > >>> >> error from the future object).
> > >>> >>
> > >>> >> There are appropriate places which can handle the async callbacks,
> > >>> >> but they are in other locations. Will you be able to support those
> > >>> >> as well?
> > >>> >>
> > >>> >> Thanks,
> > >>> >> Alan
> > >>> >>
> > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com>
> > >>> >> wrote:
> > >>> >>
> > >>> >> > I just have some questions:
> > >>> >> >
> > >>> >> > 1. The current metrics hierarchy shows that the UDF metric group
> > >>> >> > belongs to the TaskMetricGroup. I think it would be better for
> > >>> >> > the UDF metric group to belong to the OperatorMetricGroup
> > >>> >> > instead, because a UDF might be used by multiple operators.
> > >>> >> > 2. What are the naming conventions for UDF metrics? Could you
> > >>> >> > provide an example? Does the metric name contain the UDF name?
> > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF
> > >>> >> > throws an exception, the job fails immediately. Why do we need
> > >>> >> > to track this value?
> > >>> >> >
> > >>> >> > Best
> > >>> >> > Shengkai
> > >>> >> >
> > >>> >> > Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025 at 12:59:
> > >>> >> >
> > >>> >> > > Hi all,
> > >>> >> > >
> > >>> >> > > I’d like to initiate a discussion about adding UDF metrics.
> > >>> >> > >
> > >>> >> > > *Motivation*
> > >>> >> > >
> > >>> >> > > User-defined functions (UDFs) are essential for custom logic
> > >>> >> > > in Flink jobs but often act as black boxes, making debugging
> > >>> >> > > and performance tuning difficult. When issues like high
> > >>> >> > > latency or frequent exceptions occur, it's hard to pinpoint
> > >>> >> > > the root cause inside UDFs.
> > >>> >> > >
> > >>> >> > > Flink currently lacks built-in metrics for key UDF aspects
> > >>> >> > > such as per-record processing time or exception count. This
> > >>> >> > > limits observability and complicates:
> > >>> >> > >
> > >>> >> > > - Debugging production issues
> > >>> >> > > - Performance tuning and resource allocation
> > >>> >> > > - Supplying reliable signals to autoscaling systems
> > >>> >> > >
> > >>> >> > > Introducing standard, opt-in UDF metrics will improve platform
> > >>> >> > > observability and overall health.
> > >>> >> > >
> > >>> >> > > Here’s the proposal document: Link
> > >>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> > >>> >> > >
> > >>> >> > > Your feedback and ideas are welcome to refine this feature.
> > >>> >> > >
> > >>> >> > > Thanks,
> > >>> >> > > Weiqing
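
For readers following the async question in this thread: the idea of timing both the invokeAsync() call and the completion callback can be illustrated with a small sketch in plain Java. This is only an illustration under stated assumptions, not the FLIP's implementation. `AsyncUdfTimingSketch`, `UdfMetrics`, and `timedInvoke` are hypothetical names, the async UDF is modeled as a plain `Function` returning a `CompletableFuture`, and no Flink classes or metric APIs are used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from Flink or the proposal doc. */
final class AsyncUdfTimingSketch {

    /** Stand-in for a metric group: plain counters instead of real Flink metrics. */
    static final class UdfMetrics {
        final AtomicLong completedCalls = new AtomicLong();
        final AtomicLong totalNanos = new AtomicLong();
        final AtomicLong exceptionCount = new AtomicLong();
    }

    /**
     * Invokes the async function and records the elapsed time from just before
     * the call until the returned future completes, with a result or an error.
     */
    static <I, O> CompletableFuture<O> timedInvoke(
            Function<I, CompletableFuture<O>> asyncUdf, I input, UdfMetrics metrics) {
        final long start = System.nanoTime();
        final CompletableFuture<O> future;
        try {
            future = asyncUdf.apply(input);
        } catch (RuntimeException e) {
            // The call itself failed before any future was returned.
            record(metrics, start, e);
            throw e;
        }
        // The callback runs on completion, so the measured span covers the
        // whole asynchronous operation, not just the cost of issuing the call.
        return future.whenComplete((result, error) -> record(metrics, start, error));
    }

    private static void record(UdfMetrics metrics, long startNanos, Throwable error) {
        metrics.totalNanos.addAndGet(System.nanoTime() - startNanos);
        metrics.completedCalls.incrementAndGet();
        if (error != null) {
            metrics.exceptionCount.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        UdfMetrics metrics = new UdfMetrics();
        Function<String, CompletableFuture<String>> upper =
                s -> CompletableFuture.supplyAsync(() -> s.toUpperCase());
        System.out.println(timedInvoke(upper, "flink", metrics).join());
        System.out.println("calls=" + metrics.completedCalls.get()
                + " exceptions=" + metrics.exceptionCount.get());
    }
}
```

Returning the future produced by `whenComplete` (rather than the original one) means a caller that waits on the result also observes the updated counters. How the real change hooks into the generated code and the async callback handlers is described in the proposal doc, not here.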