Hi Xuyang,

Thank you for reviewing the proposal!

I’m planning to use: *udf.metrics.process-time* and
*udf.metrics.exception-count*. These follow the naming convention used in
Flink (e.g., RocksDB native metrics
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
I’ve added these names to the proposal doc.

Alternatively, I also considered: *metrics.udf.process-time.enabled* and
*metrics.udf.exception-count.enabled*.
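
To make the intended opt-in behavior concrete, here is a rough sketch in
plain Java. It is not Flink code: the class name UdfMetricsSketch, the
Map-based configuration, and the assumption that both options are boolean
flags defaulting to false are all hypothetical and only for illustration.
Only the two key names come from the proposal.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: illustrates opt-in UDF metrics only.
final class UdfMetricsSketch {
    // Key names from this thread; treating them as booleans that default
    // to false is an assumption made for this sketch.
    static final String PROCESS_TIME = "udf.metrics.process-time";
    static final String EXCEPTION_COUNT = "udf.metrics.exception-count";

    final boolean timeEnabled;
    final boolean exceptionsEnabled;
    final AtomicLong totalNanos = new AtomicLong();
    final AtomicLong exceptionCount = new AtomicLong();

    UdfMetricsSketch(Map<String, String> conf) {
        timeEnabled = Boolean.parseBoolean(conf.getOrDefault(PROCESS_TIME, "false"));
        exceptionsEnabled = Boolean.parseBoolean(conf.getOrDefault(EXCEPTION_COUNT, "false"));
    }

    // Wraps one synchronous UDF invocation; each metric is updated only
    // when its flag is enabled, so disabled metrics add almost no overhead.
    <I, O> O call(Function<I, O> udf, I input) {
        long start = timeEnabled ? System.nanoTime() : 0L;
        try {
            return udf.apply(input);
        } catch (RuntimeException e) {
            if (exceptionsEnabled) {
                exceptionCount.incrementAndGet();
            }
            throw e; // the exception still propagates to the caller
        } finally {
            if (timeEnabled) {
                totalNanos.addAndGet(System.nanoTime() - start);
            }
        }
    }
}
```

With the alternative naming style, only the two key constants would change
(to *metrics.udf.process-time.enabled* and
*metrics.udf.exception-count.enabled*); the wrapping logic would stay the
same.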

Happy to hear any feedback on which style might be more appropriate.


Best,
Weiqing

On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:

> Hi, Weiqing.
>
> Thanks for driving this improvement. I just have one question: I notice
> that a new configuration option is introduced in this FLIP. What is its
> name? Could you please include the full name of this configuration option
> (similar to the other names in MetricOptions)?
>
>
>
>
> --
>
>     Best!
>     Xuyang
>
>
>
>
>
> At 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
> >Hi Alan,
> >
> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
> >work.
> >
> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
> >For async UDFs, the plan is to instrument both the invokeAsync() call and
> >the async callback handler to measure the full end-to-end latency until
> >the result or error is returned from the future.
> >
> >Let me know if you have any further questions or suggestions.
> >
> >Best,
> >Weiqing
> >
> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> ><asheinb...@confluent.io.invalid> wrote:
> >
> >> Hi Weiqing,
> >>
> >> From your doc, the entrypoint for UDF calls in the codegen is
> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen, which
> >> could be instrumented with metrics.  This works well for synchronous
> >> calls, but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE
> >> (https://github.com/apache/flink/pull/26567)?  Timing metrics would only
> >> account for what it takes to call invokeAsync, not for the result to
> >> complete (with a result or error from the future object).
> >>
> >> There are appropriate places which can handle the async callbacks, but
> >> they are in other locations.  Will you be able to support those as well?
> >>
> >> Thanks,
> >> Alan
> >>
> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
> >>
> >> > I just have some questions:
> >> >
> >> > 1. The current metrics hierarchy shows that the UDF metric group
> >> > belongs to the TaskMetricGroup. I think it would be better for the UDF
> >> > metric group to belong to the OperatorMetricGroup instead, because a
> >> > UDF might be used by multiple operators.
> >> > 2. What are the naming conventions for UDF metrics? Could you provide
> >> > an example? Does the metric name contain the UDF name?
> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> >> > exception, the job fails immediately. Why do we need to track this
> >> > value?
> >> > Best
> >> > Shengkai
> >> >
> >> >
> >> > Weiqing Yang <yangweiqing...@gmail.com> wrote on Wed, Jul 9, 2025 at 12:59:
> >> >
> >> > > Hi all,
> >> > >
> >> > > I’d like to initiate a discussion about adding UDF metrics.
> >> > >
> >> > > *Motivation*
> >> > >
> >> > > User-defined functions (UDFs) are essential for custom logic in
> >> > > Flink jobs but often act as black boxes, making debugging and
> >> > > performance tuning difficult. When issues like high latency or
> >> > > frequent exceptions occur, it's hard to pinpoint the root cause
> >> > > inside UDFs.
> >> > >
> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
> >> > > per-record processing time or exception count. This limits
> >> > > observability and complicates:
> >> > >
> >> > >    - Debugging production issues
> >> > >    - Performance tuning and resource allocation
> >> > >    - Supplying reliable signals to autoscaling systems
> >> > >
> >> > > Introducing standard, opt-in UDF metrics will improve platform
> >> > > observability and overall health.
> >> > > Here’s the proposal document:
> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> >> > >
> >> > > Your feedback and ideas are welcome to refine this feature.
> >> > >
> >> > >
> >> > > Thanks,
> >> > > Weiqing
> >> > >
> >> >
> >>
>
