I don't have any more comments.

Best,
Shengkai

On Thu, Aug 14, 2025 at 2:47 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:

> Thanks, Shengkai. I’ve updated the proposal doc with the recommended
> configuration name. Please let me know if you have any additional feedback.
>
> Best,
> Weiqing
>
> On Wed, Aug 13, 2025 at 6:58 PM Shengkai Fang <fskm...@gmail.com> wrote:
>
> > Sorry for the late response. I prefer to use
> > `table.exec.udf-metric-enabled` as the option name.
> >
> > Best,
> > Shengkai
> >
> > On Wed, Aug 13, 2025 at 11:54 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:
> >
> > > Hi Shengkai, Alan, Xuyang, and all,
> > >
> > > Since there have been no further objections, I’ll proceed to start the VOTE
> > > on this proposal shortly.
> > >
> > > Thanks,
> > > Weiqing
> > >
> > > On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:
> > >
> > > > Hi Shengkai, Alan and Xuyang,
> > > >
> > > > Just checking in - do you have any concerns or feedback?
> > > >
> > > > If there are no further objections from anyone, I’ll mark the FLIP as
> > > > ready for voting.
> > > >
> > > >
> > > > Best,
> > > > Weiqing
> > > >
> > > >
> > > > On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:
> > > >
> > > >> Hi Xuyang,
> > > >>
> > > >> Thank you for reviewing the proposal!
> > > >>
> > > > >> I’m planning to use: *udf.metrics.process-time* and
> > > > >> *udf.metrics.exception-count*. These follow the naming convention used
> > > > >> in Flink (e.g., RocksDB native metrics
> > > > >> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
> > > > >> I’ve added these names to the proposal doc.
> > > >>
> > > > >> Alternatively, I also considered: *metrics.udf.process-time.enabled* and
> > > > >> *metrics.udf.exception-count.enabled*.
> > > >>
> > > >> Happy to hear any feedback on which style might be more appropriate.
> > > >>
> > > >>
> > > >> Best,
> > > >> Weiqing
> > > >>
> > > >> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
> > > >>
> > > >>> Hi, Weiqing.
> > > >>>
> > > > >>> Thanks for driving this improvement. I have just one question: I noticed
> > > > >>> that a new configuration option is introduced in this FLIP, and I wonder
> > > > >>> what its name is. Could you please include the full name of this option
> > > > >>> (similar to the other names in MetricOptions)?
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>
> > > >>>     Best!
> > > >>>     Xuyang
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > > >>> At 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
> > > >>> >Hi Alan,
> > > >>> >
> > > > >>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
> > > > >>> >work.
> > > > >>> >
> > > > >>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and ASYNC_TABLE.
> > > > >>> >For async UDFs, the plan is to instrument both the invokeAsync() call and
> > > > >>> >the async callback handler to measure the full end-to-end latency until the
> > > > >>> >result or error is returned from the future.
> > > >>> >
> > > >>> >Let me know if you have any further questions or suggestions.
> > > >>> >
> > > >>> >Best,
> > > >>> >Weiqing
> > > >>> >
> > > >>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
> > > >>> ><asheinb...@confluent.io.invalid> wrote:
> > > >>> >
> > > >>> >> Hi Weiqing,
> > > >>> >>
> > > > >>> >> From your doc, the entrypoint for UDF calls in the codegen is
> > > > >>> >> ExprCodeGenerator, which should invoke BridgingSqlFunctionCallGen, which
> > > > >>> >> could be instrumented with metrics. This works well for synchronous calls,
> > > > >>> >> but what about ASYNC_SCALAR and the soon-to-be-merged ASYNC_TABLE
> > > > >>> >> (https://github.com/apache/flink/pull/26567)? Timing metrics would only
> > > > >>> >> account for the time it takes to call invokeAsync, not for the result to
> > > > >>> >> complete (with a result or error from the future object).
> > > >>> >>
> > > > >>> >> There are appropriate places which can handle the async callbacks, but they
> > > > >>> >> are in other locations. Will you be able to support those as well?
> > > >>> >>
> > > >>> >> Thanks,
> > > >>> >> Alan
> > > >>> >>
> > > > >>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com> wrote:
> > > >>> >>
> > > >>> >> > I just have some questions:
> > > >>> >> >
> > > > >>> >> > 1. The current metrics hierarchy shows that the UDF metric group belongs to
> > > > >>> >> > the TaskMetricGroup. I think it would be better for the UDF metric group to
> > > > >>> >> > belong to the OperatorMetricGroup instead, because a UDF might be used by
> > > > >>> >> > multiple operators.
> > > > >>> >> > 2. What are the naming conventions for UDF metrics? Could you provide an
> > > > >>> >> > example? Does the metric name contain the UDF name?
> > > > >>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws an
> > > > >>> >> > exception, the job fails immediately. Why do we need to track this value?
> > > >>> >> >
> > > >>> >> > Best
> > > >>> >> > Shengkai
> > > >>> >> >
> > > >>> >> >
> > > > >>> >> > On Wed, Jul 9, 2025 at 12:59 PM Weiqing Yang <yangweiqing...@gmail.com> wrote:
> > > >>> >> >
> > > >>> >> > > Hi all,
> > > >>> >> > >
> > > >>> >> > > I’d like to initiate a discussion about adding UDF metrics.
> > > >>> >> > >
> > > >>> >> > > *Motivation*
> > > >>> >> > >
> > > > >>> >> > > User-defined functions (UDFs) are essential for custom logic in Flink jobs
> > > > >>> >> > > but often act as black boxes, making debugging and performance tuning
> > > > >>> >> > > difficult. When issues like high latency or frequent exceptions occur, it's
> > > > >>> >> > > hard to pinpoint the root cause inside UDFs.
> > > > >>> >> > >
> > > > >>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
> > > > >>> >> > > per-record processing time or exception count. This limits observability
> > > > >>> >> > > and complicates:
> > > >>> >> > >
> > > >>> >> > >    - Debugging production issues
> > > >>> >> > >    - Performance tuning and resource allocation
> > > >>> >> > >    - Supplying reliable signals to autoscaling systems
> > > >>> >> > >
> > > > >>> >> > > Introducing standard, opt-in UDF metrics will improve platform
> > > > >>> >> > > observability and overall health.
> > > > >>> >> > > Here’s the proposal document: Link
> > > > >>> >> > > <https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1>
> > > >>> >> > >
> > > >>> >> > > Your feedback and ideas are welcome to refine this feature.
> > > >>> >> > >
> > > >>> >> > >
> > > >>> >> > > Thanks,
> > > >>> >> > > Weiqing
> > > >>> >> > >
> > > >>> >> >
> > > >>> >>
> > > >>>
> > > >>
> > >
> >
>
