Hi Shengkai, Alan, Xuyang, and all,

Since there have been no further objections, I’ll start the VOTE on this
proposal shortly.

Thanks,
Weiqing

On Thu, Jul 31, 2025 at 10:26 PM Weiqing Yang <yangweiqing...@gmail.com>
wrote:

> Hi Shengkai, Alan and Xuyang,
>
> Just checking in - do you have any concerns or feedback?
>
> If there are no further objections from anyone, I’ll mark the FLIP as
> ready for voting.
>
>
> Best,
> Weiqing
>
>
> On Mon, Jul 14, 2025 at 9:10 PM Weiqing Yang <yangweiqing...@gmail.com>
> wrote:
>
>> Hi Xuyang,
>>
>> Thank you for reviewing the proposal!
>>
>> I’m planning to use: *udf.metrics.process-time* and
>> *udf.metrics.exception-count*. These follow the naming convention used
>> in Flink (e.g., RocksDB native metrics
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics>).
>> I’ve added these names to the proposal doc.
>>
>> Alternatively, I also considered *metrics.udf.process-time.enabled* and
>> *metrics.udf.exception-count.enabled*.
>>
>> Happy to hear any feedback on which style might be more appropriate.
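
For concreteness, here is a hypothetical sketch of how the first naming style might look in a Flink configuration file. The option names are taken from this thread; whether they are boolean on/off toggles, and what their defaults would be, is not settled here and is purely an assumption:

```yaml
# Hypothetical flink-conf.yaml fragment using the first naming style.
# Value type (assumed boolean switches) and defaults are not final.
udf.metrics.process-time: true
udf.metrics.exception-count: true
```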
>>
>>
>> Best,
>> Weiqing
>>
>> On Mon, Jul 14, 2025 at 2:55 AM Xuyang <xyzhong...@163.com> wrote:
>>
>>> Hi, Weiqing.
>>>
>>> Thanks for driving this improvement. I just have one question: I noticed
>>> that a new configuration option is introduced in this FLIP, but I don't
>>> see its name. Could you please include the full name of this configuration
>>> option (similar to the other names in MetricOptions)?
>>>
>>>
>>>
>>>
>>> --
>>>
>>>     Best!
>>>     Xuyang
>>>
>>>
>>>
>>>
>>>
>>> On 2025-07-13 12:03:59, "Weiqing Yang" <yangweiqing...@gmail.com> wrote:
>>> >Hi Alan,
>>> >
>>> >Thanks for reviewing the proposal and for highlighting the ASYNC_TABLE
>>> work.
>>> >
>>> >Yes, I’ve updated the proposal to cover both ASYNC_SCALAR and
>>> ASYNC_TABLE.
>>> >For async UDFs, the plan is to instrument both the invokeAsync() call
>>> and
>>> >the async callback handler to measure the full end-to-end latency until
>>> the
>>> >result or error is returned from the future.
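
The end-to-end timing approach described above could be sketched as follows. This is a minimal illustration, not the actual Flink instrumentation: the class and counter names are hypothetical stand-ins, and the real implementation would report through the operator's metric group rather than static fields.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncUdfTiming {
    // Hypothetical stand-ins for the proposed UDF metric counters.
    public static final AtomicLong processTimeNanos = new AtomicLong();
    public static final AtomicLong exceptionCount = new AtomicLong();

    // Wrap the future returned by an async UDF so the timer covers the
    // full lifetime of the call, not just the invokeAsync() submission.
    public static <T> CompletableFuture<T> timed(CompletableFuture<T> future) {
        final long start = System.nanoTime();
        return future.whenComplete((result, error) -> {
            processTimeNanos.addAndGet(System.nanoTime() - start);
            if (error != null) {
                exceptionCount.incrementAndGet();
            }
        });
    }

    public static void main(String[] args) {
        // Simulate an async UDF call that completes successfully; join()
        // ensures the whenComplete callback has run before we read counters.
        timed(CompletableFuture.supplyAsync(() -> 42)).join();
        System.out.println("recorded nanos > 0: " + (processTimeNanos.get() > 0));
        System.out.println("exceptions: " + exceptionCount.get());
    }
}
```

Because `whenComplete` fires on both normal and exceptional completion, a single wrapper captures latency and exception counts regardless of how the future resolves.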
>>> >
>>> >Let me know if you have any further questions or suggestions.
>>> >
>>> >Best,
>>> >Weiqing
>>> >
>>> >On Thu, Jul 10, 2025 at 4:15 PM Alan Sheinberg
>>> ><asheinb...@confluent.io.invalid> wrote:
>>> >
>>> >> Hi Weiqing,
>>> >>
>>> >> From your doc, the entrypoint for UDF calls in the codegen is
>>> >> ExprCodeGenerator which should invoke BridgingSqlFunctionCallGen,
>>> which
>>> >> could be instrumented with metrics.  This works well for synchronous
>>> calls,
>>> >> but what about ASYNC_SCALAR and the soon to be merged ASYNC_TABLE (
>>> >> https://github.com/apache/flink/pull/26567)?  Timing metrics would
>>> only
>>> >> account for what it takes to call invokeAsync, not for the result to
>>> >> complete (with a result or error from the future object).
>>> >>
>>> >> There are appropriate places which can handle the async callbacks,
>>> but they
>>> >> are in other locations.  Will you be able to support those as well?
>>> >>
>>> >> Thanks,
>>> >> Alan
>>> >>
>>> >> On Wed, Jul 9, 2025 at 7:52 PM Shengkai Fang <fskm...@gmail.com>
>>> wrote:
>>> >>
>>> >> > I just have some questions:
>>> >> >
>>> >> > 1. The current metrics hierarchy shows that the UDF metric group
>>> belongs
>>> >> to
>>> >> > the TaskMetricGroup. I think it would be better for the UDF metric
>>> group
>>> >> to
>>> >> > belong to the OperatorMetricGroup instead, because a UDF might be
>>> used by
>>> >> > multiple operators.
>>> >> > 2. What are the naming conventions for UDF metrics? Could you
>>> provide an
>>> >> > example? Does the metric name contain the UDF name?
>>> >> > 3. Why is the UDFExceptionCount metric introduced? If a UDF throws
>>> an
>>> >> > exception, the job fails immediately. Why do we need to track this
>>> value?
>>> >> >
>>> >> > Best
>>> >> > Shengkai
>>> >> >
>>> >> >
>>> >> > On Wed, Jul 9, 2025 at 12:59, Weiqing Yang <yangweiqing...@gmail.com> wrote:
>>> >> >
>>> >> > > Hi all,
>>> >> > >
>>> >> > > I’d like to initiate a discussion about adding UDF metrics.
>>> >> > >
>>> >> > > *Motivation*
>>> >> > >
>>> >> > > User-defined functions (UDFs) are essential for custom logic in
>>> Flink
>>> >> > jobs
>>> >> > > but often act as black boxes, making debugging and performance
>>> tuning
>>> >> > > difficult. When issues like high latency or frequent exceptions
>>> occur,
>>> >> > it's
>>> >> > > hard to pinpoint the root cause inside UDFs.
>>> >> > >
>>> >> > > Flink currently lacks built-in metrics for key UDF aspects such as
>>> >> > > per-record processing time or exception count. This limits
>>> >> observability
>>> >> > > and complicates:
>>> >> > >
>>> >> > >    - Debugging production issues
>>> >> > >    - Performance tuning and resource allocation
>>> >> > >    - Supplying reliable signals to autoscaling systems
>>> >> > >
>>> >> > > Introducing standard, opt-in UDF metrics will improve platform
>>> >> > > observability and overall health.
>>> >> > > Here’s the proposal document:
>>> >> > > https://docs.google.com/document/d/1ZTN_kSxTMXKyJcrtmP6I9wlZmfPkK8748_nA6EVuVA0/edit?tab=t.0#heading=h.ljww281maxj1
>>> >> > >
>>> >> > > Your feedback and ideas are welcome to refine this feature.
>>> >> > >
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Weiqing
>>> >> > >
>>> >> >
>>> >>
>>>
>>
