A detail performance test is as follows:
https://docs.google.com/document/d/1NI5HCfnsWs9xyQ4MrdY6bY89phbegwS-JUPHA8kHKd4/edit?usp=sharing

Jiangang Liu <[email protected]> 于2026年4月8日周三 17:36写道:

> Hi, xiongraorao. Thanks for your question. We addressed this scalability
> scenario systematically in the design:
>
>    1. *Multi-layer capacity limits ensure bounded memory*:
>
>
>    -
>
>    At most *1,000 records* per subtask (BoundedSampleBuffer)
>    -
>
>    At most *5,000 total records* per response (Coordinator applies
>    proportional fair truncation)
>    -
>
>    At most *10,000 characters* per record (max-record-length, truncated
>    if exceeded)
>    -
>
>    At most *5 concurrent sampling rounds* per JM (excess requests are
>    rejected immediately with TOO_MANY_CONCURRENT_ROUNDS)
>
> Worst-case estimate: 5 concurrent rounds × 5,000 records × 10KB = *250MB*.
> In practice, toString() output averages far less than 10KB, and 5
> concurrent rounds means only 5 vertices are being sampled at any given
> moment. Given that JMs are typically configured with 4–8GB of heap memory,
> this upper bound is safe.
>
>    2.
>
>    *Guava Cache with expireAfterWrite*: The cache uses refresh-interval
>    (default 60s) as TTL and entries expire automatically. There is no
>    unbounded accumulation. Even if all 500 vertices have been sampled at some
>    point, caches with no new requests are GC'd after 60 seconds. In practice,
>    the number of vertices actively viewed in the WebUI at any given moment is
>    typically 3–5.
>    3.
>
>    *Room to tighten further*: If the community feels it's necessary, we
>    can add a rest.data-sampling.max-cached-vertices configuration (e.g.,
>    default 20) using Guava's maximumSize to cap the number of cached
>    vertices. This is trivial to add in the initial version.
>    4.
>
>    *JM Failover*: Sampling results are ephemeral diagnostic data and do
>    not require persistence. Upon JM failover, all in-flight rounds'
>    CompletableFutures fail naturally (TM connections are severed), and
>    the cache is destroyed along with the old JM instance. The new JM starts
>    with a clean state — users simply click again to trigger a fresh sampling
>    round. This is fully consistent with how FlameGraph behaves during JM
>    failover, a pattern already validated in production.
>
>
> 熊饶饶 <[email protected]> 于2026年4月8日周三 17:03写道:
>
>> Thanks for the flip. It is useful for users. I have only one question: JM
>> Memory Pressure Under High-Concurrency Sampling — Could It Cause OOM in
>> Large-Scale Jobs?
>>
>> > 2026年3月24日 12:24,Jiangang Liu <[email protected]> 写道:
>> >
>> > Hi everyone,
>> >
>> > I would like to start a discussion on FLIP-570: Support Runtime Data
>> > Sampling for Operators with WebUI Visualization [1].
>> >
>> > Inspecting intermediate data in a running Flink job is a common need
>> > across development, data exploration, and troubleshooting. Today, the
>> > only options are modifying the job (print() sink, log statements — all
>> > require a restart) or deploying external infrastructure (extra Kafka
>> > topics, debug sinks). Both are slow and disruptive for what is
>> > essentially a "what does the data look like here?" question.
>> >
>> > FLIP-570 proposes native runtime data sampling, following the same
>> > proven architecture pattern as FlameGraph (FLINK-13550). The key ideas:
>> >
>> >   1. On-demand, round-scoped sampling at the output of any job vertex,
>> >   triggered via REST API without job restart or topology modification.
>> >   2. A new "Data Sample" tab in the WebUI with auto-polling, subtask
>> >   selector, and status-driven display.
>> >   3. Minimal overhead: zero when disabled; ~1.6% for the lightest ETL
>> >   workloads when enabled-idle; <0.5% for typical production workloads.
>> >   4. Safety by default: disabled by default, with rate limiting, time
>> >   budget, buffer caps, and round-scoped auto-disable.
>> >
>> > For more details, please refer to the FLIP [1].
>> >
>> > Looking forward to your feedback and thoughts!
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-570%3A+Support+Runtime+Data+Sampling+for+Operators+with+WebUI+Visualization
>> >
>> > Best regards,
>> > Jiangang Liu
>>
>>

Reply via email to