Hi, xiongraorao. Thanks for your question. We addressed this scalability scenario systematically in the design:
1. *Multi-layer capacity limits ensure bounded memory*:
   - At most *1,000 records* per subtask (BoundedSampleBuffer)
   - At most *5,000 total records* per response (the Coordinator applies proportional fair truncation)
   - At most *10,000 characters* per record (max-record-length; longer records are truncated)
   - At most *5 concurrent sampling rounds* per JM (excess requests are rejected immediately with TOO_MANY_CONCURRENT_ROUNDS)

   Worst-case estimate: 5 concurrent rounds × 5,000 records × 10KB = *250MB*. In practice, toString() output averages far less than 10KB, and the 5-round cap means only 5 vertices are being sampled at any given moment. Given that JMs are typically configured with 4–8GB of heap memory, this upper bound is safe.

2. *Guava Cache with expireAfterWrite*: The cache uses refresh-interval (default 60s) as its TTL, so entries expire automatically and there is no unbounded accumulation. Even if all 500 vertices have been sampled at some point, entries that receive no new requests are evicted after 60 seconds. In practice, only 3–5 vertices are typically being viewed in the WebUI at any given moment.

3. *Room to tighten further*: If the community feels it is necessary, we can add a rest.data-sampling.max-cached-vertices option (e.g., default 20) that uses Guava's maximumSize to cap the number of cached vertices. This would be trivial to add in the initial version.

4. *JM failover*: Sampling results are ephemeral diagnostic data and do not require persistence. Upon JM failover, all in-flight rounds' CompletableFutures fail naturally (the TM connections are severed), and the cache is destroyed along with the old JM instance. The new JM starts with a clean state; users simply click again to trigger a fresh sampling round. This is fully consistent with how FlameGraph behaves during JM failover, a pattern already validated in production.

熊饶饶 <[email protected]> wrote on Wed, Apr 8, 2026 at 17:03:

> Thanks for the flip. It is useful for users.
> I have only one question: JM Memory Pressure Under High-Concurrency
> Sampling — Could It Cause OOM in Large-Scale Jobs?
>
> > On Mar 24, 2026, at 12:24, Jiangang Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion on FLIP-570: Support Runtime Data
> > Sampling for Operators with WebUI Visualization [1].
> >
> > Inspecting intermediate data in a running Flink job is a common need
> > across development, data exploration, and troubleshooting. Today, the
> > only options are modifying the job (print() sink, log statements — all
> > require a restart) or deploying external infrastructure (extra Kafka
> > topics, debug sinks). Both are slow and disruptive for what is
> > essentially a "what does the data look like here?" question.
> >
> > FLIP-570 proposes native runtime data sampling, following the same
> > proven architecture pattern as FlameGraph (FLINK-13550). The key ideas:
> >
> > 1. On-demand, round-scoped sampling at the output of any job vertex,
> >    triggered via REST API without job restart or topology modification.
> > 2. A new "Data Sample" tab in the WebUI with auto-polling, subtask
> >    selector, and status-driven display.
> > 3. Minimal overhead: zero when disabled; ~1.6% for the lightest ETL
> >    workloads when enabled-idle; <0.5% for typical production workloads.
> > 4. Safety by default: disabled by default, with rate limiting, time
> >    budget, buffer caps, and round-scoped auto-disable.
> >
> > For more details, please refer to the FLIP [1].
> >
> > Looking forward to your feedback and thoughts!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-570%3A+Support+Runtime+Data+Sampling+for+Operators+with+WebUI+Visualization
> >
> > Best regards,
> > Jiangang Liu
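P.S. To make point 1 above concrete, here is a minimal, hypothetical sketch of a per-subtask bounded buffer with record-length truncation. The class name matches the FLIP's BoundedSampleBuffer, but the method names and the evict-oldest policy are my illustrative assumptions, not the actual implementation (which may, for instance, stop accepting records once full):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; method names and eviction policy are
// assumptions, not the FLIP's actual BoundedSampleBuffer.
class BoundedSampleBuffer {
    private final int maxRecords;       // e.g. 1,000 per subtask
    private final int maxRecordLength;  // e.g. 10,000 chars per record
    private final Deque<String> samples = new ArrayDeque<>();

    BoundedSampleBuffer(int maxRecords, int maxRecordLength) {
        this.maxRecords = maxRecords;
        this.maxRecordLength = maxRecordLength;
    }

    void add(String record) {
        // Cap per-record size so one huge toString() cannot blow up memory.
        if (record.length() > maxRecordLength) {
            record = record.substring(0, maxRecordLength);
        }
        // Keep at most maxRecords entries (here: drop the oldest).
        if (samples.size() == maxRecords) {
            samples.removeFirst();
        }
        samples.addLast(record);
    }

    List<String> snapshot() {
        return new ArrayList<>(samples);
    }
}
```

With both caps enforced locally on every subtask, the Coordinator-side aggregation never sees more than bounded input per sampling round.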
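And for points 2–3, a stdlib-only sketch of the expire-after-write semantics the design gets from Guava. With Guava itself this reduces to a single CacheBuilder call (shown in the comment); the hand-rolled class and the explicit time parameter below are purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Stdlib-only illustration of expire-after-write semantics. The actual
// design would simply use Guava:
//   CacheBuilder.newBuilder()
//       .expireAfterWrite(60, TimeUnit.SECONDS)  // refresh-interval as TTL
//       .maximumSize(20)  // hypothetical max-cached-vertices cap
//       .build();
class ExpiringVertexCache {
    private static final class Entry {
        final String samples;
        final long writtenAtMillis;
        Entry(String samples, long writtenAtMillis) {
            this.samples = samples;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;

    ExpiringVertexCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(String vertexId, String samples, long nowMillis) {
        cache.put(vertexId, new Entry(samples, nowMillis));
    }

    // Returns null (and drops the entry) once the TTL has elapsed, so
    // vertices that are no longer polled never accumulate without bound.
    String getIfPresent(String vertexId, long nowMillis) {
        Entry e = cache.get(vertexId);
        if (e == null) {
            return null;
        }
        if (nowMillis - e.writtenAtMillis >= ttlMillis) {
            cache.remove(vertexId);
            return null;
        }
        return e.samples;
    }
}
```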
