Hi, xiongraorao. Thanks for your question. We addressed this scalability scenario systematically in the design:
1. *Multi-layer capacity limits ensure bounded memory*:
   - At most *1,000 records* per subtask (BoundedSampleBuffer)
   - At most *5,000 total records* per response (the Coordinator applies proportional fair truncation)
   - At most *10,000 characters* per record (max-record-length; longer records are truncated)
   - At most *5 concurrent sampling rounds* per JM (excess requests are rejected immediately with TOO_MANY_CONCURRENT_ROUNDS)

   Worst-case estimate: 5 concurrent rounds × 5,000 records × 10KB = *250MB*. In practice, toString() output averages far less than 10KB, and the 5-round cap means only 5 vertices are being sampled at any given moment. Given that JMs are typically configured with 4–8GB of heap memory, this upper bound is safe.

2. *Guava Cache with expireAfterWrite*: The cache uses refresh-interval (default 60s) as its TTL, so entries expire automatically and there is no unbounded accumulation. Even if all 500 vertices have been sampled at some point, entries that receive no new requests are evicted after 60 seconds. In practice, only 3–5 vertices are typically being viewed in the WebUI at any given moment.

3. *Room to tighten further*: If the community feels it is necessary, we can add a rest.data-sampling.max-cached-vertices option (e.g., default 20) that uses Guava's maximumSize to cap the number of cached vertices. This would be trivial to add in the initial version.

4. *JM failover*: Sampling results are ephemeral diagnostic data and do not require persistence. Upon JM failover, all in-flight rounds' CompletableFutures fail naturally (the TM connections are severed), and the cache is destroyed along with the old JM instance. The new JM starts with a clean state; users simply click again to trigger a fresh sampling round. This is fully consistent with how FlameGraph behaves during JM failover, a pattern already validated in production.

熊饶饶 <[email protected]> wrote on Wed, Apr 8, 2026 at 17:03:

> Thanks for the flip. It is useful for users.
> I have only one question: JM Memory Pressure Under High-Concurrency
> Sampling — Could It Cause OOM in Large-Scale Jobs?
>
> > On Mar 24, 2026, at 12:24, Jiangang Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion on FLIP-570: Support Runtime Data
> > Sampling for Operators with WebUI Visualization [1].
> >
> > Inspecting intermediate data in a running Flink job is a common need
> > across development, data exploration, and troubleshooting. Today, the
> > only options are modifying the job (print() sink, log statements — all
> > require a restart) or deploying external infrastructure (extra Kafka
> > topics, debug sinks). Both are slow and disruptive for what is
> > essentially a "what does the data look like here?" question.
> >
> > FLIP-570 proposes native runtime data sampling, following the same
> > proven architecture pattern as FlameGraph (FLINK-13550). The key ideas:
> >
> > 1. On-demand, round-scoped sampling at the output of any job vertex,
> >    triggered via REST API without job restart or topology modification.
> > 2. A new "Data Sample" tab in the WebUI with auto-polling, subtask
> >    selector, and status-driven display.
> > 3. Minimal overhead: zero when disabled; ~1.6% for the lightest ETL
> >    workloads when enabled-idle; <0.5% for typical production workloads.
> > 4. Safety by default: disabled by default, with rate limiting, time
> >    budget, buffer caps, and round-scoped auto-disable.
> >
> > For more details, please refer to the FLIP [1].
> >
> > Looking forward to your feedback and thoughts!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-570%3A+Support+Runtime+Data+Sampling+for+Operators+with+WebUI+Visualization
> >
> > Best regards,
> > Jiangang Liu
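P.S. To make point 1 above concrete, here is a minimal, hypothetical sketch of a per-subtask bounded buffer with record-length truncation. The class name matches the FLIP's BoundedSampleBuffer, but the method names and the evict-oldest policy are my illustrative assumptions, not the actual implementation (which may, for instance, stop accepting records once full):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; method names and eviction policy are
// assumptions, not the FLIP's actual BoundedSampleBuffer.
class BoundedSampleBuffer {
    private final int maxRecords;       // e.g. 1,000 per subtask
    private final int maxRecordLength;  // e.g. 10,000 chars per record
    private final Deque<String> samples = new ArrayDeque<>();

    BoundedSampleBuffer(int maxRecords, int maxRecordLength) {
        this.maxRecords = maxRecords;
        this.maxRecordLength = maxRecordLength;
    }

    void add(String record) {
        // Cap per-record size so one huge toString() cannot blow up memory.
        if (record.length() > maxRecordLength) {
            record = record.substring(0, maxRecordLength);
        }
        // Keep at most maxRecords entries (here: drop the oldest).
        if (samples.size() == maxRecords) {
            samples.removeFirst();
        }
        samples.addLast(record);
    }

    List<String> snapshot() {
        return new ArrayList<>(samples);
    }
}
```

With both caps enforced locally on every subtask, the Coordinator-side aggregation never sees more than bounded input per sampling round.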
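And for points 2–3, a stdlib-only sketch of the expire-after-write semantics the design gets from Guava. With Guava itself this reduces to a single CacheBuilder call (shown in the comment); the hand-rolled class and the explicit time parameter below are purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Stdlib-only illustration of expire-after-write semantics. The actual
// design would simply use Guava:
//   CacheBuilder.newBuilder()
//       .expireAfterWrite(60, TimeUnit.SECONDS)  // refresh-interval as TTL
//       .maximumSize(20)  // hypothetical max-cached-vertices cap
//       .build();
class ExpiringVertexCache {
    private static final class Entry {
        final String samples;
        final long writtenAtMillis;
        Entry(String samples, long writtenAtMillis) {
            this.samples = samples;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;

    ExpiringVertexCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(String vertexId, String samples, long nowMillis) {
        cache.put(vertexId, new Entry(samples, nowMillis));
    }

    // Returns null (and drops the entry) once the TTL has elapsed, so
    // vertices that are no longer polled never accumulate without bound.
    String getIfPresent(String vertexId, long nowMillis) {
        Entry e = cache.get(vertexId);
        if (e == null) {
            return null;
        }
        if (nowMillis - e.writtenAtMillis >= ttlMillis) {
            cache.remove(vertexId);
            return null;
        }
        return e.samples;
    }
}
```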
