It is a very helpful feature in our production. But I have one question: could the volatile read on the hot path become a bottleneck in extreme-throughput scenarios?
Best,
xingsuo-zbz

On 2026-04-09 18:51:12, "Jiangang Liu" <[email protected]> wrote:
> Thanks, Look_Y_Y. Excellent question: the unreliability of toString() is
> precisely what we focused our defensive design around. We employ *three
> layers of defense-in-depth*:
>
> 1. *Smart detection + skip*: We use ClassValue<Boolean> for reflection
>    caching to *detect at class-load time* whether a class overrides
>    toString(). For classes that don't (which would output
>    ClassName@hashCode), we directly return a descriptive placeholder
>    such as [MyEvent - no toString() override], avoiding meaningless
>    output. This detection is O(1) and computed only once per class.
>
> 2. *Time budget mechanism*: The tostring-budget-ms configuration
>    (default 50ms/s) ensures that the *cumulative time* spent in
>    toString() calls does not exceed 50ms per second. Once the budget is
>    exhausted, remaining records within that second have their
>    toString() skipped and are marked as [toString() budget exceeded].
>    Even if one toString() call takes 200ms, it blocks the mailbox
>    thread at most once; no further calls are made for the rest of that
>    second.
>
> 3. *Complete exception isolation*: All toString() calls are wrapped in
>    try-catch. Exceptions are logged at DEBUG level, and the record
>    content is replaced with [toString() threw ExceptionType: message].
>    *Critically*, super.collectAndCheckIfChained(), the normal record
>    forwarding, *executes before the sampling logic* (forward-first). So
>    even if toString() throws an exception or triggers an OOM, the data
>    has already been forwarded downstream. Business processing is never
>    affected.
>
> 4. *Regarding BinaryRowData and other binary formats*: This falls under
>    the dedicated Table/SQL optimization, which we explicitly list as
>    the first item in Future Work (Schema-Aware Formatting). In the
>    initial version, toString() will output a hex digest with a type
>    hint; admittedly not user-friendly for SQL scenarios, but it
>    produces no errors or risks. This also leaves room for follow-up
>    community contributions.
>
> Best regards,
> Jiangang Liu
>
> On Thu, Apr 9, 2026 at 17:54, Look_Y_Y <[email protected]> wrote:
>
>> I like the design. But one question: relying on `toString()` in
>> production raises concerns about exceptions, deadlocks, and
>> meaningless output.
>>
>> Relying on `toString()` for data display in production jobs raises
>> several concerns: (1) Many user-defined POJOs either don't override
>> `toString()` (showing `com.foo.MyEvent@3a1b2c`) or have buggy
>> implementations that could throw exceptions or even deadlock (e.g.,
>> circular references with lazy-loading ORMs). (2) Flink's internal
>> types like `BinaryRowData` produce unreadable binary dumps. (3)
>> Calling `toString()` on user objects on the *mailbox thread* means a
>> poorly implemented `toString()` could block record processing. How do
>> you ensure this doesn't become a reliability risk in production?
>>
>>> On Apr 9, 2026, at 17:27, Jiangang Liu <[email protected]> wrote:
>>>
>>> A detailed performance test is available here:
>>> https://docs.google.com/document/d/1NI5HCfnsWs9xyQ4MrdY6bY89phbegwS-JUPHA8kHKd4/edit?usp=sharing
>>>
>>> On Wed, Apr 8, 2026 at 17:36, Jiangang Liu <[email protected]> wrote:
>>>
>>>> Hi, xiongraorao. Thanks for your question. We addressed this
>>>> scalability scenario systematically in the design:
>>>>
>>>> 1. *Multi-layer capacity limits ensure bounded memory*:
>>>>
>>>>    - At most *1,000 records* per subtask (BoundedSampleBuffer)
>>>>    - At most *5,000 total records* per response (the Coordinator
>>>>      applies proportional fair truncation)
>>>>    - At most *10,000 characters* per record (max-record-length,
>>>>      truncated if exceeded)
>>>>    - At most *5 concurrent sampling rounds* per JM (excess requests
>>>>      are rejected immediately with TOO_MANY_CONCURRENT_ROUNDS)
>>>>
>>>>    Worst-case estimate: 5 concurrent rounds × 5,000 records × 10KB =
>>>>    *250MB*. In practice, toString() output averages far less than
>>>>    10KB, and 5 concurrent rounds means only 5 vertices are being
>>>>    sampled at any given moment. Given that JMs are typically
>>>>    configured with 4–8GB of heap memory, this upper bound is safe.
>>>>
>>>> 2. *Guava Cache with expireAfterWrite*: The cache uses
>>>>    refresh-interval (default 60s) as TTL, and entries expire
>>>>    automatically; there is no unbounded accumulation. Even if all
>>>>    500 vertices have been sampled at some point, caches with no new
>>>>    requests are GC'd after 60 seconds. In practice, the number of
>>>>    vertices actively viewed in the WebUI at any given moment is
>>>>    typically 3–5.
>>>>
>>>> 3. *Room to tighten further*: If the community feels it's necessary,
>>>>    we can add a rest.data-sampling.max-cached-vertices configuration
>>>>    (e.g., default 20) using Guava's maximumSize to cap the number of
>>>>    cached vertices. This is trivial to add in the initial version.
>>>>
>>>> 4. *JM Failover*: Sampling results are ephemeral diagnostic data and
>>>>    do not require persistence. Upon JM failover, all in-flight
>>>>    rounds' CompletableFutures fail naturally (TM connections are
>>>>    severed), and the cache is destroyed along with the old JM
>>>>    instance. The new JM starts with a clean state; users simply
>>>>    click again to trigger a fresh sampling round. This is fully
>>>>    consistent with how FlameGraph behaves during JM failover, a
>>>>    pattern already validated in production.
>>>>
>>>> On Wed, Apr 8, 2026 at 17:03, 熊饶饶 <[email protected]> wrote:
>>>>
>>>>> Thanks for the FLIP. It is useful for users. I have only one
>>>>> question: could JM memory pressure under high-concurrency sampling
>>>>> cause OOM in large-scale jobs?
>>>>>
>>>>>> On Mar 24, 2026, at 12:24, Jiangang Liu <[email protected]> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I would like to start a discussion on FLIP-570: Support Runtime
>>>>>> Data Sampling for Operators with WebUI Visualization [1].
>>>>>>
>>>>>> Inspecting intermediate data in a running Flink job is a common
>>>>>> need across development, data exploration, and troubleshooting.
>>>>>> Today, the only options are modifying the job (print() sink, log
>>>>>> statements; both require a restart) or deploying external
>>>>>> infrastructure (extra Kafka topics, debug sinks). Both are slow
>>>>>> and disruptive for what is essentially a "what does the data look
>>>>>> like here?" question.
>>>>>>
>>>>>> FLIP-570 proposes native runtime data sampling, following the same
>>>>>> proven architecture pattern as FlameGraph (FLINK-13550). The key
>>>>>> ideas:
>>>>>>
>>>>>> 1. On-demand, round-scoped sampling at the output of any job
>>>>>>    vertex, triggered via REST API without job restart or topology
>>>>>>    modification.
>>>>>> 2. A new "Data Sample" tab in the WebUI with auto-polling, subtask
>>>>>>    selector, and status-driven display.
>>>>>> 3. Minimal overhead: zero when disabled; ~1.6% for the lightest
>>>>>>    ETL workloads when enabled-idle; <0.5% for typical production
>>>>>>    workloads.
>>>>>> 4. Safety by default: disabled by default, with rate limiting,
>>>>>>    time budget, buffer caps, and round-scoped auto-disable.
>>>>>>
>>>>>> For more details, please refer to the FLIP [1].
>>>>>>
>>>>>> Looking forward to your feedback and thoughts!
>>>>>>
>>>>>> [1]
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-570%3A+Support+Runtime+Data+Sampling+for+Operators+with+WebUI+Visualization
>>>>>>
>>>>>> Best regards,
>>>>>> Jiangang Liu
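[Editor's note] The ClassValue-based override detection and the try-catch isolation described in the thread above can be sketched roughly as follows. This is a minimal illustration under stated assumptions; the class and method names (ToStringSupport, safeToString) and the placeholder strings are hypothetical stand-ins, not the FLIP's actual code:

```java
// Sketch of two of the defense layers from the thread: detect once per
// class whether toString() is overridden, and isolate any exception it
// throws. All names here are illustrative.
public class ToStringSupport {
    // ClassValue computes the value lazily, once per class, and caches it
    // alongside the Class object itself, so later lookups are O(1).
    private static final ClassValue<Boolean> OVERRIDES_TO_STRING =
            new ClassValue<Boolean>() {
                @Override
                protected Boolean computeValue(Class<?> type) {
                    try {
                        // If toString() resolves to a declaring class other than
                        // Object, some class in the hierarchy overrode it.
                        return type.getMethod("toString").getDeclaringClass() != Object.class;
                    } catch (NoSuchMethodException e) {
                        return false; // unreachable: every class inherits toString()
                    }
                }
            };

    public static String safeToString(Object record) {
        if (record == null) {
            return "null";
        }
        Class<?> type = record.getClass();
        if (!OVERRIDES_TO_STRING.get(type)) {
            // Default toString() would print ClassName@hashCode;
            // emit a descriptive placeholder instead.
            return "[" + type.getSimpleName() + " - no toString() override]";
        }
        try {
            return record.toString();
        } catch (Throwable t) {
            // Exception isolation: never let a buggy toString() propagate.
            return "[toString() threw " + t.getClass().getSimpleName()
                    + ": " + t.getMessage() + "]";
        }
    }
}
```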

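[Editor's note] The cumulative per-second time budget described in the thread (the tostring-budget-ms configuration, default 50ms/s) can likewise be sketched. Only the configuration name comes from the thread; the class, method, and marker string below are hypothetical, and the real sampler may track the window differently:

```java
// Sketch of a per-second toString() time budget, assuming a simple
// fixed one-second window. Illustrative only.
public class ToStringBudget {
    private final long budgetNanosPerSecond;
    private long windowStartNanos;
    private long spentNanos;

    public ToStringBudget(long budgetMillisPerSecond) {
        this.budgetNanosPerSecond = budgetMillisPerSecond * 1_000_000L;
        this.windowStartNanos = System.nanoTime();
    }

    /** Formats the record only if budget remains in the current window. */
    public String format(Object record) {
        long now = System.nanoTime();
        if (now - windowStartNanos >= 1_000_000_000L) {
            // New one-second window: reset the cumulative spend.
            windowStartNanos = now;
            spentNanos = 0;
        }
        if (spentNanos >= budgetNanosPerSecond) {
            return "[toString() budget exceeded]";
        }
        long start = System.nanoTime();
        try {
            return String.valueOf(record);
        } finally {
            // Charge the call's duration against the window's budget. A single
            // slow call may overshoot, but it blocks the thread at most once:
            // every later call in this window is skipped.
            spentNanos += System.nanoTime() - start;
        }
    }
}
```

Note the deliberate property from the thread: the budget is checked before the call but charged after it, so one pathological 200ms toString() still runs once, and only subsequent records in that second are skipped.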
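[Editor's note] The per-subtask capacity limits from the scalability reply (BoundedSampleBuffer, at most 1,000 records per subtask, at most 10,000 characters per record via max-record-length) can be sketched as a capped buffer. The numbers mirror the thread; the eviction policy (drop oldest) and the truncation marker are assumptions, since the thread does not specify them:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of a bounded per-subtask sample buffer with a per-record
// character cap. Illustrative; not the FLIP's actual implementation.
public class BoundedSampleBuffer {
    private final int maxRecords;       // e.g. 1,000 per subtask
    private final int maxRecordLength;  // e.g. 10,000 characters
    private final Deque<String> samples = new ArrayDeque<>();

    public BoundedSampleBuffer(int maxRecords, int maxRecordLength) {
        this.maxRecords = maxRecords;
        this.maxRecordLength = maxRecordLength;
    }

    public void add(String record) {
        // Enforce the per-record character cap, marking truncation.
        if (record.length() > maxRecordLength) {
            record = record.substring(0, maxRecordLength) + "...(truncated)";
        }
        if (samples.size() == maxRecords) {
            samples.removeFirst(); // drop the oldest to stay bounded
        }
        samples.addLast(record);
    }

    public List<String> snapshot() {
        return new ArrayList<>(samples);
    }
}
```

With both caps enforced, the worst-case memory quoted in the thread follows directly: 5 concurrent rounds × 5,000 records per response × 10KB per record is roughly 250MB, regardless of job scale.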