I like the design. But one question: Relying on `toString()` in Production — 
What About Exceptions, Deadlocks, and Meaningless Output?

> Relying on `toString()` for data display in production jobs raises several 
> concerns: (1) Many user-defined POJOs either don't override `toString()` 
> (showing `com.foo.MyEvent@3a1b2c`) or have buggy implementations that could 
> throw exceptions or even deadlock (e.g., circular references with 
> lazy-loading ORMs). (2) Flink's internal types like `BinaryRowData` produce 
> unreadable binary dumps. (3) Calling `toString()` on user objects on the 
> **mailbox thread** means a poorly implemented `toString()` could block record 
> processing. How do you ensure this doesn't become a reliability risk in 
> production?

> 2026年4月9日 17:27,Jiangang Liu <[email protected]> 写道:
> 
> A detail performance test is as follows:
> https://docs.google.com/document/d/1NI5HCfnsWs9xyQ4MrdY6bY89phbegwS-JUPHA8kHKd4/edit?usp=sharing
> 
> Jiangang Liu <[email protected]> 于2026年4月8日周三 17:36写道:
> 
>> Hi, xiongraorao. Thanks for your question. We addressed this scalability
>> scenario systematically in the design:
>> 
>>   1. *Multi-layer capacity limits ensure bounded memory*:
>> 
>> 
>>   -
>> 
>>   At most *1,000 records* per subtask (BoundedSampleBuffer)
>>   -
>> 
>>   At most *5,000 total records* per response (Coordinator applies
>>   proportional fair truncation)
>>   -
>> 
>>   At most *10,000 characters* per record (max-record-length, truncated
>>   if exceeded)
>>   -
>> 
>>   At most *5 concurrent sampling rounds* per JM (excess requests are
>>   rejected immediately with TOO_MANY_CONCURRENT_ROUNDS)
>> 
>> Worst-case estimate: 5 concurrent rounds × 5,000 records × 10KB = *250MB*.
>> In practice, toString() output averages far less than 10KB, and 5
>> concurrent rounds means only 5 vertices are being sampled at any given
>> moment. Given that JMs are typically configured with 4–8GB of heap memory,
>> this upper bound is safe.
>> 
>>   2.
>> 
>>   *Guava Cache with expireAfterWrite*: The cache uses refresh-interval
>>   (default 60s) as TTL and entries expire automatically. There is no
>>   unbounded accumulation. Even if all 500 vertices have been sampled at some
>>   point, caches with no new requests are GC'd after 60 seconds. In practice,
>>   the number of vertices actively viewed in the WebUI at any given moment is
>>   typically 3–5.
>>   3.
>> 
>>   *Room to tighten further*: If the community feels it's necessary, we
>>   can add a rest.data-sampling.max-cached-vertices configuration (e.g.,
>>   default 20) using Guava's maximumSize to cap the number of cached
>>   vertices. This is trivial to add in the initial version.
>>   4.
>> 
>>   *JM Failover*: Sampling results are ephemeral diagnostic data and do
>>   not require persistence. Upon JM failover, all in-flight rounds'
>>   CompletableFutures fail naturally (TM connections are severed), and
>>   the cache is destroyed along with the old JM instance. The new JM starts
>>   with a clean state — users simply click again to trigger a fresh sampling
>>   round. This is fully consistent with how FlameGraph behaves during JM
>>   failover, a pattern already validated in production.
>> 
>> 
>> 熊饶饶 <[email protected]> 于2026年4月8日周三 17:03写道:
>> 
>>> Thanks for the flip. It is useful for users. I have only one question: JM
>>> Memory Pressure Under High-Concurrency Sampling — Could It Cause OOM in
>>> Large-Scale Jobs?
>>> 
>>>> 2026年3月24日 12:24,Jiangang Liu <[email protected]> 写道:
>>>> 
>>>> Hi everyone,
>>>> 
>>>> I would like to start a discussion on FLIP-570: Support Runtime Data
>>>> Sampling for Operators with WebUI Visualization [1].
>>>> 
>>>> Inspecting intermediate data in a running Flink job is a common need
>>>> across development, data exploration, and troubleshooting. Today, the
>>>> only options are modifying the job (print() sink, log statements — all
>>>> require a restart) or deploying external infrastructure (extra Kafka
>>>> topics, debug sinks). Both are slow and disruptive for what is
>>>> essentially a "what does the data look like here?" question.
>>>> 
>>>> FLIP-570 proposes native runtime data sampling, following the same
>>>> proven architecture pattern as FlameGraph (FLINK-13550). The key ideas:
>>>> 
>>>>  1. On-demand, round-scoped sampling at the output of any job vertex,
>>>>  triggered via REST API without job restart or topology modification.
>>>>  2. A new "Data Sample" tab in the WebUI with auto-polling, subtask
>>>>  selector, and status-driven display.
>>>>  3. Minimal overhead: zero when disabled; ~1.6% for the lightest ETL
>>>>  workloads when enabled-idle; <0.5% for typical production workloads.
>>>>  4. Safety by default: disabled by default, with rate limiting, time
>>>>  budget, buffer caps, and round-scoped auto-disable.
>>>> 
>>>> For more details, please refer to the FLIP [1].
>>>> 
>>>> Looking forward to your feedback and thoughts!
>>>> 
>>>> [1]
>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-570%3A+Support+Runtime+Data+Sampling+for+Operators+with+WebUI+Visualization
>>>> 
>>>> Best regards,
>>>> Jiangang Liu
>>> 
>>> 

Reply via email to