Hey Danny,
Thanks for the feedback.

1. Instead of adding a FlameGraph specific REST API did you consider adding
a more general config API? Similar to that of the dynamical job
configuration [1] endpoint but for cluster configs instead of job? We could
add an allow list of supported config options and start with Flamegraph.
This would allow other configs to use the API in the future without adding
more APIs.

I think FLIP-530 works for job configs because a job can safely restart from
a checkpoint with the updated config, and the change is scoped to that single
job. Cluster configs, however, lack a restart boundary: changes must be
applied to shared infrastructure without downtime, and they affect all jobs.
I analyzed the cluster config categories; only a handful (like
rest.flamegraph.*, web.refresh-interval,
cluster.thread-dump.stacktrace-max-depth) are safe to update dynamically.
Most (e.g., rpc.port, taskmanager.memory.process.size, security.ssl.*) are
startup-only or deeply cached and would require service restarts.
The flamegraph configs, by contrast, are isolated and already dynamic in
design. Hence I couldn't find a common pattern across cluster configs.
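
To make the split above concrete, here is a rough sketch of what an allow-list
check for a general cluster config endpoint might look like. Everything in it
is hypothetical: the class and method names are made up for illustration and
are not Flink API, and the allow list only contains the keys mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Flink: allow list for dynamic cluster config updates. */
final class DynamicClusterConfigAllowList {

    // Keys named in this thread as safe to update at runtime.
    // A trailing ".*" means "any key under this prefix".
    private static final List<String> ALLOWED = List.of(
            "rest.flamegraph.*",
            "web.refresh-interval",
            "cluster.thread-dump.stacktrace-max-depth");

    /** Returns true if the key matches an allow-list entry exactly or by prefix. */
    static boolean isDynamicallyUpdatable(String key) {
        for (String pattern : ALLOWED) {
            if (pattern.endsWith(".*")) {
                // Drop only the "*" so the prefix keeps its trailing dot.
                String prefix = pattern.substring(0, pattern.length() - 1);
                if (key.startsWith(prefix)) {
                    return true;
                }
            } else if (pattern.equals(key)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the requested keys that are not on the allow list, in request order. */
    static List<String> rejectedKeys(Map<String, String> requested) {
        List<String> rejected = new ArrayList<>();
        for (String key : requested.keySet()) {
            if (!isDynamicallyUpdatable(key)) {
                rejected.add(key);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Map<String, String> update = new LinkedHashMap<>();
        update.put("rest.flamegraph.enabled", "true");
        update.put("rpc.port", "6124");
        update.put("taskmanager.memory.process.size", "4g");
        // Startup-only keys are reported back instead of being applied.
        System.out.println("Rejected: " + rejectedKeys(update));
    }
}
```

With only three entries, the allow list would be little more than the
flamegraph options in practice, which is why I leaned towards a dedicated
endpoint.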

2. nit: As for the UI, I would prefer for the settings to take up less
space. The new options are at the top of the view, even when not expanded.

Makes sense. Will update it.


On Tue, Aug 19, 2025 at 6:12 PM Poorvank Bhatia <puravbhat...@gmail.com>
wrote:

> Hello Gyula,
> Thanks for the suggestion! Enabling flamegraphs by default could indeed
> improve visibility, and in many stable environments, the passive overhead
> is minimal. However, based on our use cases, there are a few practical
> reasons we’ve opted to keep them disabled by default:
>
>    1. Memory & GC Behavior During Sampling: When the flamegraph tab is
>    opened in the UI, VertexThreadInfoTracker begins continuous stack trace
>    sampling every statsRefreshInterval (default 60s). Each sample consists
>    of ThreadInfoSample objects that hold full StackTraceElement[] arrays.
>    This sampling introduces non-trivial memory pressure, especially in
>    high-parallelism scenarios, because the stack trace data from all task
>    managers is stored on the JobManager heap within
>    VertexThreadInfoTracker. We observed that this structure accumulates
>    rapidly with high parallelism (>1000) and deep stack sampling, causing
>    memory issues in the JM, since entries are retained until
>    cleanUpInterval (default 10 minutes). The major issue is that a JM OOM
>    affects the availability of the entire cluster :(
>    2. Flink doesn't persist flamegraph data: Flamegraph samples are held
>    entirely in memory. For future iterations, we’re considering storing them
>    temporarily to local disk or external storage (but that requires
>    significant changes), which would decouple the UI tab from memory pressure.
>    3. Config values still come from RestOptions: Even if we enable
>    flamegraphs by default, the sampling parameters (e.g., numSamples,
>    stackDepth, delayBetweenSamples) are still initialized via RestOptions on
>    JobManager startup. Without dynamic REST reconfiguration (as proposed),
>    users would still need to restart the cluster to change them.
>
> Let me know if that makes sense.
>
> On Tue, Aug 19, 2025 at 4:59 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Hey!
>> Instead of adding new logic for this, can we make the flamegraphs enabled
>> by default?
>>
>> Based on my experience almost everyone wants it enabled, and it doesn't
>> seem to add any overhead when they are not actually checked on the UI
>>
>> Cheers,
>> Gyula
>>
>> On Tue, Aug 19, 2025 at 1:27 PM Danny Cranmer <dannycran...@apache.org>
>> wrote:
>>
>> > Hello Poorvank,
>> >
>> > Thanks for driving this, I can understand how dynamically enabling
>> > FlameGraphs can be powerful, so +1 on the general idea.
>> >
>> > 1. Instead of adding a FlameGraph specific REST API did you consider
>> adding
>> > a more general config API? Similar to that of the dynamical job
>> > configuration [1] endpoint but for cluster configs instead of job? We
>> could
>> > add an allow list of supported config options and start with Flamegraph.
>> > This would allow other configs to use the API in the future without
>> adding
>> > more APIs.
>> > 2. nit: As for the UI, I would prefer for the settings to take up less
>> > space. The new options are at the top of the view, even when not
>> expanded.
>> >
>> > Thanks,
>> > Danny
>> >
>> > [1]
>> >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
>> >
>> > On Tue, Aug 12, 2025 at 4:31 AM Poorvank Bhatia <puravbhat...@gmail.com
>> >
>> > wrote:
>> >
>> > > Hi all,
>> > >
>> > > I would like to open a discussion proposing the ability to enable
>> > > flamegraphs at runtime and make their configuration i.e number of
>> > samples,
>> > > delay between samples, and stack depth *dynamically adjustable via the
>> > Web
>> > > UI*, without requiring any job or cluster restarts.
>> > >
>> > > As of now, enabling flamegraphs requires setting
>> > > *rest.flamegraph.enabled=true* and restarting the Job. This is not
>> ideal
>> > > for debugging live issues, especially in production environments.
>> > >
>> > > I discussed this idea offline with Roman Khachatryan (author of
>> FLIP-530
>> > > <
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
>> > > >),
>> > > Rui Fan, and Arvid Heise. While Rui noted that this could potentially
>> > align
>> > > with FLIP-530’s direction, Roman confirmed that it’s better handled
>> as a
>> > > separate effort, since FLIP-530
>> > > <
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
>> > > >
>> > > is scoped to job-level config, whereas this proposal addresses
>> > > cluster-level observability via RestOptions.
>> > >
>> > > For Design Details, Please refer: Dynamic Flamegraph via UI
>> > > <
>> > >
>> >
>> https://docs.google.com/document/d/1A9fLFgXMGxQQn6X8WCv7mLL21AnLqrDFvLSHnUg8rLA/edit?tab=t.0#heading=h.s351fc464ma6
>> > > >
>> > >
>> > > I’ve attached a short demo to help visualize the proposed feature and
>> > > gather feedback. Demo
>> > > <
>> > >
>> >
>> https://drive.google.com/file/d/1iik6aOc2uc9sFlHFlT8YDX5TKFdoD15u/view?usp=sharing
>> > > >
>> > >
>> > > Looking forward to your thoughts.
>> > >
>> > > Regards,
>> > >
>> > > Poorvank Bhatia
>> > >
>> >
>>
>
